Chapter 8 Cleaning your data

We will now filter our data to remove any poor-quality reads.

First, set the paths for the filtered output files, which will be stored in a directory called filtered.

filtFs <- file.path(path.cut, "../filtered", basename(cutFs))
filtRs <- file.path(path.cut, "../filtered", basename(cutRs))

Now run filterAndTrim. This time we use the standard filtering parameters:

  • maxN = 0 After truncation, sequences with more than 0 Ns will be discarded (DADA2 requires that sequences contain no Ns)
  • truncQ = 2 Truncate reads at the first instance of a quality score less than or equal to 2
  • rm.phix = TRUE Discard reads that match against the phiX genome
  • maxEE = c(2, 2) After truncation, reads with more than 2 "expected errors" will be discarded
  • minLen = 60 Remove reads with length less than 60 (note these should have already been removed by cutadapt)
  • multithread = TRUE Input files are filtered in parallel

out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 2), 
                     truncQ = 2, minLen = 60, rm.phix = TRUE, compress = TRUE, 
                     multithread = TRUE)
out

Some samples have very low read numbers after this filtering step. These could be poor-quality samples, but this dataset also includes negative controls, which we would expect to contain zero or very few reads.
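To spot low-yield samples at a glance, it can help to summarise the proportion of reads retained per sample. The sketch below is illustrative: the matrix here is a hand-made stand-in for the object returned by filterAndTrim (which has columns reads.in and reads.out, one row per sample); in practice you would apply the same calculation to `out` from above.

```r
# Illustrative stand-in for the matrix returned by filterAndTrim;
# in practice, use `out` from above instead
out.example <- matrix(c(12000, 11500,
                         9800,  9100,
                           40,     3),   # e.g. a negative control
                      ncol = 2, byrow = TRUE,
                      dimnames = list(c("sampleA", "sampleB", "negctrl"),
                                      c("reads.in", "reads.out")))

# Percentage of reads surviving the filter, per sample
pct.kept <- round(100 * out.example[, "reads.out"] /
                        out.example[, "reads.in"], 1)

# Sort so the lowest-retention samples appear first
data.frame(out.example, pct.kept)[order(pct.kept), ]
```

Samples with both low read counts and low retention are candidates for removal; negative controls should naturally sit at the bottom of this table.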