Chapter 6 Identifying and removing primers
Now you will carry out primer removal. DADA2 requires the primers you used to be trimmed off the forward and reverse of your reads. We will use the software cutadapt for this from within R.
Read in your forward and reverse primer sequences for our MiFish-U primerset. The MiFish-U primers do not contain any degenerate bases but cutadapt is able to handle any degenerate bases you may have in primer sequences for your own projects.
We have written the primer sequences out below with space dividers to make it easier to type in. Remember to remove the spaces when you type this into R!
Forward primer
- GTC GGT AAA ACT CGT GCC AGC
Reverse primer
- CAT AGT GGG GTA TCT AAT CCC AGT TTG
6.1 Primer orientation checking
It might be useful to check the orientation of the primers on our sequences so we know the best way to remove them. We know how our libraries were created and sequenced so we are expecting that the forward primer should be located in a forward orientation in the forward read and the reverse primer should be located in the forward orientation in the reverse read.
For your own data you might be less certain as to which orientation your primers are in your reads.
We use the following code to make a list of all orientations of our primer sequences (forward, complement, reverse and reverse completment).
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer) # The Biostrings works w/ DNAString
#objects rather than character vectors
orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna),
RevComp = reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
FWD.orients # print all orientations of the forward primer to the console
REV.orients <- allOrients(REV)
REV.orients # print all orientations of the reverse primer to the console
It is a good idea to pre-filter your data before searching for the primer sequences to remove Ns (ambiguously called bases) from the dataset and improve mapping of the primers. To do this we can use the filterAndTrim
function. For more information on filterAndTrim
and the parameters you can specify type into the console.
(Close the pop-up window).
For now we are only filtering Ns but we will come back to this function later after removing our primers. Specify the path to a new output directory to contain the N filtered files. We will call this directory filtN
, and a copy of each of the forward and reverse fastq files with any 'N' containing sequences removed will be written here.
fnFs.filtN <- file.path(output.path, "filtN", basename(fnFs))
fnRs.filtN <- file.path(output.path, "filtN", basename(fnRs))
Then run filterAndTrim
with maxN = 0
(sequences with more than 0 Ns will be discarded) and multithread=TRUE
(all input files are filtered in parallel, which is faster).
For this function we have specified:
- the path to our raw forward read fatsqs
fnFs
- the path to write our N-trimmed forward fastqs
fnFs.filtN
- the path to our raw reverse read fatsqs
fnRs
- the path to write our N-trimmed reverse fastqs
fnRs.filtN
- and the function-specific options for
maxN
andmultithread
This will take a few minutes to run. Wait for filterAndTrim
to finish running before running the next step. This is indicated by the colour of the circle next to R
on the right-hand side of the screen. Grey=running, clear=idle:

The following function will use all possible primer combinations to count the number of times a primer is found in the forward and reverse read in each orientation.
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
The [[1]]
part of the rbind command runs the function on the first file in our list. To specify the second file you would change this to [[2]]
. However, we do not need to run this command more than once (which is good as it can take a little while to run!) because we know that the libraries for each sample were made using the same protocol. Therefore, all the samples should look similar.
We can see that although most of our primers are found as expected in the forward orientation on our forward/reverse read in this sample, we have a few sequences that also contain the reverse complement. This is likely due to the fact that there is some variation in the length of this amplicon between species so in some cases the sequencing machine has cycled through the full region and into the the forward/reverse primer.
6.2 Cutadapt
Now we know the orientation of our primers we can use the software cutadapt to remove the primers from our samples.
See https://cutadapt.readthedocs.io/en/stable/ for more information on how to install and run cutadapt.
Specify the path to the cutadapt software so R knows where to find and run it.
We can use the system2
command to run a shell command from within R.
Specify the path to the output files where we want to put the cutadapt output files.
path.cut <- file.path(output.path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))
Here we specify the options needed by cutadapt in order to trim the forward orientation of the forward and reverse primer off the forward and reverse read and the reverse complement off the end of the forward and reverse reads.
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
What do the -g
-a
-G
and -A
parameters mean? (Hint: use the following command to check the helpfile for cutadapt)
This should open the cutadapt help information in the Linux terminal session rather than Jupyter-Notebooks directly.
In addition to trimming the primers from the reads we will also specify a couple of extra useful parameters.
--discard-untrimmed
this tells cutadapt to discard any read where the primers haven't been trimmed off. This is especially important for our data as our files at the moment contain sequences amplified using both MiFish-U and 12S-V5 primer sets. We only want to keep sequences matching the MiFish-U primer set for this analysis.
--minimum-length 60
discard reads shorter than 60bp. This will remove unexpected short reads and help speed up further analysis.
Run cutadapt. (Note. The cutadapt step is time intensive so this might take a little while to run, probably about 15 minutes, and a lot of output will be written to the Linux terminal screen while it is running).
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,
# -n 2 required to remove FWD and REV
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
fnFs.filtN[i], fnRs.filtN[i], # input files
"--discard-untrimmed",
"--minimum-length 60"))
}
We can now check whether all the primers have been removed using the primerHits function we specified earlier.
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
We now have no primers remaining in our file.