Chapter 6 Identifying and removing primers

Now you will carry out primer removal. DADA2 requires the primers you used to be trimmed off the forward and reverse of your reads. We will use the software cutadapt for this from within R.

Read in your forward and reverse primer sequences for our MiFish-U primerset. The MiFish-U primers do not contain any degenerate bases but cutadapt is able to handle any degenerate bases you may have in primer sequences for your own projects.

FWD <- "GTCGGTAAAACTCGTGCCAGC"
REV <- "CATAGTGGGGTATCTAATCCCAGTTTG"

We have written the primer sequences out below with space dividers to make it easier to type in. Remember to remove the spaces when you type this into R!

Forward primer

GTC GGT AAA ACT CGT GCC AGC

Reverse primer

CAT AGT GGG GTA TCT AAT CCC AGT TTG

6.1 Primer orientation checking

It might be useful to check the orientation of the primers on our sequences so we know the best way to remove them. We know how our libraries were created and sequenced so we are expecting that the forward primer should be located in a forward orientation in the forward read and the reverse primer should be located in the forward orientation in the reverse read.

For your own data you might be less certain as to which orientation your primers are in your reads.

6.1.1 Primer orientations

We use the following code to make a list of all orientations of our primer sequences (forward, complement, reverse and reverse complement).

allOrients <- function(primer) {
  # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer)  # The Biostrings works w/ DNAString 
                                  #objects rather than character vectors
  orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), 
               RevComp = reverseComplement(dna))
  return(sapply(orients, toString))  # Convert back to character vector
}

FWD.orients <- allOrients(FWD)
FWD.orients # print all orientations of the forward primer to the console
REV.orients <- allOrients(REV)
REV.orients # print all orientations of the reverse primer to the console

6.1.2 N filter

It is a good idea to pre-filter your data before searching for the primer sequences to remove Ns (ambiguously called bases) from the dataset and improve mapping of the primers. To do this we can use the filterAndTrim function. For more information on filterAndTrim and the parameters you can specify type into the console.

?filterAndTrim

(Close the pop-up window).

For now we are only filtering Ns but we will come back to this function later after removing our primers. Specify the path to a new output directory to contain the N filtered files. We will call this directory filtN, and a copy of each of the forward and reverse fastq files with any 'N' containing sequences removed will be written here.

fnFs.filtN <- file.path(output.path, "filtN", basename(fnFs)) 
fnRs.filtN <- file.path(output.path, "filtN", basename(fnRs))

Then run filterAndTrim with maxN = 0 (sequences with more than 0 Ns will be discarded) and multithread=TRUE (all input files are filtered in parallel, which is faster).

filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)

For this function we have specified:

the path to our raw forward read fastqs fnFs
the path to write our N-trimmed forward fastqs fnFs.filtN
the path to our raw reverse read fatsqs fnRs
the path to write our N-trimmed reverse fastqs fnRs.filtN
and the function-specific options for maxN and multithread

This will take a few minutes to run. Wait for filterAndTrim to finish running before running the next step. This is indicated by the colour of the circle next to R on the right-hand side of the screen. Grey=running, clear=idle:

6.1.3 Primer orientations count

The following function will use all possible primer combinations to count the number of times a primer is found in the forward and reverse read in each orientation.

primerHits <- function(primer, fn) {
  # Counts number of reads in which the primer is found
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), 
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]), 
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]), 
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))

The [[1]] part of the rbind command runs the function on the first file in our list. To specify the second file you would change this to [[2]]. However, we do not need to run this command more than once (which is good as it can take a little while to run!) because we know that the libraries for each sample were made using the same protocol. Therefore, all the samples should look similar.

We can see that although most of our primers are found as expected in the forward orientation on our forward/reverse read in this sample, we have a few sequences that also contain the reverse complement. This is likely due to the fact that there is some variation in the length of this amplicon between species so in some cases the sequencing machine has cycled through the full region and into the the forward/reverse primer.

6.2 Cutadapt

Now that we know the orientation of our primers we can use the software cutadapt to remove the primers from our samples.

In the below part we will use the system2 command to allow us to run a shell command from within R.

See https://cutadapt.readthedocs.io/en/stable/ for more information on how to install and run cutadapt.

6.2.1 Cutadapt program path

Specify the path to the cutadapt software so R knows where to find and run it.

cutadapt <- "/pub14/tea/nsc206/mambaforge/envs/qiime2-2023.5/bin/cutadapt"

6.2.2 Output paths

Specify the path to the output files where we want to put the cutadapt output files.

path.cut <- file.path(output.path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

6.2.3 Primer orientations options

Here we specify the options needed by cutadapt in order to trim the forward orientation of the forward and reverse primer off the forward and reverse read and the reverse complement off the end of the forward and reverse reads.

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)

What do the -g -a -G and -A parameters mean? (Hint: use the following command to check the helpfile for cutadapt)

system2(cutadapt, args = "--help")

This should open the cutadapt help information in the Linux terminal session rather than Jupyter-Notebooks directly.

6.2.4 Other cutadapt options

In addition to trimming the primers from the reads we will also specify a couple of extra useful parameters.

--discard-untrimmed this tells cutadapt to discard any read where the primers haven't been trimmed off. This is especially important for our data as our files at the moment contain sequences amplified using both MiFish-U and 12S-V5 primer sets. We only want to keep sequences matching the MiFish-U primer set for this analysis.

--minimum-length 60 discard reads shorter than 60bp. This will remove unexpected short reads and help speed up further analysis.

6.2.5 Run cutadapt

Run cutadapt. (Note. The cutadapt step is time intensive so this might take a little while to run, probably about 15 minutes, and a lot of output will be written to the Linux terminal screen while it is running).

for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, 
                             # -n 2 required to remove FWD and REV
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             fnFs.filtN[i], fnRs.filtN[i], # input files
                             "--discard-untrimmed",
                             "--minimum-length 60"))
}

6.2.6 Primer check

We can now check whether all the primers have been removed using the primerHits function we specified earlier.

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), 
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]), 
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]), 
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))

We now have no primers remaining in our file.