Chapter 20 Stitching read pairs

Longer k-mers generally perform better for assemblies. However, our maximum read length is 100 bp, so we are limited to a maximum k-mer length of 99 bp. Thankfully, we can use even longer k-mers if we stitch our read pairs together.

Note: This method will not work if your reads do not overlap. If you are not sure whether your reads overlap, ask the team who sequenced them.

A read pair consists of two sequences, one read from each end of a fragment of DNA (or RNA). If the two sequences meet and overlap in the middle of the fragment, they share a region of matching sequence (once the reverse read has been reverse complemented). We can use this shared region to merge the two reads together (see next image).

First, we obtain our forward and reverse reads, derived from different ends of the same fragment. Second, we look for sufficient overlap between the 3' ends of our sequences. Third, if there is sufficient overlap, we combine, or stitch, the two reads together to form one long sequence.
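The three steps above can be sketched in a few lines of shell. This is a toy illustration only, not how FLASH is implemented: it requires the overlap to match exactly and ignores quality scores, whereas real stitching tools tolerate some mismatches. The function names `revcomp` and `stitch`, the example sequences, and the minimum overlap argument are all made up for this sketch.

```shell
# Reverse complement a DNA sequence (the reverse read comes from the opposite strand)
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Stitch a forward read ($1) and a reverse read ($2).
# $3 is a hypothetical minimum overlap length (10 if not given).
stitch() {
    fwd=$1
    rc=$(revcomp "$2")
    min=${3:-10}
    # The overlap cannot be longer than the shorter of the two sequences
    len=${#fwd}
    if [ "${#rc}" -lt "$len" ]; then len=${#rc}; fi
    # Try the longest overlap first, then shrink one base at a time
    while [ "$len" -ge "$min" ]; do
        prefix=$(printf '%s' "$rc" | cut -c1-"$len")
        case $fwd in
            *"$prefix")
                # End of forward read matches start of reverse complement: merge
                rest=$(printf '%s' "$rc" | cut -c"$((len + 1))"-)
                printf '%s%s\n' "$fwd" "$rest"
                return 0 ;;
        esac
        len=$((len - 1))
    done
    return 1   # no sufficient overlap, so the pair stays unstitched
}

stitch AAAACCCCGGGG AAAATTTTCCCC 4   # prints AAAACCCCGGGGAAAATTTT
```

Here the two 12 bp reads overlap by 4 bp (GGGG), so the stitched sequence is 12 + 12 - 4 = 20 bp long.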

Once we have longer stitched reads, we can increase the k-mer length for our assembly.
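To see why read length limits the k-mer length, note that a read of length L contains L - k + 1 k-mers. The sketch below works through that arithmetic. The function name `kmers_per_read` and the 180 bp stitched read length are hypothetical values chosen for illustration, not results from our data.

```shell
# Number of k-mers contained in one read: read length - k + 1
# $1 = read length, $2 = k-mer length
kmers_per_read() {
    if [ "$2" -gt "$1" ]; then
        echo 0   # k is longer than the read, so no k-mer fits
    else
        echo $(( $1 - $2 + 1 ))
    fi
}

kmers_per_read 100 99    # unstitched 100 bp read at k = 99: prints 2
kmers_per_read 180 99    # hypothetical 180 bp stitched read at k = 99: prints 82
kmers_per_read 180 127   # the same stitched read at a longer k: prints 54
```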

Several pieces of software can be used to stitch reads (e.g. PEAR, PANDAseq), but today we will use one called FLASH.

20.1 FLASH: Run

Make a new output directory for the stitched reads and run FLASH:

#Change directory to home
cd ~
#Make and move into new directory
mkdir 5-Stitched
cd 5-Stitched
#Run flash
flash -o K1 -z -t 12 -d . \
../2-Trimmed/K1_R1.fq.gz ../2-Trimmed/K1_R2.fq.gz

Parameters

  • -o: Sets the prefix of the output files.
  • -z: Compress the output files with gzip (gzipped input is detected automatically).
  • -t: Number of threads to use.
  • -d: The directory in which the output files will be placed.
  • The last two arguments, which take no flag, are the forward and reverse read files to stitch.
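What counts as "sufficient overlap" can also be set explicitly. As a hedged sketch, the snippet below builds the same command with FLASH's -m (minimum overlap) and -M (maximum overlap) options added. The values 10 and 65 are assumed here to match FLASH's documented defaults, so check flash --help for your installed version before relying on them. The variable names are made up for this example, and the command is only printed, not run.

```shell
# Overlap limits in bp; assumed to match FLASH's defaults (check: flash --help)
min_overlap=10
max_overlap=65

# Build the command as a string so it can be inspected before running
flash_cmd="flash -o K1 -z -t 12 -d . -m $min_overlap -M $max_overlap ../2-Trimmed/K1_R1.fq.gz ../2-Trimmed/K1_R2.fq.gz"

# Print the command; nothing is executed here
echo "$flash_cmd"

# To actually run it, uncomment the next line (this overwrites existing K1 output)
# $flash_cmd
```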

20.2 FLASH: Output

Once FLASH has finished running, it prints a summary of how well the stitching went. In this case, a low number of reads were combined. Have a look at which files have been generated.

ls

We have three new fastq.gz files: one containing the stitched reads (K1.extendedFrags.fastq.gz) and two containing the reads from pairs that could not be combined (K1.notCombined_1.fastq.gz and K1.notCombined_2.fastq.gz).
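If you want to confirm the numbers FLASH reported on screen, you can count the reads in each output file. The sketch below assumes standard four-line FASTQ records. The function name `count_reads` is made up for this example.

```shell
# Count the reads in a gzipped FASTQ file (each read occupies 4 lines)
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Report the read count for each FLASH output file that exists
for f in K1.extendedFrags.fastq.gz K1.notCombined_1.fastq.gz K1.notCombined_2.fastq.gz; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_reads "$f")"
    fi
done
```

The two notCombined files should contain the same number of reads, since each holds one read from every unstitched pair.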

We can also see what the new read lengths are:

less K1.histogram

Scroll down with the down arrow key and you will see a histogram showing the proportion of stitched reads at each length. Press q to exit less.
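FLASH also writes a numeric version of the histogram, K1.hist. As a cross-check on what you read off the visual histogram, the sketch below finds the most common stitched length. It assumes K1.hist has two whitespace-separated columns (read length, then count), so look at the file first to confirm. The function name `mode_length` is made up for this example.

```shell
# Print the length with the highest count from a two-column histogram file
# Assumed columns: 1 = stitched read length, 2 = number of reads
mode_length() {
    awk '$2 > max { max = $2; len = $1 } END { print len "\t" max }' "$1"
}

# Only run on the FLASH output if it is present in the current directory
if [ -f K1.hist ]; then
    mode_length K1.hist
fi
```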

20.3 FLASH: MCQs

  1. What length has the highest proportion of stitched reads?
  2. What length has the lowest proportion of stitched reads?