Chapter 20 Stitching read pairs
Longer k-mers generally perform better for assemblies. However, our maximum read length is 100 bp, so we are limited to a maximum k-mer length of 99 bp. Thankfully, we can use even longer k-mers if we stitch our read pairs together.
Note: This method will not work if your reads have no overlap. If you are not sure whether your reads overlap, ask the team who sequenced them.
A read pair consists of two sequences, one read from each end of a fragment of DNA (or RNA). If the two reads meet and overlap in the middle of the fragment, they will share a region of matching sequence once one read is reverse-complemented. We can use this overlap to merge the two reads together (see next image).
First, we obtain our forward and reverse reads, derived from different ends of the same fragment. Second, we look for sufficient overlap between the 3' ends of our sequences. Third, if there is sufficient overlap, we combine, or stitch, the two reads together to form one long sequence.
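The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not FLASH's actual algorithm: it ignores base qualities, and the function names and the `min_overlap` and `max_mismatch_frac` parameters are hypothetical choices made for this example.

```python
# Illustrative sketch only: real stitchers such as FLASH also score
# candidate overlaps and use base qualities to resolve mismatches.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def stitch(forward, reverse, min_overlap=10, max_mismatch_frac=0.0):
    """Merge a read pair if the 3' ends overlap; return None otherwise.

    min_overlap and max_mismatch_frac are hypothetical parameters
    chosen for this sketch.
    """
    # Put the reverse read on the same strand as the forward read.
    rev_rc = reverse_complement(reverse)
    longest = min(len(forward), len(rev_rc))
    # Try the longest candidate overlap first, then shrink it.
    for olen in range(longest, min_overlap - 1, -1):
        tail = forward[-olen:]   # 3' end of the forward read
        head = rev_rc[:olen]     # start of the reverse-complemented read
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            # Keep the forward read and append the non-overlapping part.
            return forward + rev_rc[olen:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTACCGATAGGCTAATCG"       # made-up 30 bp fragment
    forward = fragment[:20]                            # read from one end
    reverse = reverse_complement(fragment[-20:])       # read from the other end
    print(stitch(forward, reverse, min_overlap=5))
```

Here the two 20 bp reads overlap by 10 bp, so the stitched read recovers the full 30 bp fragment. A pair with no overlap returns `None`, which corresponds to the "not combined" case.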
Once we have longer stitched reads, we can increase the k-mer length for our assembly.
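To see why read length caps the k-mer length, recall that a read of length L contains L - k + 1 k-mers. The short sketch below makes this concrete. The 180 bp stitched read length and the k-mer length of 127 are hypothetical values chosen for illustration.

```python
def kmer_count(read_length, k):
    """Number of k-mers in one read: L - k + 1, or 0 if the read is too short."""
    return max(read_length - k + 1, 0)


def kmers(seq, k):
    """List every k-mer in a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # A 100 bp read contributes very few 99-mers and no longer k-mers.
    print(kmer_count(100, 99))    # 2
    print(kmer_count(100, 127))   # 0
    # A hypothetical 180 bp stitched read supports much longer k-mers.
    print(kmer_count(180, 127))   # 54
    print(kmers("ACGTA", 3))      # ['ACG', 'CGT', 'GTA']
```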
Several tools can be used to stitch reads (e.g. PEAR, PANDAseq), but today we will use one called FLASH.
20.1 FLASH: run
Make a new output directory for the stitched reads and run FLASH:
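A minimal sketch of how this step could be scripted is shown below. The input file names and the output directory name are hypothetical placeholders, so substitute your own paths. The `K1` prefix is assumed from the output file names in this chapter. The options used are FLASH's `-d` (output directory), `-o` (output prefix), `-z` (gzip-compress the output) and `-t` (threads).

```python
import subprocess
from pathlib import Path


def build_flash_command(reads_1, reads_2, out_dir, prefix="K1", threads=1):
    """Build the FLASH command line as a list of arguments."""
    return [
        "flash",
        "-d", str(out_dir),    # directory for the output files
        "-o", prefix,          # prefix of the output file names
        "-z",                  # write gzip-compressed fastq files
        "-t", str(threads),    # number of worker threads
        str(reads_1),
        str(reads_2),
    ]


def run_flash(reads_1, reads_2, out_dir, prefix="K1", threads=1):
    """Create the output directory, then run FLASH (must be on the PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    command = build_flash_command(reads_1, reads_2, out_dir, prefix, threads)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical paths: replace with your own read files and directory.
    print(" ".join(build_flash_command("K1_R1.fastq", "K1_R2.fastq", "stitched")))
```

Building the command as a list, rather than as one string, avoids shell quoting problems if a path contains spaces.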
20.2 FLASH: output
Once FLASH has finished running, it will display on screen how well the stitching process went. In this case, only a small proportion of read pairs were combined. Have a look at which files have been generated.
We have three new fastq.gz files: one containing the stitched reads (K1.extendedFrags.fastq.gz) and two containing the reads from pairs that could not be combined (K1.notCombined_1.fastq.gz and K1.notCombined_2.fastq.gz).
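The proportion of pairs that were combined can also be checked from the output files themselves. The sketch below assumes standard four-line FASTQ records and counts the records in each file. The function names are hypothetical and the paths in the usage example should be adjusted to your own output directory.

```python
import gzip


def count_fastq_records(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def percent_combined(stitched_path, not_combined_1_path):
    """Percentage of read pairs that were stitched.

    Each stitched read represents one pair, and each read in the
    notCombined_1 file represents one pair that was left unstitched.
    """
    combined = count_fastq_records(stitched_path)
    uncombined = count_fastq_records(not_combined_1_path)
    total = combined + uncombined
    return 100.0 * combined / total if total else 0.0


if __name__ == "__main__":
    # Run from the directory holding the FLASH output files.
    share = percent_combined("K1.extendedFrags.fastq.gz",
                             "K1.notCombined_1.fastq.gz")
    print(f"{share:.2f}% of pairs combined")
```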
We can also see what the new read lengths are:
Scroll down with the down arrow key and you will see a histogram showing the proportion of stitched reads at each length.
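The same kind of length distribution can be recomputed directly from the stitched reads file. The sketch below is a minimal example that assumes four-line FASTQ records. The function names and the bar width are hypothetical choices for illustration.

```python
import gzip
from collections import Counter


def length_histogram(path):
    """Tally read lengths in a gzipped FASTQ file (4 lines per record)."""
    lengths = Counter()
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of every four-line record.
            if line_number % 4 == 1:
                lengths[len(line.strip())] += 1
    return lengths


def print_histogram(lengths, width=50):
    """Print one row per read length with its share of all reads."""
    total = sum(lengths.values())
    for length in sorted(lengths):
        share = lengths[length] / total
        bar = "*" * round(share * width)
        print(f"{length:>5} {share:7.2%} {bar}")


if __name__ == "__main__":
    # Run from the directory holding the FLASH output files.
    print_histogram(length_histogram("K1.extendedFrags.fastq.gz"))
```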