Chapter 21 MEGAHIT

We will use our stitched and unstitched reads to produce an assembly with MEGAHIT.

21.1 MEGAHIT

Create a new directory to store our assembly in.

cd ..
mkdir 6-Assembly
cd 6-Assembly

Run the metagenome assembler MEGAHIT. We are using both the stitched reads and the read pairs that could not be stitched.

megahit \
-r ../5-Stitched/K1.extendedFrags.fastq.gz \
-1 ../5-Stitched/K1.notCombined_1.fastq.gz \
-2 ../5-Stitched/K1.notCombined_2.fastq.gz \
-o K1 \
-t 12 \
--k-list 29,49,69,89,109,129,149,169,189

Parameters

  • -r: Single-end reads to be used for assembly.
    • We are using our successfully stitched reads.
  • -1: Forward reads of paired end reads to be used for assembly.
    • We are using the reads that did not stitch as they still have useful information.
  • -2: Reverse reads of paired end reads to be used for assembly.
    • We are using the reads that did not stitch as they still have useful information.
  • -o: Output directory.
  • -t: Number of CPU threads to use.
  • --k-list: Comma-separated list of k-mer sizes to use, in order.

The k-mer list instructs MEGAHIT to build the assembly iteratively. It first generates an assembly using a k-mer size of 29 bp. When that is complete, it integrates the results into an assembly using a k-mer size of 49 bp, and so on up to a final iteration using a k-mer size of 189 bp. Small k-mers help recover low-coverage regions and large k-mers help resolve repeats, so this wide range of k-mer lengths should give us a good assembly from our data. However, it may take a while to run. This might be a good time to read on or take a break whilst the command runs.
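The k-mer list above is a regular series (start at 29, add 20 each time, stop at 189), so it does not have to be typed by hand. Below is a minimal sketch of how it could be generated. The function name make_k_list is our own invention and is not part of MEGAHIT. According to MEGAHIT's help text, every value must be odd, lie between 15 and 255, and increase by no more than 28 per step. This helper does not check those rules for you.

```shell
# Hypothetical helper (not part of MEGAHIT): build a comma-separated
# k-mer list from a start value, a step and an end value.
make_k_list() {
    # seq prints one number per line; paste joins the lines with commas.
    seq "$1" "$2" "$3" | paste -s -d, -
}

# Example use:
#   make_k_list 29 20 189
# prints 29,49,69,89,109,129,149,169,189
```

The result could then be passed to MEGAHIT with --k-list "$(make_k_list 29 20 189)" in place of the hand-typed list.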

If you need a command prompt whilst MEGAHIT is running (your current terminal will be busy), right-click on the main screen and choose Terminal.

Once the assembly is complete, look at the output FASTA file containing the contigs:

less K1/final.contigs.fa

There is not much to see by eye. When you are ready, quit less (press q) and carry on to QUAST.
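If you would like a quick sanity check before moving on, you can count the contigs from the command line. This is a minimal sketch that relies only on the fact that every FASTA record starts with a header line beginning with ">". The function name count_contigs is our own and is not part of MEGAHIT or QUAST.

```shell
# Hypothetical helper: count the sequences in a FASTA file by counting
# the header lines (those that start with ">").
count_contigs() {
    grep -c '^>' "$1"
}

# Example use:
#   count_contigs K1/final.contigs.fa
```

Do not be surprised if this number is larger than the contig count QUAST reports later. By default QUAST only considers contigs of at least 500 bp, whereas this count includes every contig in the file.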

21.2 QUAST

We will generate assembly metrics with QUAST. QUAST is a very popular genome assembly evaluation tool that produces an HTML report with various metrics, such as the number of contigs and the total length of the assembly.

21.2.1 Assembly assessment

Create a directory for the QUAST output.

#Create QUAST output directory
#The option -p will create a directory and any required
# parent directories
mkdir -p quast/K1

Run QUAST.

quast -o quast/K1 K1/final.contigs.fa

21.2.2 Report

QUAST will run relatively quickly. Once it is complete, view the QUAST report with Firefox.

firefox quast/K1/report.html

The report tells us quite a bit about the assembly quality. Two metrics that you may not be familiar with are N50 and L50. They are calculated as follows:

  • Order the contigs from longest to shortest.
  • Add up the contig lengths cumulatively, starting from the longest.
  • The contig at which the running total first reaches at least 50% of the total assembly length is the N50 contig.
  • N50 equals the length of the N50 contig.
  • L50 is the number of contigs with a length equal to or greater than N50. In other words, it is the smallest number of contigs that together cover at least half of the assembly.
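The steps above can be sketched on the command line. This is an illustration under our own assumptions, not how QUAST computes the values internally. The function names fasta_lengths and n50_l50 are hypothetical. The sketch also uses every contig in the file, whereas QUAST by default ignores contigs shorter than 500 bp, so the numbers may not match the report exactly.

```shell
# Hypothetical helper: print the length of each sequence in a FASTA file,
# one length per line. Sequences may span several lines.
fasta_lengths() {
    awk '/^>/ { if (seq_len) print seq_len; seq_len = 0; next }
         { seq_len += length($0) }
         END { if (seq_len) print seq_len }' "$1"
}

# Hypothetical helper: read contig lengths (one per line) from standard
# input and print "N50 L50".
n50_l50() {
    # Order the contigs from longest to shortest, then total them up.
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # Stop at the first contig where the running total
                # reaches at least 50% of the assembly length.
                if (running * 2 >= total) { print len[i], i; exit }
            }
        }'
}

# Example use:
#   fasta_lengths K1/final.contigs.fa | n50_l50
```

As a small worked example, contigs of length 80, 70, 50, 40, 30 and 20 total 290, so half the assembly is 145. The running total is 80 after the first contig and 150 after the second, so N50 is 70 and L50 is 2.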

21.3 MCQs

Brilliant! Using the QUAST report, answer the MCQs below.

Note: Because of the assembly process, your values may differ slightly (by < 1%). Please choose the closest value.

  1. What is the total length of the assembly?
  2. How many contigs does the assembly consist of?
  3. What is the GC% of the assembly?
  4. What is the N50 of the assembly?
  5. What is the L50 of the assembly?
  6. What is the length of the largest contig?

Questions

  • How do the contig metrics compare to the original reads?
  • Are there more reads than contigs?
  • Are the contigs longer than the original reads?
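To help answer the read-count question, the number of reads in a compressed FASTQ file can be counted from the command line. This is a minimal sketch that assumes the standard layout of exactly four lines per read. The function name count_fastq_reads is our own.

```shell
# Hypothetical helper: count the reads in a gzip-compressed FASTQ file,
# assuming each read occupies exactly four lines.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example use:
#   count_fastq_reads ../5-Stitched/K1.extendedFrags.fastq.gz
```

Remember that the assembly used three input files, so the counts for the stitched reads and for the unstitched forward and reverse reads all contribute to the comparison.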

21.4 Metagenome assembly summary

We now have an assembly. It is not a brilliant one because we only used 1 million reads. In a real analysis we would aim for fewer but longer contigs. In the next chapters we will explore some tools that we can use with our metagenome assembly.

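There is also MetaQUAST, a version of QUAST designed specifically for metagenome assemblies. However, it relies on reference genomes, either supplied by the user or identified and downloaded automatically.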