Chapter 22 Genome binning
A metagenome assembly consists of contigs from many different genomes. At this stage we don't know which contigs come from which species. We could try to taxonomically classify each contig, but there are two problems with this approach:
- Some contigs may be misclassified, so contigs from the same genome/organism can end up assigned to different taxa.
- Databases are incomplete and so some contigs will not be classified at all (microbial dark matter).
To alleviate these issues genomic binning can be carried out. This will cluster contigs into bins based on:
- Coverage: Contigs with similar coverage are more likely to be from the same genome.
- Composition: Contigs with similar sequence composition, such as GC content or tetranucleotide frequencies, are more likely to belong to the same genome.
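As a rough illustration of the composition signal, the sketch below computes the length and GC percentage of each contig in a FASTA file with awk. The toy file and its two contigs are invented for this example and are not tutorial data. Binners use richer composition statistics than GC alone, so treat this only as a way to eyeball the idea.

```shell
# Hypothetical toy assembly with two contigs (invented, not tutorial data)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.contigs.fa" <<'EOF'
>contig_1
ATGCGC
GCGCAT
>contig_2
ATATAT
ATGCAT
EOF

# Print contig name, length and GC percentage (tab separated)
gc_table=$(awk '
  /^>/ { if (name != "") printf "%s\t%d\t%.1f\n", name, len, 100 * gc / len
         name = substr($1, 2); len = 0; gc = 0; next }
  { seq = toupper($0); len += length(seq); gc += gsub(/[GC]/, "", seq) }
  END { if (name != "") printf "%s\t%d\t%.1f\n", name, len, 100 * gc / len }
' "$tmpdir/toy.contigs.fa")
echo "$gc_table"
```

Here contig_1 is GC rich and contig_2 is AT rich, so a composition-based binner would be unlikely to place them in the same bin.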
Genomic binning has been used to discover many new genomes. Additionally, it makes downstream analyses quicker as the downstream steps will be carried out on the sets of bins rather than on one large metagenome assembly.
Binning produces "bins" of contigs of various quality (e.g. draft, complete). These bins are also known as MAGs (Metagenome-assembled genomes). In other words, a MAG is a single genome that was assembled together with other genomes in a metagenome assembly and later separated from them. The term MAG has been adopted by the GSC (Genomic Standards Consortium).
Make sure your metagenome assembly is of good quality before binning. Binning requires contigs of good length and good coverage. Very short contigs and contigs with extremely low coverage will be excluded from binning.
22.1 MetaBAT2
We will use MetaBAT2 for our genome binning. It has three major upsides that make it very popular:
- It has very reliable default parameters meaning virtually no parameter optimisation is required.
- It performs very well amongst genome binners.
- It is computationally efficient compared to other binners (it requires less RAM, fewer cores etc.).
Make a new directory and move into it.
22.1.1 MetaBAT2: depth calculation
To carry out effective genome binning MetaBAT2 uses coverage information of the contigs. To calculate depth we need to align the reads to the metagenome assembly.
Index assembly
For the alignment we will use bwa. We need to index our assembly file prior to alignment.
Alignment
Next we will align the trimmed paired reads, the same reads we used to create the stitched reads, to the assembly. We will carry this out with the bwa mem command. bwa mem is a good aligner for short reads. If you are using long reads (PacBio or Nanopore), minimap2 will be more appropriate.
Sam to sorted bam
After alignment we need to get the file ready for the contig depth summarisation step. This requires converting the sam file to a bam file (the binary form of a sam file) and then sorting the bam file.
Summarise depths
Now we can summarise the contig depths from the sorted bam files with MetaBAT2's jgi_summarize_bam_contig_depths command.
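The four steps above (index, align, convert and sort, summarise) can be sketched as one shell function. This is a minimal sketch, not necessarily the exact commands used in this tutorial: the function name contig_depths and the intermediate file names are made up for illustration, and the read file names in the example call are hypothetical because they are not given in this chapter. It assumes bwa, samtools and MetaBAT2 are installed and on your PATH.

```shell
# Sketch only: the steps are wrapped in a function so nothing runs until it is called.
# Arguments (all supplied by you):
#   $1 assembly fasta, $2 forward reads, $3 reverse reads, $4 output prefix
contig_depths() {
  assembly=$1; r1=$2; r2=$3; prefix=$4

  # 1. Index the assembly (index files are written next to the fasta)
  bwa index "$assembly" || return 1

  # 2. Align the paired reads to the assembly, writing a sam file
  bwa mem "$assembly" "$r1" "$r2" > "$prefix.sam" || return 1

  # 3. Convert the sam file to bam, then sort the bam file
  samtools view -b -o "$prefix.bam" "$prefix.sam" || return 1
  samtools sort -o "$prefix.sort.bam" "$prefix.bam" || return 1

  # 4. Summarise the per-contig depths for MetaBAT2
  jgi_summarize_bam_contig_depths --outputDepth "$prefix.depth.txt" "$prefix.sort.bam"
}

# Example call (the read file names are hypothetical):
# contig_depths ~/6-Assembly/K1/final.contigs.fa K1_R1.fastq.gz K1_R2.fastq.gz K1
```

With the prefix K1, the final step writes K1.depth.txt, which is the depth file name used in the rest of this section.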
View summary depth
You can have a look at the depth file and you will notice there are many contigs with low coverage (<10) and of short length (<1500).
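One quick way to put numbers on this from the command line is to count the short and low coverage contigs with awk. The sketch below runs on a small invented table laid out like the depth file (contigName, contigLen, totalAvgDepth, then a depth and variance column per bam file); the contig names and values are made up, so point the awk command at your own depth file to get real counts.

```shell
# Toy depth table in the layout written by jgi_summarize_bam_contig_depths
# (values are invented for illustration, not tutorial results)
depth_file=$(mktemp)
printf 'contigName\tcontigLen\ttotalAvgDepth\tK1.sort.bam\tK1.sort.bam-var\n' > "$depth_file"
printf 'k141_1\t3200\t25.4\t25.4\t12.1\n' >> "$depth_file"
printf 'k141_2\t800\t3.2\t3.2\t1.5\n' >> "$depth_file"
printf 'k141_3\t1600\t8.9\t8.9\t4.0\n' >> "$depth_file"
printf 'k141_4\t450\t14.0\t14.0\t6.3\n' >> "$depth_file"

# Count contigs shorter than 1500 bp and contigs with average depth below 10
depth_counts=$(awk -F '\t' '
  NR > 1 { total++; if ($2 < 1500) short++; if ($3 < 10) low++ }
  END { printf "total=%d short=%d low_depth=%d\n", total, short, low }
' "$depth_file")
echo "$depth_counts"
```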
To get a better look we will open the file in R and look at a summary of the file's table.
Activate R:
Now in R we will read in the file and get a summary() of it.
#Read in the table as an object called df (short for data frame)
#We want the first row to be the column names (header=TRUE)
#We do not want R to check the column names and "fix" them (check.names=FALSE)
df <- read.table("K1.depth.txt", header=TRUE, check.names=FALSE)
#Create a summary of the data
summary(df)
The last command gave us summary information for all the columns. This includes the minimum, maximum, mean, median, and Inter-Quartile Range (IQR) values.
We can see the values of contigLen and totalAvgDepth are very low. However, this is most likely due to the many short and low coverage contigs, which will be ignored by MetaBAT2. Therefore we will remove rows for contigs shorter than 1500 and rerun the summary. MetaBAT2's documentation states that the minimum contig length must be at least 1500, with a default of 2500.
#Set the new object "df_min1500len" as all rows
#where the value in the column "contigLen" of "df"
#Is greater than or equal to 1500
df_min1500len <- df[df$contigLen >= 1500,]
#Summary of our new data frame
summary(df_min1500len)
That is looking better. The minimum average coverage for MetaBAT2 is 1 and our minimum value is 2.700 with a maximum of 93.285 (your values may differ slightly). Now you can quit R and continue.
Note: One of the reasons for our short contigs is that we only used a subset of our sequencing dataset for this tutorial due to time concerns.
22.1.2 MetaBAT2: run
With our assembly and its depth information we can run MetaBAT2 for binning.
#Make a directory for the bins
mkdir bins
#Run MetaBAT2
metabat2 \
--inFile ~/6-Assembly/K1/final.contigs.fa \
--outFile bins/K1 \
--abdFile K1.depth.txt \
--minContig 1500
Parameters
- --inFile: Input metagenome assembly fasta file.
- --outFile: Prefix of output files.
- --abdFile: Base depth file.
- --minContig: Minimum size of contigs to be used for binning.
  - The default is 2500.
  - We used the minimum value of 1500 as we are using tutorial data. We recommend using the default in your own analysis.
22.1.3 MetaBAT2: output
List the contents of the output directory and you'll see there is 1 fasta file with the prefix of K1. This is a bin that will hopefully contain 1 MAG (Metagenome-Assembled Genome). In your future analysis you may get many bins, each hopefully only having one MAG.
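Beyond listing the directory, it can help to see how many contigs and bases each bin holds. The sketch below does this with awk on a toy bins directory. The file name K1.1.fa and its sequences are invented for the example, on the assumption that MetaBAT2 names its bins as the output prefix followed by a bin number; replace "$bins_dir" with bins to run it on your own output.

```shell
# Toy bins directory standing in for the tutorial's bins/ (sequences are invented)
bins_dir=$(mktemp -d)
printf '>k141_1\nATGCATGCAT\n>k141_3\nGGCCGG\n' > "$bins_dir/K1.1.fa"

# For each bin fasta, report the file name, number of contigs and total bases
bin_stats=$(for f in "$bins_dir"/*.fa; do
  awk -v bin="$(basename "$f")" '
    /^>/ { n++; next }
    { bases += length($0) }
    END { printf "%s\t%d\t%d\n", bin, n, bases }
  ' "$f"
done)
echo "$bin_stats"
```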
In the next chapter we will assess the quality of these bins.