Chapter 26 Bin quality scores
One quick way to calculate the overall quality of a MAG/bin is with the following equation:
\[ q = comp - (5 * cont) \]
Where:
- q = Overall quality
- comp = Completeness
- cont = Contamination
A score of at least 70-80% (i.e. 0.7 to 0.8) would be the aim, with a maximum/perfect value being 100% (100% completeness, 0% contamination). We'll therefore calculate this for the bins with some bash and awk scripting.
Note: Values will range from:
- 100% (i.e. 1): 1 Completeness - (5 * 0 Contamination)
- -500% (i.e. -5): 0 Completeness - (5 * 1 Contamination)
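As a quick check of the equation, the arithmetic can be tried directly in awk. This is only a sketch: the helper name quality is hypothetical, and the third pair of input values is made up for illustration rather than taken from the tutorial data.

```shell
# quality COMP CONT: prints comp - (5 * cont)
# ('quality' is a hypothetical helper name used only for this illustration)
quality() {
    awk -v comp="$1" -v cont="$2" 'BEGIN { print comp - (5 * cont) }'
}

quality 1 0        # perfect bin: prints 1
quality 0 1        # worst possible bin: prints -5
quality 0.9 0.02   # made-up bin: prints 0.8
```

The made-up bin shows how heavily contamination is weighted: 2% contamination costs 10 percentage points of quality.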
26.1 Quality file
We will create a new file with only the quality information called "MAGS_quality.csv".
Make the file so that it contains only the header "quality". We will add the quality scores to it later.
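One way to do this, assuming we are working in the directory where the file should live, is to redirect echo into the new file:

```shell
# Create MAGS_quality.csv containing only the header line "quality".
# Note: > overwrites any existing file with this name.
echo "quality" > MAGS_quality.csv
```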
26.2 Calculate quality with awk
Next is the most complicated command. We will be calculating the Overall quality (see calculation above) for each row except the header row.
We will be using awk, a text-processing language available on Linux. It is very useful as it can carry out calculations on columns or, as awk calls them, fields.
As this is new and complicated we will build up our command step by step.
26.2.1 Extract fields/columns
The first step is to extract the completeness and contamination fields/columns.
- -F,: Indicates the input fields are separated by commas (,).
- '': All the awk options are contained within the quotes.
- {}: We can supply a function to awk within the braces.
- print $2,$3: This function instructs awk to print the 2nd (completeness) and 3rd (contamination) fields. It is common to put commas (,) between fields if printing multiple fields.
- cocopye_output.csv: Our last parameter is the input file. We are not changing the contents of the file, only printing information to screen/stdout.
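Put together, the options above would presumably form a command like the one sketched below. So that it runs without the real file, the sketch pipes in a small hypothetical sample: the bin names, values and toy_method are made up, and the column layout is simplified to the fields mentioned in this chapter.

```shell
# Hypothetical stand-in for cocopye_output.csv (made-up bins and values)
sample='bin,completeness,contamination,method
toy.1,0.95,0.01,toy_method
toy.2,-1,-1,rejected'

# Print fields 2 and 3 of every line, header included
fields=$(printf '%s\n' "$sample" | awk -F, '{print $2,$3}')
printf '%s\n' "$fields"
```

With the real data, the file name would be given as the last parameter instead of piping in a sample, i.e. awk -F, '{print $2,$3}' cocopye_output.csv. Note that the comma in print $2,$3 makes awk separate the printed fields with a space.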
26.2.2 Ignore header
We do not want the header in our calculation so we will add an extra awk option.
- NR>1: NR stands for number of records. Rows are called records in awk. Therefore NR>1 means awk will only carry out the functions on the records numbered greater than 1, i.e. skip row 1, the header row.
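A sketch of the extended command, again run on a hypothetical sample (made-up bin names and values) rather than the real file:

```shell
# Hypothetical stand-in for cocopye_output.csv (made-up bins and values)
sample='bin,completeness,contamination,method
toy.1,0.95,0.01,toy_method
toy.2,-1,-1,rejected'

# NR>1 placed before the braces restricts the print to records after the first
no_header=$(printf '%s\n' "$sample" | awk -F, 'NR>1 {print $2,$3}')
printf '%s\n' "$no_header"
```

The header line is no longer printed; only the two data rows remain.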
26.2.3 Calculate quality
The next step is to carry out the overall quality calculation.
Our new function, {print $2 - (5 * $3)}, carries out the overall quality calculation and prints it for each record/row except the first (NR>1).
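The calculation step might look like the sketch below. The sample data is hypothetical (made-up bin names and values) and includes one row with -1 values to show where a quality of 4 comes from.

```shell
# Hypothetical stand-in for cocopye_output.csv (made-up bins and values)
sample='bin,completeness,contamination,method
toy.1,0.95,0.01,toy_method
toy.2,-1,-1,rejected'

# For every record after the header, print completeness - (5 * contamination)
scores=$(printf '%s\n' "$sample" | awk -F, 'NR>1 {print $2 - (5 * $3)}')
printf '%s\n' "$scores"
```

For this sample the output is 0.9 for toy.1 and 4 for toy.2.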
You will notice that we have values that equal 4. Let us fix that.
26.2.4 Fix values
Some quality values come out as 4.
This is not correct and comes about as some completeness and contamination values have been set to -1 (-1 - (5 * -1) = 4).
If you look at the file cocopye_output.csv you will notice that the bins with -1 values have rejected as their method value.
These are bins which failed the Input Pre-processing step.
We will therefore change these quality values to the lowest possible value of -5 (0 - (5 * 1) = -5).
To do this we pipe (|) our output to sed, which replaces with -5 any line where the start of the line (^) and the end of the line ($) surround a single 4.
In other words we replace lines that only contain a 4 with a -5.
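A sketch of the substitution, assuming a sed expression of the form s/^4$/-5/. The sample is hypothetical (made-up bin names and values) and includes a bin whose quality is 0.4 to show that only lines consisting of exactly 4 are changed.

```shell
# Hypothetical stand-in for cocopye_output.csv (made-up bins and values)
sample='bin,completeness,contamination,method
toy.1,0.95,0.01,toy_method
toy.2,-1,-1,rejected
toy.3,0.5,0.02,toy_method'

# Calculate quality, then replace lines that are exactly "4" with "-5"
fixed=$(printf '%s\n' "$sample" \
    | awk -F, 'NR>1 {print $2 - (5 * $3)}' \
    | sed 's/^4$/-5/')
printf '%s\n' "$fixed"
```

The anchors matter: without ^ and $, the 4 inside 0.4 would also be replaced. A genuine quality of 4 cannot occur, as the maximum possible value is 1, so the replacement is safe.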
26.2.5 Append to quality file
Finally we can append the quality values to our MAGS_quality.csv file.
We use >> to append the information to the file MAGS_quality.csv. We append, rather than overwrite with >, because we want to retain the header we added to the file earlier.
You can view the file to ensure it worked. The first and second values should be 0.9838 and 0.493.
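The append step could be sketched as follows. To avoid touching the real files, the sketch works in a temporary directory with hypothetical file names (toy_output.csv and toy_quality.csv) and made-up values.

```shell
# Work in a temporary directory so no real files are overwritten
workdir=$(mktemp -d)

# Hypothetical stand-ins for cocopye_output.csv and MAGS_quality.csv
printf '%s\n' \
    'bin,completeness,contamination,method' \
    'toy.1,0.95,0.01,toy_method' \
    'toy.2,-1,-1,rejected' > "$workdir/toy_output.csv"
echo "quality" > "$workdir/toy_quality.csv"

# Calculate, fix the 4 values, and append (>>) below the existing header
awk -F, 'NR>1 {print $2 - (5 * $3)}' "$workdir/toy_output.csv" \
    | sed 's/^4$/-5/' >> "$workdir/toy_quality.csv"

cat "$workdir/toy_quality.csv"
```

The resulting file holds the header followed by one quality value per bin, in the same order as the input file.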
26.3 Add quality to the CoCoPyE results file
Now we can combine the files cocopye_output.csv and MAGS_quality.csv with the paste command into a new file called cocopye_quality.csv. The -d "," option indicates the merged files will be separated by commas (,), matching the column separation in cocopye_output.csv.
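A sketch of the paste step, using hypothetical file names and made-up values in a temporary directory so that the real files are left alone:

```shell
# Work in a temporary directory so no real files are overwritten
workdir=$(mktemp -d)

# Hypothetical stand-ins for cocopye_output.csv and MAGS_quality.csv
printf '%s\n' \
    'bin,completeness,contamination,method' \
    'toy.1,0.95,0.01,toy_method' \
    'toy.2,-1,-1,rejected' > "$workdir/toy_output.csv"
printf '%s\n' 'quality' '0.9' '-5' > "$workdir/toy_quality.csv"

# Join the files line by line, separating them with a comma
paste -d "," "$workdir/toy_output.csv" "$workdir/toy_quality.csv" \
    > "$workdir/toy_combined.csv"

cat "$workdir/toy_combined.csv"
```

paste joins line 1 with line 1, line 2 with line 2, and so on. It therefore relies on both files having the same number of lines in the same bin order, which holds here because the quality file was generated from the results file row by row.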
26.4 MCQs
View the file cocopye_quality.csv and attempt the questions below.
Tip: You can use the cut command to look at specific columns. For example:
#look at the "bin" and "quality" columns
#Convert the printed output's commas to tabs for readability
cut -d "," -f 1,8 cocopye_quality.csv | tr "," "\t"
- What lineage was assigned to bin K1.1?
- What lineage was assigned to bin K1.22?
- What lineage was assigned to bin K1.8?
- What is the quality value of K1.1?
- What is the completeness value of K1.30?
- What is the contamination value of K1.12?
- Which bin has the highest quality value (98.38%)?
- Which bin has the quality value of -2.9215?
- Which bin has the highest completeness value (98.59%)?
26.5 Bin quality summary
It is always useful to know the quality of your bins so you know which are more reliable than others. With that information you can be more or less certain when concluding your findings.
We have some good quality bins but many poorer quality bins too. Ultimately binning is trying to separate all the genomes from each other. A better metagenome assembly would most likely have led to better binning.