Chapter 26 Bin quality scores

One quick way to calculate the overall quality of a MAG/bin is with the following equation:

\[ q = comp - (5 \times cont) \]

Where:

q = Overall quality
comp = Completeness
cont = Contamination

A score of at least 70-80% (i.e. 0.7 to 0.8) would be the aim, with the maximum (perfect) value being 100% (100% completeness, 0% contamination). We will therefore calculate this score for each bin with some bash and awk scripting.

Note: Values will range from:

100% (i.e. 1): 1 Completeness - (5 * 0 Contamination)
-500% (i.e. -5): 0 Completeness - (5 * 1 Contamination)
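As a quick worked example of the equation, the sketch below computes the score for a single made-up bin. The values 0.9 and 0.03 are hypothetical and only there to show the arithmetic.

```shell
# Hypothetical values for illustration: 90% complete, 3% contaminated.
comp=0.9
cont=0.03

# awk does the floating-point arithmetic; -v passes shell values in as awk variables.
q=$(awk -v comp="$comp" -v cont="$cont" 'BEGIN { print comp - (5 * cont) }')
echo "quality: $q"
```

With these values the score is 0.9 - (5 * 0.03) = 0.75, so this bin would sit within the 0.7 to 0.8 target range.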

26.1 Quality file

We will create a new file with only the quality information called "MAGS_quality.csv".

Make the file so that it contains only the header "quality". We will add the quality scores to this later.

echo "quality" > MAGS_quality.csv

26.2 Calculate quality with awk

Next is the most complicated command. We will be calculating the Overall quality (see calculation above) for each row except the header row.

We will be using awk, a text-processing language available on Linux systems. It is very useful as it can carry out calculations on columns or, as awk calls them, fields.

As this is new and complicated we will build up our command step by step.

26.2.1 Extract fields/columns

The first step is to extract the completeness and contamination fields/columns.

awk -F, '{print $2,$3}' cocopye_output.csv

-F,: Indicates the input fields are separated by commas (,).
'': All the awk options are contained within the quotes.
{}: We can supply a function to awk within the braces.
print $2,$3: This function instructs awk to print the 2nd (completeness) and 3rd (contamination) fields. The comma (,) between the fields tells awk to separate them in the output, by default with a single space.
cocopye_output.csv: Our last parameter is the input file. We are not changing the contents of the file, only printing information to screen/stdout.
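To see what the comma inside print does, here is a small sketch run on a stand-in file. The file contents and column names are made up for illustration and are not the real cocopye_output.csv. It also shows awk's OFS (output field separator) variable, which can be set if you would rather keep commas in the output.

```shell
# Build a tiny stand-in for cocopye_output.csv (column names and values are made up).
sample=$(mktemp)
printf '%s\n' \
  'bin,completeness,contamination' \
  'binA,0.95,0.01' \
  'binB,0.60,0.10' > "$sample"

# Default: the comma in print joins the fields with a space (awk's default OFS).
space_out=$(awk -F, '{print $2,$3}' "$sample")

# Setting OFS keeps the output comma-separated instead.
comma_out=$(awk -F, -v OFS=, '{print $2,$3}' "$sample")

echo "$space_out"
echo "$comma_out"
rm -f "$sample"
```

The first command prints lines such as "0.95 0.01", the second "0.95,0.01".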

26.2.2 Ignore header

We do not want the header in our calculation so we will add an extra awk option.

awk -F, 'NR>1 {print $2,$3}' cocopye_output.csv

NR>1: NR stands for number of records and holds the number of the record currently being read. Rows are called records in awk. Therefore NR>1 means awk will only carry out the function on records numbered greater than 1, i.e. it skips row 1, the header row.
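If it helps to see the record numbers, the sketch below prints NR next to the first field of each row. It uses a made-up stand-in file rather than the real cocopye_output.csv.

```shell
# Stand-in file with a header and two data rows (contents are made up).
sample=$(mktemp)
printf '%s\n' 'bin,completeness,contamination' 'binA,0.95,0.01' 'binB,0.60,0.10' > "$sample"

# Print each record number next to the first field to see how awk counts rows.
numbered=$(awk -F, '{print NR, $1}' "$sample")
echo "$numbered"

# The NR>1 pattern drops record 1, so only the two data rows remain.
data_rows=$(awk -F, 'NR>1 {print NR, $1}' "$sample")
echo "$data_rows"
rm -f "$sample"
```

The header is record 1, so the filtered output starts at record 2.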

26.2.3 Calculate quality

The next step is to carry out the overall quality calculation.

awk -F, 'NR>1 {print $2 - (5 * $3)}' cocopye_output.csv

Our new function, {print $2 - (5 * $3)}, carries out the overall quality calculation and prints it for each record/row except the first (NR>1).

You will notice that we have values that equal 4. Let us fix that.

26.2.4 Fix values

Some quality values come out as 4. This is not correct; it arises because some completeness and contamination values have been set to -1 (-1 - (5 * -1) = 4). If you look at the file cocopye_output.csv you will notice the bins with -1 values have rejected as their method value. These are bins which failed the Input Pre-processing step.

We will therefore change these quality values to the lowest possible value of -5 (0 - (5 * 1) = -5).

awk -F, 'NR>1 {print $2 - (5 * $3)}' cocopye_output.csv | \
sed "s/^4$/-5/"

In this case we pipe (|) our output to sed to substitute -5 for 4 on lines where the 4 sits at both the start (^) and the end ($) of the line.

In other words we replace lines that only contain a 4 with a -5.
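As an alternative sketch (not the approach used in this chapter), the same fix could be made inside awk itself by testing for the -1 placeholder before calculating. The stand-in file below is made up, and the sketch assumes negative values only ever mark rejected bins.

```shell
# Stand-in file: binB mimics a rejected bin with -1 placeholders (contents are made up).
sample=$(mktemp)
printf '%s\n' \
  'bin,completeness,contamination' \
  'binA,0.95,0.01' \
  'binB,-1,-1' > "$sample"

# If either value is negative (the -1 placeholder), print -5; otherwise apply the formula.
scores=$(awk -F, 'NR>1 { if ($2 < 0 || $3 < 0) print -5; else print $2 - (5 * $3) }' "$sample")
echo "$scores"
rm -f "$sample"
```

This avoids relying on the rejected bins happening to produce a score of exactly 4, at the cost of a slightly longer awk command.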

26.2.5 Append to quality file

Finally we can append the quality values to our MAGS_quality.csv file.

awk -F, 'NR>1 {print $2 - (5 * $3)}' cocopye_output.csv | \
sed "s/^4$/-5/" >> MAGS_quality.csv

In the above case we use >> to append the information to the file MAGS_quality.csv. We append because we want to retain the header we added to the file earlier.

You can view the file to ensure it worked. The first and second values should be 0.9838 and 0.493.

less MAGS_quality.csv

26.3 Add quality to the CoCoPyE results file

Now we can combine the files cocopye_output.csv and MAGS_quality.csv with the paste command into a new file called cocopye_quality.csv. The -d "," option indicates the merged files will be separated by commas (,), matching the column separation in cocopye_output.csv.

paste -d "," cocopye_output.csv MAGS_quality.csv > cocopye_quality.csv
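Because paste joins lines purely by position, it can be worth checking that both files have the same number of lines before merging (for example, the append step may have been run more than once). The sketch below shows one way to do that, using small made-up stand-in files rather than the real ones.

```shell
# Stand-in files (made-up contents) playing the roles of cocopye_output.csv and MAGS_quality.csv.
tmpdir=$(mktemp -d)
printf '%s\n' 'bin,completeness,contamination' 'binA,0.95,0.01' > "$tmpdir/results.csv"
printf '%s\n' 'quality' '0.9' > "$tmpdir/quality.csv"

# Count lines in each file; tr strips the padding some wc versions add.
n_results=$(wc -l < "$tmpdir/results.csv" | tr -d ' ')
n_quality=$(wc -l < "$tmpdir/quality.csv" | tr -d ' ')

# Only merge when the counts agree, otherwise report the mismatch.
if [ "$n_results" -eq "$n_quality" ]; then
  merged=$(paste -d "," "$tmpdir/results.csv" "$tmpdir/quality.csv")
  echo "$merged"
else
  echo "Line counts differ: $n_results vs $n_quality" >&2
fi
rm -rf "$tmpdir"
```

If the counts differ, recreating the quality file from the header step onwards is usually the simplest fix.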

26.4 MCQs

Viewing the file cocopye_quality.csv, attempt the questions below.

Tip: You can use the cut command to look at specific columns. For example:

#look at the "bin" and "quality" columns
#Convert the printed output's commas to tabs for readability
cut -d "," -f 1,8 cocopye_quality.csv | tr "," "\t"
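Along the same lines, awk and sort can be combined to list only the bins that meet the 0.7 quality target, best first. This is a sketch on a made-up stand-in file with fewer columns than the real one. It assumes quality is the last column, which awk refers to as $NF.

```shell
# Stand-in for cocopye_quality.csv (made-up values, fewer columns than the real file).
sample=$(mktemp)
printf '%s\n' \
  'bin,completeness,contamination,quality' \
  'binA,0.95,0.01,0.9' \
  'binB,0.60,0.10,0.1' \
  'binC,0.90,0.02,0.8' > "$sample"

# Keep the bin name and the last field (quality) for rows scoring at least 0.7,
# then sort numerically on the second column, highest first.
good_bins=$(awk -F, 'NR>1 && $NF >= 0.7 {print $1, $NF}' "$sample" | LC_ALL=C sort -k2,2nr)
echo "$good_bins"
rm -f "$sample"
```
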

What lineage was assigned to bin K1.1?
Bacteria / Bacteroides / Lachnospiraceae
What lineage was assigned to bin K1.22?
Bacteria / Bacteroides / Lachnospiraceae
What lineage was assigned to bin K1.8?
Bacteria / Bacteroides / Lachnospiraceae
What is the quality value of K1.1?
0.0202 / 0.6724 / 0.9769
What is the completeness value of K1.30?
0.0202 / 0.6724 / 0.9769
What is the contamination value of K1.12 bin?
0.0202 / 0.6724 / 0.9769
Which bin has the highest quality value (98.38%)?
K1.20 / K1.26 / K1.22
Which bin has the quality value of -2.9215?
K1.20 / K1.26 / K1.22
Which bin has the highest completeness value (98.59%)?
K1.20 / K1.26 / K1.22

26.5 Bin quality summary

It is always useful to know the quality of your bins so you know which are more reliable than others. With that information you can be more or less certain when concluding your findings.

We have some good quality bins but many poorer quality bins too. Ultimately binning is trying to separate all the genomes from each other. A better metagenome assembly would most likely have led to better binning.