Chapter 24 Assembly functional annotation

24.1 Taxonomic annotation

Taxonomic annotation of bins can be carried out with Kraken2. As we have already done this for the reads and taxonomic results from read and assembly approaches have similar performances we will not cover it here. Instead we will move straight onto functional annotation with Bakta.

Conda Environment: Go back to your shotgun_meta terminal (or create a new one and use . useshotgun).

24.2 Bakta

We will carry out Bakta functional annotation. Bakta can annotate bacterial genomes and plasmids from both isolates and MAGs.

Make a new directory and move into it.

mkdir ~/8-Annotation
cd ~/8-Annotation

24.2.1 Bakta: run

Now we can annotate one of the bins.

The below will take a long time to run (>1 hour). Instead of running it skip onto the next section to copy pre-made output to continue with. This command is here so you know what to run in your own future analyses.

bakta \
--db /pub14/tea/nsc206/databases/bakta/db/ \
-o K1.1 \
~/7-Binning/K1_fullset/bins/K1.1.fa

Parameters

  • --db: Location of Bakta database. You will need to install this in your own installation. Instructions are in the appendix.
  • -o: The output directory. This must not exist before running the command.
  • The last parameter is the fasta file containing the genome/plasmid you would like annotated.

24.2.2 Bakta: premade results

Link the pre-made results for all the K1 bins.

#Link all data
ln -s /pub14/tea/nsc206/NEOF/Shotgun_metagenomics/bakta/K1_fullset/* .

24.2.3 Bakta: output

List the files in the newly created K1.1 directory. Each of the files has the prefix "K1.1" and contains the following information:

  • .tsv: annotations as simple human readble TSV
  • .gff3: annotations & sequences in GFF3 format
  • .gbff: annotations & sequences in (multi) GenBank format
  • .embl: annotations & sequences in (multi) EMBL format
  • .fna: replicon/contig DNA sequences as FASTA
  • .ffn: feature nucleotide sequences as FASTA
  • .faa: CDS/sORF amino acid sequences as FASTA
  • .hypotheticals.tsv: further information on hypothetical protein CDS as simple human readble tab separated values
  • .hypotheticals.faa: hypothetical protein CDS amino acid sequences as FASTA
  • .json: all (internal) annotation & sequence information as JSON
  • .txt: summary as TXT
  • .png: circular genome annotation plot as PNG
    • These are only useful for complete/near complete circular genomes.
    • I would suggest looking at GenoVi for circular genome plots.
  • .svg: circular genome annotation plot as SVG
  • .log: Log file of command.

View the summary file for bin K1.1.

less -S K1.1/K1.1.txt

Sequence information

  • Length: Number of bases.
  • Count: Number of contigs/scaffolds.
  • GC: GC%.
  • N50: N50.
  • N ratio: Ratio of N bases to non-N bases.
  • coding density: Percentage of bases within coding regions.

Annotation information.

  • tRNAs: Transfer RNAs.
  • tmRNAs: Transfer-messenger RNA.
  • rRNAs: Ribosomal RNAs.
  • ncRNAs: Non-coding RNAs.
  • ncRNA regions: Non-coding RNA regions.
  • CRISPR arrays: CRISPR arrays.
  • CDSs: Coding sequences.
  • pseudogenes: Segments of DNA that structurally resembles a gene but is not capable of coding for a protein
  • hypotheticals: Hypothetical genes, which are predicted solely by computer algorithms, are experimentally uncharacterized genes.
  • signal peptides: Short peptides (usually 16-30 amino acids long) normally present at the N-terminus of most newly synthesized proteins that are destined toward the secretory pathway.
  • sORFs: Short open reading frames (<100 amino acids).
  • gaps: Gaps in the genome assembly.
  • oriCs: Chromosome replication origin for bacteria.
  • oriVs: Plasmid replication origin.
  • oriTs: An origin of transfer (oriT) is a short sequence ranging from 40-500 base pairs in length. It is necessary for the transfer of DNA from a gram-negative bacterial donor to recipient during bacterial conjugation.

View the gff file for bin K1.1.

less K1.1/K1.1.gff3

The GFF file is a tab delimited file containing annotation information for the features in the assembly/bin. In this case it is a GFF3 file (most curent version of GFF).

There is quite a lot of information contained in each row so instead of listing all the columns here please have a look at the official documentation:

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

The code below is for your future analysis, do not run it now as it will take too long.

ls -1 ~/7-Biinning/K1_fullset/bins/*fa | while read f ; \
do s=$(basename $f | sed "s/.fa//") ; echo $s ; \
bakta --db /pub14/tea/nsc206/databases/bakta/db/ \
-o ${s} \
~/7-Binning/K1_fullset/bins/K1.1.fa

A quick thing we can do with these files is to see if any of the bins contain a specific annotation. For example, if we wanted to know if there were any ATP-binding proteins in any of the bins we could carry out the below command.

grep "ATP-binding protein" */*gff3 | less -S

We can now view the lines containing "ATP-binding protein" with the start of the line containing the file name the line belongs to.

In your future analyses you can expect these files further with excel, R, or visualisation software like IGV (https://software.broadinstitute.org/software/igv/GFF).

What if you want to know about pathways?

24.3 MinPath

MinPath can predict MetaCyc metabolic pathways. These pathway are made up of sets of enzymes that catalyse various reactions.

Ensure you start in the ~/8-Annotation directory.

24.3.1 MinPath: EC extraction

Before we can estimate the pathways we need to extract the EC numbers predictions from the GFF file. EC (Enzyme Commission) numbers are a numerical classification for enzymes based on the reaction they catalyse.

Unless you know the EC scheme well they are generally not helpful by themselves. An example EC number is EC 3.1.1. The numbers represents the group the enzyme belongs to with the first number being the biggest group. From highest to the lowest grouping 3.1.1 represents:

  • 3: Hydrolase enzymes.
  • 3.1: Hydrolase enzymes that act on ester bonds.
  • 3.1.1: Carboxylic Ester Hydrolases.

With all that information we will extract the EC annotations from the GFF files. First we'll create a directory for the output.

#Create EC directory
mkdir EC

We will now use a loop with various parts to create EC annotation files. The input .ec files required for MinPath are tab delimited with two columns:

  1. Protein/sequence id. E.g. GDGAPA_12670.
  2. EC annotation. E.g. 6.1.1.4.

Note: The lack of \ at the end of most of the lines is intentional. The below is all one command over multiple lines but loops work slightly different and don't need \ in certain parts. Ensure you do press enter at the end of each line.

ls -1 */*gff3 | sed "s|.*/||" | sed "s/.gff3//" | while read bin
do
cat ${bin}/${bin}.gff3 | grep "EC:" | cut -f 9 | sed "s/ID=//" | \
sed "s/;.*EC:/\t/" | sed "s/,.*//" > EC/${bin}.ec
done

24.3.1.1 Code explanation

That is quite a bit of code. You don't need to understand it as it should always work for Bakta output. If you are not currently interested you can skip to the MinPath: run section

If you are interested in how it works we'll break it down with examples for you to run.

The first part lists all our .gff files on one line each (ls -1 *gff). Next, we remove the directory name (sed "s|.*||"). Then the suffix .gff is substituted with nothing sed "s/.gff//". This gives us the name of each bin (e.g. K1.1, K1.2).

#List all gff files on one (-1) each
ls -1 */*gff3
#Remove the directory name
#You can use any character as the divider in sed
#Useful when you want to move slashes from a file name
ls -1 */*gff3 | sed "s|.*/||"
#List all the file prefixes (on one line each)
ls -1 */*gff3 | sed "s|.*/||" | sed "s/.gff3//"

With the above code we can loop through each file prefix and use the variable bin (arbitrarily chosen) which contains the file prefix. This is carried out with while read bin. All the lines between the do (start of the loop) and done (end of loop) are line run in the loop.

Run the loop with an echo command to show we are using the ${bin} variable to specify the input and output files.

ls -1 */*gff3 | sed "s|.*/||" | sed "s/.gff3//" | while read bin
do
echo "${bin}/${bin}.gff3 ../EC/${bin}.ec"
done

Now to look at the command within the loop:

cat ${bin}.gff3 | grep "EC:" | cut -f 9 | \
sed "s/ID=//" | sed "s/;.*;Dbxref=/,/" | \
sed "s/,.*EC:/\t/" | sed "s/,.*//" >../EC/${bin}.ec

A good way to figure out what a workflow is doing, is by building it up step by step. Run the first part of workflow and then add the next section, run, repeat. This shows you how each new section is affecting the output. Additionally, it is always good to run it on one file and head the output so we have a manageable amount of data to look at.

We'll do that with the K1.1.gff3 file.

Note: Remember we are adding head to the end for ease of viewing.

Tips:

  • Use the up arrow to go back to previously run commands that you can then edit.
  • Remember the clear command.
#Grab every line containing "EC:"
cat K1.1/K1.1.gff3 | grep "EC:" | head
#Cut out the 9th column/field (-f) (i.e. only keep the 9th column)
#This is the attributes field in GFF3
#This contains a plethora of information including the EC annotation if present
#cut uses tabs as the default column/field delimiter
cat K1.1/K1.1.gff3 | grep "EC:" | cut -f 9 | head
#The gff3 attributes field starts with the ID
#We want to keep this but remove the "ID=" part
cat K1.1/K1.1.gff3 | grep "EC:" | cut -f 9 | sed "s/ID=//" | head
#We don't want any of the info between the ID and the EC number
#Therefore we want to remove everything (.*) between
# the first ";" (at the end of the ID info)
# and "EC="
#We'll replace this with a \t to seprarate the ID and EC
# columns with a tab (required by MinPath)
cat K1.1/K1.1.gff3 | grep "EC:" | cut -f 9 |  sed "s/ID=//" | \
sed "s/;.*EC:/\t/" | head
#Finally remove all the info after the EC number
#This info will be after the last ,
cat K1.1/K1.1.gff3 | grep "EC:" | cut -f 9 | sed "s/ID=//" | \
sed "s/;.*EC:/\t/" | sed "s/,.*//" | head

In the looped command, output is written into files: > EC/${bin}.ec.

24.3.2 MinPath: run

With our .ec files we can create our MetaCyc predictions.

First we'll change directory into the EC directory. Then create an output directory for the MetaCyc predictions.

cd ./EC
mkdir ../MetaCyc

Now we can loop through the file suffixes to run MinPath.

ls -1 *ec | sed "s/.ec//" | while read bin
do
python /pub14/tea/nsc206/git_installs/Minpath/MinPath.py \
-ec ${bin}.ec \
-report ../MetaCyc/${bin}.minpath
done

A lot of output will be printed to screen but this can be ignored unless you see warnings.

24.3.3 MinPath: output

First, change directory into the MetaCyc directory.

#Change directory
cd ../MetaCyc
#List contents
ls

From the CheckM results we found that bin K1.22 was very good with a quality score >96%. We will therefore have a look at its output.

Have a look at the .minpath -report file for K1.22.

less K1.22.minpath

The file contains the following columns

  1. Pathway ID
  2. Pathway reconstruction: Only available for genomes annotated in MetaCyc database.
  3. Naive: Indicates if pathway was reconstructed by the naive mapping approach (1) or not (0).
  4. Minpath: Indicates if the pathway was kept (1) or removed (0) by MinPath.
  5. Fam0: The number of families involved in the pathway.
  6. Fam-found: Number of families in pathway that were annotated/found.
  7. Name: Description of pathway.

Quit (q) less when you are happy.

There are some quick things we can do in bash with these files.

#Count number of pathways found in each bin with word count
wc -l *minpath
#Grab every line with "PWY-6972" from every file
grep "PWY-6972" *minpath | less

24.3.4 KEGGs

You can also get KEGG information with MinPath. The code:

#Change to correct directory
cd ~/8-Annotation
#Make directory for KEGGs
mkdir KEGG
#KEGG extractions
ls -1 */*gff3 | sed "s|.*/||" | sed "s/.gff3//" | while read bin
do
cat ${bin}/${bin}.gff3 | grep "KEGG:" | cut -f 9 | sed "s/ID=//" | \
sed "s/;.*KEGG:/\t/" | sed "s/,.*//" > KEGG/${bin}.kegg
done
#Make directory for KEGG minpath output
mkdir KEGG_minpath
#Change directory to KEGG
cd KEGG
#Run MinPath
ls -1 *kegg | sed "s/.kegg//" | while read bin
do
python /pub14/tea/nsc206/git_installs/Minpath/MinPath.py \
-ko ${bin}.kegg \
-report ../KEGG_minpath/${bin}.minpath
done

With these files you can then investigate what bins have which pathways. Additionally, with more samples analysed you can determine which samples have which pathways present.