Chapter 28 Bakta
We will carry out Bakta functional annotation. Bakta can annotate bacterial genomes and plasmids from both isolates and MAGs.
Make a new directory and move into it.
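As a minimal sketch of this step, the commands below create a working directory and move into it. The directory name `bakta_annotation` is a placeholder chosen for illustration, not a name taken from this workshop; use whatever fits your own project layout.

```shell
# "bakta_annotation" is a placeholder directory name (an assumption, not from the text).
workdir="bakta_annotation"

# -p avoids an error if the directory already exists.
mkdir -p "$workdir"
cd "$workdir"

# Confirm where we are.
pwd
```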
28.1 Annotation
Now we can annotate one of the bins.
The command below takes a long time to run (>1 hour). Instead of running it, skip to the next section, where you will copy pre-made output to continue with. The command is shown here so that you know what to run in your own future analyses.
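A sketch of what such a command could look like, using only the options described under Parameters below. The database and genome paths are placeholders (assumptions) and must be replaced with your own locations. The guard around the call is an addition for illustration: it only runs Bakta if the program is available and the output directory does not yet exist.

```shell
# Placeholder paths (assumptions): point these at your own database and bin.
db_dir="/path/to/bakta_db"
genome="/path/to/bins/K1.1.fa"
out_dir="K1.1"

if ! command -v bakta >/dev/null 2>&1; then
    echo "bakta was not found on PATH; activate its environment first." >&2
elif [ -e "$out_dir" ]; then
    # The output directory must not exist before running the command.
    echo "$out_dir already exists; remove it or choose another name." >&2
else
    bakta --db "$db_dir" -o "$out_dir" "$genome"
fi
```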
Parameters
- --db: Location of the Bakta database. You will need to install this in your own installation. Instructions are in the appendix.
- -o: The output directory. This must not exist before running the command.
- The last parameter is the FASTA file containing the genome/plasmid you would like annotated.
28.2 Annotation files
List the files in the newly created K1.1 directory. Each of the files has the prefix "K1.1" and contains the following information:
- .tsv : annotations as simple human readable TSV
- .gff3 : annotations & sequences in GFF3 format
- .gbff : annotations & sequences in (multi) GenBank format
- .embl : annotations & sequences in (multi) EMBL format
- .fna : replicon/contig DNA sequences as FASTA
- .ffn : feature nucleotide sequences as FASTA
- .faa : CDS/sORF amino acid sequences as FASTA
- .hypotheticals.tsv : further information on hypothetical protein CDS as simple human readable tab separated values
- .hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA
- .json : all (internal) annotation & sequence information as JSON
- .txt : summary as TXT
- .png : circular genome annotation plot as PNG
  - These are only useful for complete/near complete circular genomes
  - I would suggest looking at GenoVi for circular genome plots
- .svg : circular genome annotation plot as SVG
- .log : Log file of command
View the summary file for bin K1.1.
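One way to do this, assuming the output sits in a directory called K1.1 inside the current directory and the summary is named K1.1.txt as described above. The existence check is an addition for illustration.

```shell
# Path assumed from the directory and prefix described above.
summary="K1.1/K1.1.txt"

if [ -f "$summary" ]; then
    # Page through the summary; press q to quit.
    less "$summary"
else
    echo "$summary not found; check that you are in the right directory." >&2
fi
```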
Sequence information
- Length: Number of bases
- Count: Number of contigs/scaffolds
- GC: GC%
- N50: N50
- N ratio: Ratio of N bases to non-N bases
- coding density: Percentage of bases within coding regions
Annotation information
- tRNAs: Transfer RNAs
- tmRNAs: Transfer-messenger RNAs
- rRNAs: Ribosomal RNAs
- ncRNAs: Non-coding RNAs
- ncRNA regions: Non-coding RNA regions
- CRISPR arrays: CRISPR arrays
- CDSs: Coding sequences
- pseudogenes: Segments of DNA that structurally resemble a gene but are not capable of coding for a protein
- hypotheticals: Hypothetical genes, i.e. genes predicted solely by computer algorithms that have not been experimentally characterized
- signal peptides: Short peptides (usually 16-30 amino acids long) present at the N-terminus of most newly synthesized proteins that are destined for the secretory pathway
- sORFs: Short open reading frames (<100 amino acids)
- gaps: Gaps in the genome assembly
- oriCs: Chromosome replication origin for bacteria
- oriVs: Plasmid replication origin
- oriTs: Origins of transfer. An origin of transfer (oriT) is a short sequence, 40-500 base pairs in length, that is necessary for the transfer of DNA from a gram-negative bacterial donor to a recipient during bacterial conjugation
View the gff file for bin K1.1.
The GFF file is a tab-delimited file containing annotation information for the features in the assembly/bin. In this case it is a GFF3 file (the most current version of GFF).
There is quite a lot of information contained in each row so instead of listing all the columns here please have a look at the official documentation:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
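To give a feel for the column layout, the sketch below counts how many features of each type a GFF3 file contains. The feature type is the third column in the GFF3 specification. So that the example runs on its own, it first writes a tiny made-up GFF3 file; the contig name, IDs and products in it are invented for illustration and are not real Bakta output.

```shell
# Build a tiny, made-up GFF3 file so the example is self-contained.
tmp_dir="$(mktemp -d)"
gff="$tmp_dir/toy.gff3"
{
    printf '##gff-version 3\n'
    printf 'contig_1\ttoy\tCDS\t10\t400\t.\t+\t0\tID=toy_1;product=hypothetical protein\n'
    printf 'contig_1\ttoy\tCDS\t500\t900\t.\t-\t0\tID=toy_2;product=ATP-binding protein\n'
    printf 'contig_1\ttoy\ttRNA\t950\t1020\t.\t+\t.\tID=toy_3;product=tRNA-Ala\n'
    printf '##FASTA\n>contig_1\nACGT\n'
} > "$gff"

# Stop at the embedded FASTA section, skip comment lines,
# and tally column 3 (the feature type).
feature_counts="$(awk -F'\t' '/^##FASTA/ {exit} !/^#/ {print $3}' "$gff" | sort | uniq -c)"
echo "$feature_counts"
```

For a real bin, replace the toy file with the path to its GFF3 file, for example K1.1/K1.1.gff3.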
The code below is for your future analyses. Do not run it now, as it will take too long.
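A minimal sketch of annotating every bin in one go, under some assumptions: each bin is a single FASTA file ending in .fa, all bins sit in one directory, and one output directory per bin is created in the current directory. The function name `annotate_bins` and the paths in the example call are hypothetical.

```shell
# Annotate every FASTA file in a directory of bins, one output directory per bin.
# Usage: annotate_bins <bakta_db_dir> <bins_dir>
annotate_bins() {
    db_dir="$1"
    bins_dir="$2"
    for genome in "$bins_dir"/*.fa; do
        # If the glob matched nothing, skip the literal pattern.
        [ -e "$genome" ] || continue
        # Name the output directory after the bin, e.g. K1.1.fa -> K1.1.
        bin_name="$(basename "$genome" .fa)"
        if [ -e "$bin_name" ]; then
            echo "Skipping $bin_name: output directory already exists." >&2
            continue
        fi
        bakta --db "$db_dir" -o "$bin_name" "$genome"
    done
}

# Example call with placeholder paths (uncomment and adapt):
# annotate_bins /path/to/bakta_db /path/to/bins
```

Bins are processed one after another, so the total run time grows with the number of bins.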
We can quickly check whether any of the bins contain a specific annotation. For example, to find out whether any of the bins contain ATP-binding proteins, we could run the command below:
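One possible way to run such a search, assuming each bin has its own output directory containing a .gff3 file. The function name `search_annotations` and the output file name in the example are hypothetical.

```shell
# Search the GFF3 file of every bin for an annotation string.
# Usage: search_annotations <text> [directory holding the per-bin output directories]
search_annotations() {
    pattern="$1"
    results_dir="${2:-.}"
    # -F treats the pattern as plain text rather than a regular expression.
    # -H prefixes each matching line with its file name, even for a single file.
    grep -H -F -- "$pattern" "$results_dir"/*/*.gff3
}

# Example (uncomment and adapt): save the matches to a file for viewing.
# search_annotations "ATP-binding protein" > atp_binding_proteins.txt
```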
We can now view the lines containing "ATP-binding protein". Each line starts with the name of the file it came from.
In your future analyses you can inspect these files further with Excel, R, or visualisation software like IGV (https://software.broadinstitute.org/software/igv/GFF).
What if you want to know about pathways?