Chapter 9 Kraken2
Kraken2 assigns taxonomic sequence classifications to DNA sequences.
It carries this out by comparing the k-mers of each DNA sequence to the k-mers in a genomic library.
Each k-mer in the genomic library has a taxonomy assigned to it based on the LCA (lowest common ancestor) of all the known lineages that have the k-mer in their genome.
One DNA sequence will have many k-mers, each of those k-mers having a classification. A single consensus taxonomic classification will be assigned to the DNA sequence. This could be at species level or higher. If the classification is at genus level then the species will be unknown.
We use and suggest Kraken2 as a method to assign taxonomies to shotgun metagenomic reads for the following reasons:
- It is computationally efficient, being incredibly quick.
- It can be used for DNA sequences of any size including short illumina read, longer PacBio or ONT reads, and even assembled contigs.
- Custom databases can be used so you can in theory classify any organism whose genome has been sufficiently categorised by genomic sequencing.
Brackencan be used downstream to produce taxonomy abundances that can be used for community analysis.
For more information please see the Kraken2 paper:
Improved metagenomic analysis with Kraken 2
9.1 Set Kraken2 database
Prior to running Kraken2 we need to set a variable so Kraken2 knows where to look for the databases it will use.
Note: You can look at the contents of the above directory to see it currently contains the MiniKraken database. This database contains only a subset of the bacteria, archaea, and viral Kraken2 libraries. This is used in this practical due to restrictions on time and computational resources. For your own analyses we would recommend the full Kraken2 database which uses all the bacteria, achaeal and viral complete genomes that are in Refseq at the time of building. See the following links for info on installing the databases.
- Standard
Kraken2databases: https://github.com/DerrickWood/kraken2/wiki/Manual#standard-kraken-2-database - Custom
Kraken2databases: https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases
9.2 Classify reads
Now, run Kraken2 on sample K1 by running the following command.
Note: In this tutorial we are not using the host removed data. This is to save time during the workshop. In your own analysis ensure you are using host removed data if your data is host derived.
kraken2 --paired --db minikraken2_v1_8GB \
--output K1.kraken --report K1.kreport2 \
~/2-Trimmed/K1_R1.fq.gz ~/2-Trimmed/K1_R2.fq.gzParameters
--paired: Indicates that we are providing paired reads toKraken2. Internally,Kraken2will concatenate the R1 and R2 reads into one sequence with an N between them.--db: Specify theKraken2database to be used for taxonomic classification.- Previous to the command we set the
KRAKEN2_DB_PATHso in this case the command will look for the directory calledminikraken2_v1_8GBwithinKRAKEN2_DB_PATH. - Alternatively the full path of the required database could be provided (
/pub14/tea/nsc206/NEOF/Shotgun_metagenomics/kraken2_db/minikraken2_v1_8GB).
- Previous to the command we set the
--output: The output file. More info below.--report: The output report file. More info below.~/2-Trimmed/K1_R1.fq.gz ~/2-Trimmed/K1_R2.fq.gz: The trimmed read pairs for K1, which we will use as input.
9.3 Output and report files
There are two major output formats from Kraken2.
9.3.1 Output file
The --output parameter creates a .kraken file. Each sequence (or sequence pair, in the case of paired reads) classified by Kraken2 results in a single line of output. Kraken2's output lines contain five tab-delimited fields; from left to right, they are
- "C"/"U": a one letter code indicating that the sequence was either classified or unclassified.
- Sequence ID: Obtained from the FASTA/FASTQ header.
- Taxonomy ID: Assigned by
Kraken2. This is 0 if the sequence is unclassified. - Length of the sequence in bp: In the case of paired read data, this will be a string containing the lengths of the two sequences in bp, separated by a pipe character, e.g. "98|94".
- LCA mapping: A space-delimited list indicating the LCA (also known as MRCA) mapping of each k-mer in the sequence(s). For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
- 562:13 - The first 13 k-mers mapped to taxonomy ID #562
- 561:4 - The next 4 k-mers mapped to taxonomy ID #561
- A:31 - The next 31 k-mers contained an ambiguous nucleotide
- 0:1 - The next k-mer was not in the database
- 562:3 - The last 3 k-mers mapped to taxonomy ID #562
- Note: that paired read data will contain a "|:|" token in this list to indicate the end of one read and the beginning of another.
9.3.2 Report file
The --report parameter creates a .kreport2 file.
This is the report output format and is required for bracken.
It is tab-delimited with one line per taxon.
The fields of the output, from left-to-right, are as follows:
- Percentage of paired reads covered by the clade rooted at this taxon.
- Number of paired reads covered by the clade rooted at this taxon.
- Number of paired reads assigned directly to this taxon.
- A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
- NCBI taxonomic ID number
- Indented scientific name
9.4 Confidence threshold
In a real analysis you may use the option --confidence which represents the "Confidence score threshold". The default is 0.0, which is the lowest, with the maximum value being 1. A good place to start may be 0.1. Too many classifications are removed if you attempt it with this dataset, due to the mini Kraken2 database used. More info on the confidence scoring can be found at: https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring
9.5 Kraken2 task
Once the Kraken2 command has finished running, run it on the other two samples. Attempt the commands without looking at the help box.
Hint: You will need to change all instances of K1 to K2 or W1 in the above command
9.6 MCQs
Viewing the Kraken2 output files with your favourite text viewer (less, nano, vim, etc.), attempt the below MCQs.
- How many paired reads were unclassified for K1?
- How many paired reads were classified for K2 (i.e. number of reads classified at root level and below)?
- How many paired reads were assigned directly to root level for W1?
- What percentage of W1's paired reads were assigned to the clade of Bacteroidetes (Phylum)?
- What percentage of K2's paired reads were assigned to the clade of Rikenellaceae (Family)?
- What percentage of K1's paired reads were assigned to the clade of Bacteroides helcogenes (Species)?