Chapter 9 Kraken2
Prior to running Kraken2 we need to set a variable so Kraken2 knows where to look for the databases it will use.
Note: You can look at the contents of the above directory to see it currently contains the MiniKraken database. This database contains only a subset of the bacteria, archaea, and viral Kraken2 libraries. This is used in this practical due to restrictions on time and computational resources. For your own analyses we would recommend the full Kraken2 database which uses all the bacteria, achaeal and viral complete genomes that are in Refseq at the time of building. See the following links for info on installing the databases.
- Standard
Kraken2databases: https://github.com/DerrickWood/kraken2/wiki/Manual#standard-kraken-2-database - Custom
Kraken2databases: https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases
9.1 Kraken2: run
Now, run Kraken2 on sample K1 by running the following command.
Note: We are not using the host removed data. This is to save time. In your own analysis ensure you are using host removed data.
kraken2 --paired --db minikraken2_v1_8GB \
--output K1.kraken --report K1.kreport2 \
~/2-Trimmed/K1_R1.fq.gz ~/2-Trimmed/K1_R2.fq.gzParameters
--paired: Indicates that we are providing paired reads toKraken2. Internally,Kraken2will concatenate the R1 and R2 reads into one sequence with an N between them.--db: Specify theKraken2database to be used for taxonomic classification.- Previous to the command we set the
KRAKEN2_DB_PATHso in this case the command will look for the directory calledminikraken2_v1_8GBwithinKRAKEN2_DB_PATH. - Alternatively the full path of the required database could be provided (
/pub14/tea/nsc206/NEOF/Shotgun_metagenomics/kraken2_db/minikraken2_v1_8GB).
- Previous to the command we set the
--output: The output file. More info below.--report: The output report file. More info below.~/2-Trimmed/K1_R1.fq.gz ~/2-Trimmed/K1_R2.fq.gz: The trimmed read pairs for K1, which we will use as input.
9.2 Kraken2: output
There are two major output formats from Kraken2.
Output file
The --output parameter creates a .kraken file. Each sequence (or sequence pair, in the case of paired reads) classified by Kraken2 results in a single line of output. Kraken2's output lines contain five tab-delimited fields; from left to right, they are
- "C"/"U": a one letter code indicating that the sequence was either classified or unclassified.
- Sequence ID: Obtained from the FASTA/FASTQ header.
- Taxonomy ID: Assigned by
Kraken2. This is 0 if the sequence is unclassified. - Length of the sequence in bp: In the case of paired read data, this will be a string containing the lengths of the two sequences in bp, separated by a pipe character, e.g. "98|94".
- LCA mapping: A space-delimited list indicating the LCA (also known as MRCA) mapping of each k-mer in the sequence(s). For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
- 562:13 - The first 13 k-mers mapped to taxonomy ID #562
- 561:4 - The next 4 k-mers mapped to taxonomy ID #561
- A:31 - The next 31 k-mers contained an ambiguous nucleotide
- 0:1 - The next k-mer was not in the database
- 562:3 - The last 3 k-mers mapped to taxonomy ID #562
- Note: that paired read data will contain a "|:|" token in this list to indicate the end of one read and the beginning of another.
Report file
The --report parameter creates a .kreport2 file. This is the report output format. This is required for bracken. It is tab-delimited with one line per taxon. The fields of the output, from left-to-right, are as follows:
- Percentage of paired reads covered by the clade rooted at this taxon.
- Number of paired reads covered by the clade rooted at this taxon.
- Number of paired reads assigned directly to this taxon.
- A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
- NCBI taxonomic ID number
- Indented scientific name
Screen output
The output to screen will show how many sequences are classified. This will be lower than normal as we are using a mini Kraken2 database.
Confidence threshold
In a real analysis you may use the option --confidence which represents the "Confidence score threshold". The default is 0.0, which is the lowest, with the maximum value being 1. A good place to start may be 0.1. Too many classifications are removed if you attempt it with this dataset, due to the mini Kraken2 database used. More info on the confidence scoring can be found at: https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring
9.3 Kraken 2: MCQs
Viewing the Kraken2 output files with your favourite text viewer (less, nano, vim, etc.), attempt the below MCQs.
- How many paired reads were unclassified for K1?
- How many paired reads were classified for K2 (i.e. number of reads classified at root level and below)?
- How many paired reads were assigned directly to root level for W1?
- What percentage of W1's paired reads were assigned to the clade of Bacteroidetes (Phylum)?
- What percentage of K2's paired reads were assigned to the clade of Rikenellaceae (Family)?
- What percentage of K1's paired reads were assigned to the clade of Bacteroides helcogenes (Species)?