Chapter 9 Kraken2

Prior to running Kraken2 we need to set a variable so Kraken2 knows where to look for the databases it will use.

export KRAKEN2_DB_PATH=/pub14/tea/nsc206/NEOF/Shotgun_metagenomics/kraken2_db

Note: You can look at the contents of the above directory to see it currently contains the MiniKraken database. This database contains only a subset of the bacteria, archaea, and viral Kraken2 libraries. This is used in this practical due to restrictions on time and computational resources. For your own analyses we would recommend the full Kraken2 database which uses all the bacteria, achaeal and viral complete genomes that are in Refseq at the time of building. See the following links for info on installing the databases.

9.1 Kraken2: run

Now, run Kraken2 on sample K1 by running the following command.

Note: We are not using the host removed data. This is to save time. In your own analysis ensure you are using host removed data.

kraken2 --paired --db minikraken2_v1_8GB \
--output K1.kraken --report K1.kreport2 \
~/2-Trimmed/K1_R1.fq.gz ~/2-Trimmed/K1_R2.fq.gz

Parameters

  • --paired: Indicates that we are providing paired reads to Kraken2. Internally, Kraken2 will concatenate the R1 and R2 reads into one sequence with an N between them.
  • --db: Specify the Kraken2 database to be used for taxonomic classification.
    • Previous to the command we set the KRAKEN2_DB_PATH so in this case the command will look for the directory called minikraken2_v1_8GB within KRAKEN2_DB_PATH.
    • Alternatively the full path of the required database could be provided (/pub14/tea/nsc206/NEOF/Shotgun_metagenomics/kraken2_db/minikraken2_v1_8GB).
  • --output: The output file. More info below.
  • --report: The output report file. More info below.
  • ~/2-Trimmed/K1_R1.fq.gz ~/2-Trimmed/K1_R2.fq.gz: The trimmed read pairs for K1, which we will use as input.

9.2 Kraken2: output

There are two major output formats from Kraken2.

Output file

The --output parameter creates a .kraken file. Each sequence (or sequence pair, in the case of paired reads) classified by Kraken2 results in a single line of output. Kraken2's output lines contain five tab-delimited fields; from left to right, they are

  1. "C"/"U": a one letter code indicating that the sequence was either classified or unclassified.
  2. Sequence ID: Obtained from the FASTA/FASTQ header.
  3. Taxonomy ID: Assigned by Kraken2. This is 0 if the sequence is unclassified.
  4. Length of the sequence in bp: In the case of paired read data, this will be a string containing the lengths of the two sequences in bp, separated by a pipe character, e.g. "98|94".
  5. LCA mapping: A space-delimited list indicating the LCA (also known as MRCA) mapping of each k-mer in the sequence(s). For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
    • 562:13 - The first 13 k-mers mapped to taxonomy ID #562
    • 561:4 - The next 4 k-mers mapped to taxonomy ID #561
    • A:31 - The next 31 k-mers contained an ambiguous nucleotide
    • 0:1 - The next k-mer was not in the database
    • 562:3 - The last 3 k-mers mapped to taxonomy ID #562
    • Note: that paired read data will contain a "|:|" token in this list to indicate the end of one read and the beginning of another.

Report file

The --report parameter creates a .kreport2 file. This is the report output format. This is required for bracken. It is tab-delimited with one line per taxon. The fields of the output, from left-to-right, are as follows:

  1. Percentage of paired reads covered by the clade rooted at this taxon.
  2. Number of paired reads covered by the clade rooted at this taxon.
  3. Number of paired reads assigned directly to this taxon.
  4. A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
  5. NCBI taxonomic ID number
  6. Indented scientific name

Screen output

The output to screen will show how many sequences are classified. This will be lower than normal as we are using a mini Kraken2 database.

Confidence threshold

In a real analysis you may use the option --confidence which represents the "Confidence score threshold". The default is 0.0, which is the lowest, with the maximum value being 1. A good place to start may be 0.1. Too many classifications are removed if you attempt it with this dataset, due to the mini Kraken2 database used. More info on the confidence scoring can be found at: https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring

Task

Once the Kraken2 command has finished running, run it on the other two samples. Attempt the commands without looking at the help box.

Hint: You will need to change all instances of K1 to K2 or W1 in the above command

#K2
kraken2 --paired --db minikraken2_v1_8GB \
--output K2.kraken --report K2.kreport2 \
~/2-Trimmed/K2_R1.fq.gz ~/2-Trimmed/K2_R2.fq.gz
#W1
kraken2 --paired --db minikraken2_v1_8GB \
--output W1.kraken --report W1.kreport2 \
~/2-Trimmed/W1_R1.fq.gz ~/2-Trimmed/W1_R2.fq.gz

9.3 Kraken 2: MCQs

Viewing the Kraken2 output files with your favourite text viewer (less, nano, vim, etc.), attempt the below MCQs.

  1. How many paired reads were unclassified for K1?
  2. How many paired reads were classified for K2 (i.e. number of reads classified at root level and below)?
  3. How many paired reads were assigned directly to root level for W1?
  4. What percentage of W1's paired reads were assigned to the clade of Bacteroidetes (Phylum)?
  5. What percentage of K2's paired reads were assigned to the clade of Rikenellaceae (Family)?
  6. What percentage of K1's paired reads were assigned to the clade of Bacteroides helcogenes (Species)?