Chapter 11 Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) uses taxonomy labels assigned by Kraken2 to compute estimated abundances of species in a metagenomic sample.

11.1 Bracken: run

Just like with Krona we can use the Kraken2 report files to run bracken.

bracken -d $KRAKEN2_DB_PATH/minikraken2_v1_8GB \
-i K1.kreport2 -o K1.bracken -w K1.breport2 -r 100 -l S -t 5

Parameters

  • -d : Specifies the Kraken2 database that was used for taxonomic classification. In this case bracken requires the variable $KRAKEN2_DB_PATH so the option is provided the full path to the kraken database.
    • For clarity try the command ls $KRAKEN2_DB_PATH/minikraken2_v1_8GB.
  • -i : The Kraken2 report file, this will be used as the input.
  • -o : The output Bracken file. Information about its contents is below.
  • -w: Output report file. This contains the Bracken read counts in a kraken-style report. This is an essential file if you want to use the Bracken output in R using the phyloseq object. This is covered in our R community analysis workshop. We won't cover it more here.
  • -r 100: This is the ideal length of the reads that were used in the Kraken2 classification. It is recommended that the initial read length of the sequencing data is used. We are using 100 here as we used a paired library of 100bp*2 reads.
  • -l S: This specifies the taxonomic level/rank of the Bracken output. In this case S is equal to species with the other options being D, P, C, O, F and G.
  • -t 5: This specifies the minimum number of reads required for a classification at the specified rank. Any classifications with fewer reads than the specified threshold will not receive additional reads from higher taxonomy levels when distributing reads for abundance estimation. Five has been chosen here for this example data but in real datasets you may want to increase this number (default is 10).

11.2 Bracken: output

The output file of Bracken contains the following columns:

  1. Name: Name of taxonomy at the specified taxonomic level.
  2. Taxonomy ID: NCBI taxonomy id
  3. Level ID: Letter signifying the taxonomic level of the classification
  4. Kraken assigned read: Number of reads assigned to the taxonomy by Kraken2
  5. Added reads with abundance reestimation: Number of reads added to the taxonomy by Bracken abundance reestimation.
  6. Total reads after abundance reestimation: Number from field 4 and 5 summed. This is the field that will be used for downstream analysis
  7. Fraction of total reads: Relative abundance of the taxonomy

Task

Repeat the above commands for K2 and W1

#K2
bracken -d $KRAKEN2_DB_PATH/minikraken2_v1_8GB \
-i K2.kreport2 -o K2.bracken -r 100 -l S -t 5
#W1
bracken -d $KRAKEN2_DB_PATH/minikraken2_v1_8GB \
-i W1.kreport2 -o W1.bracken -r 100 -l S -t 5

11.3 Bracken: MCQs

Viewing the Bracken output files (.bracken) with your favourite text viewer (less, nano, vim, etc.), attempt the below MCQs.

  1. In K1, how many total reads after abundance reestimation are there for Prevotella fusca?
  2. In K2, how many reads after abundance reestimation were added for Bacteroides caccae?
  3. In W1, what is the fraction of total reads (after abundance reestimation) for Tannerella forsythia?

11.4 Bracken: merging output

To make full use of Bracken output, it is best to merge the output into one table. Before we do this we’ll copy the Bracken output of other samples that have been generated prior to the workshop. These are all either Korean or Western Diet samples.

cp /pub14/tea/nsc206/NEOF/Shotgun_metagenomics/bracken/* .

Now to merge all the K and W Bracken files.

combine_bracken_outputs.py --files [KW]*.bracken -o all.bracken

This output file contains the first three columns:

  • name = Organism group name. This will be based on the TAX_LVL chosen in the Bracken command and will only show the one level.
  • taxonomy_id = Taxonomy id number.
  • taxonomy_lvl = A single string indicating the taxonomy level of the group. ('D','P','C','O','F','G','S').

After these columns are the following two columns for each sample.

  • ${SampleName}.bracken_num: The number of reads after abundance reestimation
  • ${SampleName}.bracken_frac: Relative abundance of the group in the sample

11.5 Bracken: extracting output

We want a file with only the first column (organism name) and the bracken_num columns for each sample. To carry this out we first create a sequence of numbers that will match the bracken_num column numbers. These start at column 4 and are every even numbered column after this. We will use seq to create a sequence of numbers starting at 4 and including every second (2) number up to and including 50 with commas (,) as separators (-s).

Note: The number 50 is chosen as 3 (first three info columns) + 24*2 (24 samples with 2 columns each) = 50.

#Try out the seq command to see its output
seq -s , 4 2 50
#Create variable
bracken_num_columns=$(seq -s , 4 2 50)
echo $bracken_num_columns

Now to use the variable to extract the bracken_num columns plus the first column (species names).

cat all.bracken | cut -f 1,$bracken_num_columns > all_num.bracken