Chapter 5 Taxonomizr

Taxonomizr is an R package which can be used to parse BLAST result files in order to assign taxonomy to NCBI accession numbers or taxonomic IDs. It has the ability find agreement between the taxonomic hits using a Lowest Common Ancestor (LCA) algorithm.

The LCA considers all significant BLAST hits and, if these are to reference sequences from multiple species, assigns the taxonomy of the node that lies above all of those species. For example, if an ASV matched equally well to reference sequences of Neolamprologus longior and Neolamprologus gracilis it would be assigned to the genus level as Neolamprologus sp. Similarly, if an ASV matched equally well to reference sequences of Neolamprologus longior and Ophthalmotilapia ventralis it would be assigned at the family level to Cichlidae as both of these species are within that family.

Taxonomizr includes a function for downloading the latest taxonomy data from NCBI and uses this to match NCBI accession numbers (the second column in our blast output table) to taxonomic classes. As these files are large this is not intended to be run on the web VNC, but the workflow is provided in case it is useful for your own data.

5.1 Filtering BLAST results

We already restricted the BLAST hits in our output file by specifying a maximum E-value threshold. It is also useful to further filter out sequences with a low percentage identity or a short alignment length as these are likely to be spurious matches to our ASVs. We can do this easily on the linux command line using an awk command (awk is a programming language useful for text and file processing). Let's filter to keep only hits with a percent identity of at least 90% and an alignment length of at least 100 bp.

awk '$3 >= 90' MiFish_ASVs_blast_out.txt | \
    awk '$4 >= 100' > MiFish_ASVs_blast_out_filtered.txt

5.2 Importing BLAST results into R and taxonomizr

First we need to download the taxonomizr accession file (remember do not run this on the web VNC this is just for your own analysis).

# load the taxonomizr library
library(taxonomizr)

# download the taxonomizr accession file 
prepareDatabase('accessionTaxa.sql')

This command will check to see if you have the accession file already downloaded, it will download it if it does not find the file. You should only need to run this once when running your analysis. If you wanted to update the accession file, if for example you are analysing a new set of data some months later then you will first have to delete/rename the old file and run this command again.

We now need to read in our blast file into R.

# Read in the filtered blast result file
blastResults<-read.table(file = "MiFish_ASVs_blast_out_filtered.txt", header=FALSE, stringsAsFactors=FALSE)

We can now optionally choose to filter the blast results to keep only the top user specified percent of blast hits (suggested values: 1-10).

# Filter file by top percent value of bit score as specified in topperc (e.g. 2%)
# Note that the top percent value is converted to a fraction and subtracted from 1:
# thus a top percent value of 2 becomes 0.98 

topperc<-2

topPercent <- data.frame(blastResults %>% group_by(V1) %>% 
  arrange(V1, desc(V12)) %>% 
  filter(V12 > max(V12*(1 - (topperc/100)))))

We will then extract the ASVs and accession number columns from our top percent filtered blast results.

# Split the top percent values in to ASVs and accessions
asvs<-topPercent[,1]
accessions<-topPercent[,2]

We now have everything we need to run our taxonomizr taxonomy assignment. First we assign taxonomic IDs to our NCBI accession numbers using the accessionToTaxa function. We can then use the function getTaxonomy to appoint the taxonomy for those IDs.

# Get the taxonomic information for top percent accessions from accessionTaxa.sql
taxaId<-accessionToTaxa(accessions,"./../accessionTaxa.sql")
taxa<-getTaxonomy(taxaId,'./../accessionTaxa.sql', desiredTaxa=c("kingdom", "phylum", "class", "order", "family", "genus", "species"))

To generate an agreement among our taxonomy assignments to each ASV we can use the function condenseTaxa. This will collapse the taxonomy assignments to the lowest taxonomic rank shared by all the blast hits.

# LCA algorithm and merge back to top percent ASVs
taxon_path <- condenseTaxa(taxa,asvs)

Finally we can write our taxonomic information for each ASV to an output file.

# Write output
write.table("taxonomizr_taxon_path.tsv", sep="\t", quote = FALSE)