Chapter 6 More BLAST and taxonomizr
6.1 Introduction to blast2taxonomizr
We may want to scale up the approaches described in Chapters 4 and 5 to allow us to process larger data sets more rapidly. In this chapter we explain two ways you could do this:
- Using Slurm job arrays to simultaneously blast a split fasta file
- Running taxonomizr and formatting the output to provide LCA information
We run through some hypothetical examples and draw your attention to some of the key bits of code. This workflow is based on an automated command line workflow called blast2taxonomizr which is designed for use on a specific HPC (the University of Sheffield's Bessemer cluster), but the scripts associated with blast2taxonomizr can be easily modified for use on your own cluster.
These steps are not intended to be run on the web VNC, but the workflow is provided in case it is useful for your own data.
6.2 Running blast in array mode
NCBI nt is an enormous database containing tens of millions of sequences. If you have a long list of ASVs, 'blasting' hundreds or thousands of sequences against nt can take many hours or even days. To speed things up we can use the Slurm workload manager (used by many HPCs) to split blast into an array job and run several blast searches simultaneously. Many HPCs provide their own information about how to use slurm effectively (see for example the University of Sheffield's HPC documentation on advanced job submission with slurm).
Below we outline the main steps required to use a slurm job array to process a large number of sequences:
6.2.1 Split the input file
Say we have a fasta file 'asv.fasta' with 1000 sequences, and we want to split it into 10 smaller files (with 100 sequences each) that can be run simultaneously. We can use awk to split the file into chunks like so:
### use awk to split a fasta file
awk 'BEGIN {n=0;} /^>/ {if(n%100==0){file=sprintf("chunk%d.fa",n);} print > file; n++; next;} { print >> file; }' < asv.fasta
You don't need to fully understand how this is splitting the file, but it's worth noting a couple of things:
- /^>/: the characters between the two forward slashes are what we are telling awk to look for (what is known as a regular expression) - in this case a '>' at the start of a line (used by fasta files to denote the sequence name and description),
- if(n%100==0): the '100' is how many sequences will be included in each chunk (you can set it to any number),
- file=sprintf("chunk%d.fa",n): this is how we define how the output files will be named. In the example it is set to 'chunk%d.fa', where '%d' is the number of sequences counted so far (starting at 0). So our chunk files will be called 'chunk0.fa', 'chunk100.fa', all the way up to 'chunk900.fa'.
6.2.2 Make a list of the subfiles
Now we will make a text file containing a list of all the chunks we want to run on the job array. This is easily achieved with the ls command. It's also worth double checking how many chunk files there are in the list, as we will need that information in the following steps:
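### list the chunk files, then count how many there are
ls chunk*.fa > list_of_chunks.txt
wc -l list_of_chunks.txt
This assumes the 'chunk*.fa' files are sitting in your current directory; the number reported by wc -l is the number of array jobs we will ask for later.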
6.2.3 Making the array script
Now that we have split our fasta file up, we can move on to the array. First, we will have to make a simple bash script that reads in the information from the 'list_of_chunks.txt' file and uses it as an input for blast.
Below are the main commands you would need to include:
# create a variable FASTA corresponding to the information on a numbered line of list_of_chunks.txt
FASTA=$(sed "${SLURM_ARRAY_TASK_ID}q;d" <(cat list_of_chunks.txt))
# Run blastn using the newly defined FASTA variable. Make sure the path to the nt database is correct!
blastn -query ${FASTA} \
-task blastn \
-db /path/to/ncbi/nt \
-out ${FASTA}_blast.out.tab \
-outfmt 6
When the job array is submitted this script will be run as many times as specified (it will be 10 times in our example). For each numbered array job (1 to 10):
- FASTA=$(...): a new variable called 'FASTA' is created based on some information we will provide,
- sed "${SLURM_ARRAY_TASK_ID}q;d": it is set equal to the information on a single line (whose number is determined by which array job is running),
- <(cat list_of_chunks.txt): this file is the source of the information.
Thus for array job 1 FASTA takes the value of line 1 in 'list_of_chunks.txt' (line 1 reads 'chunk0.fa'), while in array job 2 FASTA takes the value of line 2 in 'list_of_chunks.txt' (chunk100.fa). For each job in the array, the unique value of variable ${FASTA} is then used when calling blastn.
If you plan to pass the blast results to taxonomizr, we recommend altering the -outfmt argument slightly (see section 6.3.1 Preparing Blast for taxonomizr for more details).
When making a script remember that you will have to include the shebang and set various other sbatch options. If blast is available through a conda environment you will have to load this as well. Once this is done, you can save the script. We could call it something like 'blast_array.sh'. Depending on how permissions are applied in your HPC you may have to make the script executable using the chmod command (see this explainer on chmod). A script file can be made executable by running:
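### give the file's owner execute permission (using the example script name from above)
chmod u+x blast_array.sh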
6.2.4 Running the array script
Finally we can submit the job as an array to slurm. Notice that we have to specify how many jobs our array is composed of. This must be equal to the number of lines in 'list_of_chunks.txt'.
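Assuming the script above was saved as 'blast_array.sh' and we have 10 chunks, the submission might look like this:
### submit the script as a 10-task array (tasks numbered 1 to 10, one per line of list_of_chunks.txt)
sbatch --array=1-10 blast_array.sh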
Once the job has finished, you should have 10 'chunk%d.fa_blast.out.tab' files containing the blast results. The final step is to concatenate these files together with the cat command:
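### combine the individual blast result files (the name of the combined file is just an example)
cat chunk*.fa_blast.out.tab > all_blast.out.tab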
6.2.5 Arrays in blast2taxonomizr
The blast array in blast2taxonomizr makes use of two scripts, a preparatory script that splits the input and makes a list of files (01A_run_prep_for_blast.sh), and then the array itself (01B_run_blastn_array.sh). These scripts do a few additional things compared with the steps listed above, such as making sub-directories to store the 'chunk%d.fa' input files and 'chunk*.fa_blast.out.tab' output files. Please take a look at the readme and have a look inside the scripts themselves for more information.
6.3 Taxonomizr
In Chapter 5 we provided information about running taxonomizr interactively in R, but it is also possible to install R and taxonomizr on your cluster and automate the procedure of finding the Lowest Common Ancestor (LCA). Please speak to your HPC administrator for advice on installing new software.
6.3.1 Preparing Blast for taxonomizr
Taxonomizr requires a blast tab delimited file as input (-outfmt 6). However, it also expects the ncbi subject sequence id (column 2 in the blast tab file) to be a simple accession number (e.g. 'MT539258.1'), yet by default, the output will contain additional gene identifier information (e.g. 'gi|1846592020|gb|MT539258.1|'). Although this information can be filtered out afterwards (e.g. using awk), you can also amend the -outfmt argument so that only the accession is included:
# Run blastn with a custom list of column headers
blastn -query ${FASTA} \
-task blastn \
-db /path/to/ncbi/nt \
-out ${FASTA}_blast.out.tab \
-outfmt "6 qseqid saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen staxid ssciname scomnames sblastname sskingdoms stitle"In this example of blast we specify exactly which column headers to include. Columns 1 and 3-12 are the default values, but we have used saccver (subject accession version) rather than sseqid (subject sequence ID) in column 2 to be compatible with taxonomizr. This handy Guide to blastn output format 6 has more information about the default and optional columns in blast output format 6. It is possible to specify additional columns such as sscinames and sskingdoms which will provide the scientific names and the kingdom of the ncbi subject sequence (as long as the ncbi taxa database files 'taxdb.btd' and 'taxdb.bti' are present in the current working directory). This taxonomic information can provide a useful sanity check when looking at blast results.
6.3.2 Running taxonomizr
Once we have a filtered blast output file prepared, we can use taxonomizr (as in the previous chapter). In this case we use a bash script to prepare and/or filter our input file before passing the prepared file to the taxonomizr R script.
The script expects either the blast.out.tab file created using the command above or a series of 'chunk*.fa_blast.out.tab' output files as created in the blast array method. In the case that the array script was used, these will be combined using 'cat'.
As explained in Chapter 5.1, it is also useful to filter blast results (using awk) for a minimum percentage identity and a minimum length before running taxonomizr. It is also recommended to set a minimum percentage identity threshold and a top percent value for the taxonomizr LCA algorithm to improve taxonomic assignment.
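As a rough sketch, assuming the combined blast results are in a file called 'all_blast.out.tab' (the thresholds below are only examples and should be adjusted to suit your marker and data):
### keep hits with at least 97% identity (column 3) and an alignment length of at least 100 (column 4)
awk -F'\t' '($3 >= 97) && ($4 >= 100)' all_blast.out.tab > blast_filtered.out.tab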
If there are any blast hits you want to exclude from your blast results you can also remove them at this stage. For example, to remove any 'uncultured eukaryote' hits:
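### drop any rows mentioning 'uncultured eukaryote' (this relies on the stitle column being present in the blast output)
grep -v "uncultured eukaryote" blast_filtered.out.tab > blast_filtered_clean.out.tab
The file names here simply follow on from the hypothetical filtering sketch above.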
After you have filtered your blast output file as desired, you can run your taxonomizr R script from within your bash script, e.g.:
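### call the taxonomizr R script (the script name here is just a placeholder for your own script)
Rscript taxonomizr_LCA.R
How the R script receives the filtered blast file (as a command line argument or a hard-coded path) depends on how you have written it.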
As explained in Section 6.2.3 Making the array script, you must include certain information in your bash script like the shebang, and it is useful to provide other sbatch options. You may have to make the script executable with chmod. Once your script is saved and executable, run it like so:
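### submit the wrapper script to slurm (replace the name with whatever you called your script)
sbatch run_taxonomizr.sh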
6.4 Combining output with DADA2 output
The final bash script in the blast2taxonomizr pipeline is '03_run_make_summary_files.sh'. This script is simply a wrapper for an R script called '03_make_summary_files.R' that combines taxonomic assignment results from blast2taxonomizr with sequence data and ASV counts from DADA2 to create several summary files, including:
- ASV_taxa_seq_counts.tsv: a complete summary of taxonomic results (LCA taxon, taxon rank, and taxon path), sequence, and count results,
- ps_taxamat.tsv : ASV taxonomic results in matrix format for phyloseq,
- ps_countmat.tsv : ASV counts results in matrix format for phyloseq,
- ps_phylogeny.rds : phylogenetic tree for phyloseq prepared according to the protocol of Callahan et al. (2016); see the subsection Construct the phylogenetic tree.
The second, third and fourth files are designed to be inputs for the popular community analysis R package phyloseq.
If you are interested in understanding how it works, take a look at the R code in 03_make_summary_files.R. Among other things the script uses the R function merge to join together different data tables based on shared information (either the ASV sequence or the ASV number) and assigns column headers and rownames to the merged data (turning data tables into matrices). Notice that the script requires 'taxonomizr_summary_out.tsv' and 'taxonomizr_taxon_path_us.tsv' (the output from the previous step, which it expects to find in the 'blast_out' directory) as well as the DADA2 output files '06_ASV_seqs.fasta' and '06_ASV_counts.tsv' (which it expects to find in the 'working_data' directory).