Chapter 7 Host removal

It is good practice to remove any host sequences from your data before further analysis. A good method for this is to align/map your reads to a reference of your host genome and remove the mapped sequences (i.e sequences we believe belong to the host).

If there is no host genome available before you start your sample collections and sequencing it may be a good idea to attempt to sequence and assemble the host genome. We would recommend long read technologies for single genome assembly projects.

This chapter contains a small example on how to carry out host removal. It uses only a section of a human (host of our samples) reference genome assembly. In real life you should use the entire reference.

The first step is to copy over the reference fasta file we will use.

cp /pub14/tea/nsc206/NEOF/Shotgun_metagenomics/GRCh38_slice.fasta .

7.1 Index reference

We will will use the Bowtie2 aligner for mapping/aligning. Prior to alignment/mapping we need to index our reference.

bowtie2-build GRCh38_slice.fasta GRCh38_slice.fasta

If you use ls you will now see a bunch of files starting with GRCh38_slice.fasta and ending with various suffixes that contain bt2. These are the index files which allow us to use the reference with Bowtie2.

7.2 Alignment

With the indexed reference we will align the K1 reads to the reference. This creates a BAM file that contains alignment and read information (K1_mapped.bam).

bowtie2 -x GRCh38_slice.fasta -1 K1_R1.fq.gz -2 K1_R2.fq.gz \
-p12 2> K1_bowtie2_out.log | samtools view -b -S -h > K1_mapped.bam

Parameters

This command is split into two commands. The first is bowtie2 that creates the alignment. The parameters for the command are:

-x: Indexed reference the reads will be aligned to.
-1: The forward reads.
-2: The reverse reads.
-p: Number of threads to be used. 12 in this case.
2>: This will cause the standard error to be redirected to the chosen file. In this case K1_bowtie2_out.log.
- This is useful for commands that may produce a lot of output to screen. If an error occurs you can view this file to see the error messages.

The alignment is then piped (|) to the command samtools view. For more information on pipes please see our Intro to Unix course book.

The parameters for samtools view are:

-b: Output the alignment as a BAM file.
- BAM files are a binary form of SAM files so they are smaller in memory size.
- If you are interested in the SAM format please see its specification file.
-S: Auto detect input format.
-h: Include header.

The binary alignment is redirected to a new file called K1_mapped.bam. For more information on redirection please see our Intro to Unix course book.

7.3 Unmapped read extraction

Next step is to extract the reads that did not map to the host reference from the K1_mapped.bam file with the samtools fastq command (unmapped reads).

samtools fastq -f 4 -1 K1_R1.u.fastq -2 K1_R2.u.fastq K1_mapped.bam

Parameters

-f: Output reads that only include the SAM flag.
- In this case 4 stands for unmapped reads. Therefore, our resulting fastq files will only contain unmapped reads.
- The following link is very useful to create a SAM flag you may need: https://broadinstitute.github.io/picard/explain-flags.html.
-1: The output R1 fastq file of unmapped reads.
-2: The output R2 fastq file of unmapped reads.
Last file: The last file name is the input file for the command, K1_mapped.bam in this case.

This step may make unmatched paired files (why we have .u. in the output file names). This occurs when a read from R2 is removed but the matching read in R1 is not removed, or vice versa. This will cause issues for further analysis.

7.4 Re-pair

The below BBTools command will re-pair the reads by removing reads with a missing pair. The command ensures the order of the reads are identical in the 2 output paired files.

repair.sh in1=K1_R1.u.fastq in2=K1_R2.u.fastq \
out1=K1_R1.final.fastq out2=K1_R2.final.fastq \
outs=K1_singletons.fastq

Parameters

in1=: The input R1 fastq file of unmapped unpaired reads.
in2=: The input R2 fastq file of unmapped unpaired reads.
out1=: The output R1 fastq file of unmapped paired reads.
out2=: The output R2 fastq file of unmapped paired reads.
outs=: The output fastq file containing the left over singletons (a sequence missing a pair).
- This file can normally be ignored.

7.5 Host removal summary

We have run through quality control including host removal for our K1 sample. As our data has pretty much no human data we will skip this step for the other samples and use the trimmed data for the downstream analysis.

In a real analysis project you would use a whole genome reference for your host. However, that would have taken too long for this practical. The most current Human reference (when this was written) is GRCh38. We used a random 10kb section to align our reads to.

For more resources on the Human reference please see: https://www.ncbi.nlm.nih.gov/genome/guide/human/

The assembly we used was: https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz