Chapter 6 Quality control

Now that we've obtained the raw data and had a look at it, we should clean it up. With any sequencing data, it is very important to ensure that you use the highest quality data possible: Rubbish goes in, rubbish comes out.

There are two main methods employed to clean sequence data, and a third method specific to some metagenomic datasets.

  • Remove low quality bases from the end of the reads: These are more likely to be incorrect, so are best trimmed off.
  • Remove adapters: Sometimes sequencing adapters can be sequenced if the sequencing runs off the end of a fragment.
  • Host removal: If a metagenomic sample derives from a host species then it may be advisable to remove any reads associated with the host genome.

6.1 Removing adapters and low quality bases

Go back to your home directory and create a new directory where we will clean the sequences up:

cd ..
mkdir 2-Trimmed
cd 2-Trimmed

You are now in your newly created directory. Here we will run Trim Galore! which removes low quality bases and adapters.

trim_galore --paired --quality 20 --stringency 4 \
../1-Raw/K1_R1.fastq.gz ../1-Raw/K1_R2.fastq.gz

This command will remove any low quality regions from the end of both reads in each read pair (quality score < 20). Additionally, if it detects four or more bases of a sequencing adapter, it will trim that off too.

Task: Rerun this command for the other two samples (K2 and W1). Try to run these without looking at the help box below.

#K2
trim_galore --paired --quality 20 --stringency 4 \
../1-Raw/K2_R1.fastq.gz ../1-Raw/K2_R2.fastq.gz
#W1
trim_galore --paired --quality 20 --stringency 4 \
../1-Raw/W1_R1.fastq.gz ../1-Raw/W1_R2.fastq.gz

6.2 Rename the files

Once that is complete list the contents of your directory:

ls

You will notice that we have a new bunch of files created: 2 new read files for each sample along with a trimming report for each file trimmed. However, the new names are needlessly long. For example K1_R1_val_1.fq.gz could be shortened to K1_R1.fq.gz. So, we'll rename all of the files with the mv command:

mv K1_R1_val_1.fq.gz K1_R1.fq.gz
mv K1_R2_val_2.fq.gz K1_R2.fq.gz
mv K2_R1_val_1.fq.gz K2_R1.fq.gz
mv K2_R2_val_2.fq.gz K2_R2.fq.gz
mv W1_R1_val_1.fq.gz W1_R1.fq.gz
mv W1_R2_val_2.fq.gz W1_R2.fq.gz

Tip: If you want to edit and reuse previous commands, press the up arrow key.

Task: Briefly inspect the log files to see how the trimming went (e.g. K1_R1.fastq.gz_trimming_report.txt).

6.3 Inspect the trimmed data

To see what difference the trimming made, run FastQC and MultiQC again on the trimmed output files and view it.

#R1 fastqc and multiqc
mkdir R1_fastqc
fastqc -t 3 -o R1_fastqc *R1.fq.gz
mkdir R1_fastqc/multiqc
multiqc -o R1_fastqc/multiqc R1_fastqc

Task

Run FastQC and MultiQC for the R2 files and then view the R1 and R2 MultiQC reports with firefox. Try to run the commands without looking at the help box below.

How does the quality compare to the untrimmed data?

#R2 fastqc and multiqc
mkdir R2_fastqc
fastqc -t 3 -o R2_fastqc *R2.fq.gz
mkdir R2_fastqc/multiqc
multiqc -o R2_fastqc/multiqc R2_fastqc