Chapter 11 Exercise 2

The directory "~/Linux/6_final_exercise/" has all the files you need. Below is a set of tasks and questions that will require all the skills you have gained from this practical.

You can check my solutions by clicking the expandable boxes, like the one below ("Move to correct directory"). These are example solutions, not the only correct ones. If your method works and you understand why, you have completed the task correctly.

cd ~/Linux/6_final_exercise

11.1 Exercise 2 tasks

Task 1

See what files are in the directory.

ls

Task 2

Rename the file "3-P£_CACTTCGA_L001_R1_001.fastq" as "3-P3_CACTTCGA_L001_R1_001.fastq".

mv 3-P£_CACTTCGA_L001_R1_001.fastq \
3-P3_CACTTCGA_L001_R1_001.fastq

Task 3

Make a backup of the files in a directory called backup.

mkdir backup
cp 1-P1_ATGCCTGG_L001_R1_001.fastq backup/
cp 1-P1_ATGCCTGG_L001_R2_001.fastq backup/
cp 2-P2_AAGGACAC_L001_R1_001.fastq backup/
cp 2-P2_AAGGACAC_L001_R2_001.fastq backup/
cp 3-P3_CACTTCGA_L001_R1_001.fastq backup/
cp 3-P3_CACTTCGA_L001_R2_001.fastq backup/
cp 4-E1_ATTGGCTC_L001_R1_001.fastq backup/
cp 4-E1_ATTGGCTC_L001_R2_001.fastq backup/
cp metadata.txt backup/

This can be done much more quickly with wildcard characters (covered in the Advanced Linux section):

mkdir backup
cp *fastq backup
cp *txt backup

Task 4

How many reads are in the samples?

The command below gives the number of lines in each file. Divide each number by 4 (mentally or with a calculator) to get the number of reads, since each fastq entry spans four lines. The values for the R2 files will be the same as for the matching R1 files.

wc -l 1-P1_ATGCCTGG_L001_R1_001.fastq \
2-P2_AAGGACAC_L001_R1_001.fastq \
3-P3_CACTTCGA_L001_R1_001.fastq
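
The division by 4 can also be done in the terminal. The following is a minimal sketch, assuming every fastq entry spans exactly four lines. The demo file name and its contents are made up for illustration and are not part of the exercise data.

```shell
# Build a tiny two-read demo file (hypothetical name and contents).
demo_dir=$(mktemp -d)
printf '@0_demo\nACGT\n+\nIIII\n@1_demo\nTTGA\n+\nIIII\n' > "$demo_dir/demo_R1.fastq"

# Redirecting the file into wc prints only the number, without the file name.
lines=$(wc -l < "$demo_dir/demo_R1.fastq")

# Four lines per entry, so integer division gives the read count.
reads=$((lines / 4))
echo "$reads"
```

To try it on the exercise data, replace the demo path with one of the fastq file names.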

An advanced method using regular expressions, wildcard characters and grep:

grep -c "^@[0-9]*_" *R1*.fastq
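
The pattern in the grep command above is stricter than a bare "^@" for a reason: fastq quality lines can also begin with "@", so counting every line that starts with "@" may overcount. A small sketch of the difference, using a made-up demo file (hypothetical name and contents) in which one quality line starts with "@":

```shell
# Two-read demo file; the first quality line deliberately starts with "@".
demo_dir=$(mktemp -d)
printf '@0_demo\nACGT\n+\n@III\n@1_demo\nTTGA\n+\nIIII\n' > "$demo_dir/demo_R1.fastq"

# Counts the two headers plus the quality line that starts with "@".
loose=$(grep -c "^@" "$demo_dir/demo_R1.fastq")

# Counts only lines that start with "@", then digits, then "_": the headers.
strict=$(grep -c "^@[0-9]*_" "$demo_dir/demo_R1.fastq")

echo "loose: $loose, strict: $strict"
```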

Task 5

Remove the fastq files with no data.

Check which files have no data:

wc \
1-P1_ATGCCTGG_L001_R1_001.fastq 1-P1_ATGCCTGG_L001_R2_001.fastq \
2-P2_AAGGACAC_L001_R1_001.fastq 2-P2_AAGGACAC_L001_R2_001.fastq \
3-P3_CACTTCGA_L001_R1_001.fastq 3-P3_CACTTCGA_L001_R2_001.fastq \
4-E1_ATTGGCTC_L001_R1_001.fastq 4-E1_ATTGGCTC_L001_R2_001.fastq 
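
Emptiness can also be tested directly rather than read off the wc output. This is a sketch and not the course's solution: it uses a for loop and the "-s" file test (true when a file exists and is not empty), which may not have been covered yet, and it runs on made-up demo files.

```shell
# Demo directory with one non-empty and one empty file (hypothetical names).
demo_dir=$(mktemp -d)
printf '@0_demo\nACGT\n+\nIIII\n' > "$demo_dir/full_R1.fastq"
: > "$demo_dir/empty_R1.fastq"   # creates a file with no content

empty_files=""
for f in "$demo_dir"/*.fastq; do
    # -s is true for a non-empty file, so "! -s" picks out the empty ones.
    if [ ! -s "$f" ]; then
        empty_files="$empty_files $(basename "$f")"
    fi
done
echo "Empty files:$empty_files"
```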

Remove the empty files:

rm \
4-E1_ATTGGCTC_L001_R1_001.fastq 4-E1_ATTGGCTC_L001_R2_001.fastq

Task 6

Update the backup files with the previous change.

rm backup/4-E1_ATTGGCTC_L001_R1_001.fastq \
backup/4-E1_ATTGGCTC_L001_R2_001.fastq 

Task 7

Check whether the first read names match in the paired files.

head -n 1 \
1-P1_ATGCCTGG_L001_R1_001.fastq 1-P1_ATGCCTGG_L001_R2_001.fastq \
2-P2_AAGGACAC_L001_R1_001.fastq 2-P2_AAGGACAC_L001_R2_001.fastq \
3-P3_CACTTCGA_L001_R1_001.fastq 3-P3_CACTTCGA_L001_R2_001.fastq 

Task 8

Check if the last read names match in the paired files.

tail -n 4 \
1-P1_ATGCCTGG_L001_R1_001.fastq 1-P1_ATGCCTGG_L001_R2_001.fastq \
2-P2_AAGGACAC_L001_R1_001.fastq 2-P2_AAGGACAC_L001_R2_001.fastq \
3-P3_CACTTCGA_L001_R1_001.fastq 3-P3_CACTTCGA_L001_R2_001.fastq 
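
Instead of comparing the names by eye, the shell can compare them. A minimal sketch, assuming the header line is identical in the R1 and R2 files of a pair (some datasets add a read-specific suffix, in which case the comparison would need adjusting). The demo files and variable names are made up for illustration.

```shell
# Hypothetical paired demo files with matching read names.
demo_dir=$(mktemp -d)
printf '@0_demo\nACGT\n+\nIIII\n@1_demo\nTTGA\n+\nIIII\n' > "$demo_dir/demo_R1.fastq"
printf '@0_demo\nCCGT\n+\nIIII\n@1_demo\nGGGA\n+\nIIII\n' > "$demo_dir/demo_R2.fastq"

# First header: the first line of each file.
first_r1=$(head -n 1 "$demo_dir/demo_R1.fastq")
first_r2=$(head -n 1 "$demo_dir/demo_R2.fastq")

# Last header: the first line of the final four-line entry.
last_r1=$(tail -n 4 "$demo_dir/demo_R1.fastq" | head -n 1)
last_r2=$(tail -n 4 "$demo_dir/demo_R2.fastq" | head -n 1)

if [ "$first_r1" = "$first_r2" ] && [ "$last_r1" = "$last_r2" ]; then
    result="match"
else
    result="mismatch"
fi
echo "$result"
```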

Task 9

In file "1-P1_ATGCCTGG_L001_R1_001.fastq" look for sequence headers with the term ‘psychrobacter’ (Tip: Use grep).

grep "psychrobacter" 1-P1_ATGCCTGG_L001_R1_001.fastq

Task 10

In the sample 1-P1 remove any fastq entries where the term ‘psychrobacter’ appears in the fastq header. Do this for the R1 and R2 files.

Using nano, navigate to each psychrobacter entry. Use "Ctrl+K" to cut all four lines of the entry, not just the header, then use "Ctrl+S" and "Ctrl+X" to save and exit.
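
For larger files, editing by hand becomes impractical. As an alternative to nano (not the method used in this course), the sketch below uses awk to drop whole entries whose header contains the term. It assumes every entry spans exactly four lines, runs on a made-up demo file, and writes to a new file rather than changing the input.

```shell
# Hypothetical demo file: one entry to keep and one to remove.
demo_dir=$(mktemp -d)
printf '@0_ecoli\nACGT\n+\nIIII\n@1_psychrobacter\nTTGA\n+\nIIII\n' > "$demo_dir/demo_R1.fastq"

# On every header line (lines 1, 5, 9, ...) decide whether to keep the entry;
# the trailing "keep" prints each line while that decision is true.
awk 'NR % 4 == 1 { keep = ($0 !~ /psychrobacter/) } keep' \
    "$demo_dir/demo_R1.fastq" > "$demo_dir/demo_R1.filtered.fastq"

cat "$demo_dir/demo_R1.filtered.fastq"
```

The same command would need to be run separately on the R1 and R2 files so that the pairs stay in step.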

Task 11

Print to screen the fastq header, sequence and quality data for the 25th sequence in sample 2-P2 for both the R1 and R2 file. Do this with one command for R1 and a separate command for R2.

The search term is "@24_ecoli" because the read numbering starts at @0_ecoli, so the 25th sequence is number 24.

grep -A 3 "@24_ecoli" 2-P2_AAGGACAC_L001_R1_001.fastq
grep -A 3 "@24_ecoli" 2-P2_AAGGACAC_L001_R2_001.fastq
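
The entry can also be located by position instead of by name. Assuming every entry spans exactly four lines, entry n starts at line 4(n-1)+1 and ends at line 4n, so the 25th entry occupies lines 97 to 100. A sketch using sed on a made-up 30-entry demo file (the file, its read names and the variable names are illustrative only):

```shell
# Generate a hypothetical demo file with reads numbered @0_demo to @29_demo.
demo_dir=$(mktemp -d)
i=0
while [ "$i" -lt 30 ]; do
    printf '@%s_demo\nACGT\n+\nIIII\n' "$i"
    i=$((i + 1))
done > "$demo_dir/demo_R1.fastq"

# Line range of the 25th entry, four lines per entry.
n=25
start=$((4 * (n - 1) + 1))
end=$((4 * n))

# Print only the lines from start to end.
entry=$(sed -n "${start},${end}p" "$demo_dir/demo_R1.fastq")
echo "$entry"
```

Because numbering in the demo file starts at 0, the header printed for the 25th entry is the one numbered 24, mirroring the reasoning behind the grep solution.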

11.2 Exercise 2 conclusion

Stupendous! You have finished the last exercise of the Intro to Linux section.

Thanks for your hard work. You have learnt a lot throughout this course but there is more to learn if you are willing and have the time. The next section is the Advanced Linux section. This is not required for any of our other workshops but the skills are very useful for bioinformatics analysis in Linux.