cd ~/Linux/
Advanced practice exercise

- Copy the directory ~/Linux/advanced_practice to ~/Linux/advanced_practice_exercise
- Move into ~/Linux/advanced_practice_exercise
- Make a directory called fastq and one called txt
- With one command move all the fastq files into the directory fastq
- With one command move all the txt files, excluding metadata.txt and samples.txt, into the directory txt
- Create a file in the fastq directory called patient_1_corrected.fastq and put all the corrected fastq data for patient_1 into the file. You can look at the metadata.txt file to see which samples belong to patient_1.
- Append the metadata line for sample_1_AAAA to the bottom of the file sample_1_AAAA.txt in the txt directory.
- For all the corrected fastq files find the sequences that start with a stop codon in the forward orientation (i.e. TAG, TAA or TGA). Print out to screen the sample name and sequence info separated by a “:” only (e.g. sample_10_AAGT:TAAGAGAACAATGAACAGATATTAATAATTTTGCCGCTTTTCTGCGGGAT)
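- Count the number of Gs and Cs within the sequences of file sample_16_AACC.fastq
- Get the fastq headers of sequences with homopolymers made of As with a length of 5 or greater for the uncorrected fastq files for samples 3, 4, 5, 13, 14 and 15 with one command. Then in the same command make the final output of each line in the format of “Sample_13: Sequence 12”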
Answers
Below are my solutions for the exercise. They are examples rather than the definitive solutions. If your method works and you understand why, then you have done it correctly.
Ensure you are in the correct directory before running the commands below.
Copy the directory ~/Linux/advanced_practice to ~/Linux/advanced_practice_exercise
cp -r ~/Linux/advanced_practice ~/Linux/advanced_practice_exercise
Move into ~/Linux/advanced_practice_exercise
cd ~/Linux/advanced_practice_exercise
Make a directory called fastq and one called txt
mkdir fastq txt
With one command move all the fastq files into the directory fastq
mv *.fastq fastq/
With one command move all the txt files, excluding metadata.txt and samples.txt, into the directory txt.
mv sample_*txt txt/
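Before moving files with a wildcard it can help to preview what the pattern expands to. The sketch below is one way to do that, using echo in a scratch directory. The file names are made up for illustration; they are an assumption about the layout, not a listing of the real exercise directory.

```shell
# Build a scratch directory with hypothetical file names to test the pattern safely.
scratch=$(mktemp -d)
touch "$scratch/metadata.txt" "$scratch/samples.txt" \
      "$scratch/sample_1_AAAA.txt" "$scratch/sample_2_AAAT.txt"

# echo prints what the shell expands the pattern to, without moving anything.
# "sample_" requires an underscore after "sample", so samples.txt is skipped.
matched=$(cd "$scratch" && echo sample_*txt)
echo "$matched"

rm -r "$scratch"
```

Only the two sample_ files are printed, which is why the mv command leaves metadata.txt and samples.txt where they are.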
Create a file in the fastq directory called patient_1_corrected.fastq and put all the corrected fastq data for patient_1 into the file.
cat fastq/sample_[1-2]_*corrected.fastq > \
fastq/patient_1_corrected.fastq
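The [1-2] in the pattern matches exactly one character, either 1 or 2. The sketch below illustrates this with made-up file names in a scratch directory. Which samples really belong to patient_1 has to come from metadata.txt, so the names here are assumptions only.

```shell
# Scratch directory with hypothetical file names (not the real exercise data).
scratch=$(mktemp -d)
touch "$scratch/sample_1_AAAA_corrected.fastq" \
      "$scratch/sample_2_AAAT_corrected.fastq" \
      "$scratch/sample_12_AAGG_corrected.fastq" \
      "$scratch/sample_1_AAAA.fastq"

# [1-2] stands for a single character, so sample_12_ is not matched:
# after the "1" the pattern expects "_" but finds "2".
matched=$(cd "$scratch" && echo sample_[1-2]_*corrected.fastq)
echo "$matched"

rm -r "$scratch"
```

The uncorrected file is also left out because the pattern requires the name to end in corrected.fastq.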
Append the metadata line for sample_1_AAAA from metadata.txt to the bottom of the file sample_1_AAAA.txt in the txt directory.
cat metadata.txt | grep "sample_1_AAAA" >> txt/sample_1_AAAA.txt
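The >> operator adds to the end of a file, whereas > would replace its contents. The sketch below shows the append on toy files. The two-column, tab-separated metadata layout is an assumption for illustration, and grep is given the file name directly, which is equivalent to piping it in with cat.

```shell
scratch=$(mktemp -d)
# Toy files: one existing line, and a hypothetical two-column metadata table.
printf 'existing line\n' > "$scratch/sample.txt"
printf 'sample_1_AAAA\tpatient_1\nsample_2_AAAT\tpatient_1\n' > "$scratch/metadata.txt"

# >> appends the matching metadata line; the existing line is kept.
grep "sample_1_AAAA" "$scratch/metadata.txt" >> "$scratch/sample.txt"

line_count=$(grep -c "" "$scratch/sample.txt")   # number of lines in the file
last_line=$(tail -n 1 "$scratch/sample.txt")     # the line that was appended
echo "$line_count lines, last line: $last_line"

rm -r "$scratch"
```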
For all the corrected fastq files find the sequences that start with a stop codon in the forward orientation (i.e. TAG, TAA or TGA). Print out to screen the sample name and sequence info separated by a “:” only (e.g. sample_10_AAGT:TAAGAGAACAATGAACAGATATTAATAATTTTGCCGCTTTTCTGCGGGAT)
grep "^TA[AG]\|^TGA" fastq/sample_*corrected.fastq | \
sed "s/.*sample/sample/" | sed "s/_corrected.fastq//"
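To see how the pieces fit together, the sketch below runs the same idea on two made-up fastq records. It writes the pattern with grep -E, an equivalent form of the alternation, and the read names and sequences are invented for illustration.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/fastq"
# Two tiny made-up corrected fastq files (one record each).
printf '@sample_1_AAAA_1 1:\nTAAGGC\n+\nIIIIII\n' > "$scratch/fastq/sample_1_AAAA_corrected.fastq"
printf '@sample_2_AAAT_1 1:\nGGTAAC\n+\nIIIIII\n' > "$scratch/fastq/sample_2_AAAT_corrected.fastq"

# With more than one input file, grep prefixes each hit with "filename:".
# The sed steps then trim the path and the file suffix from that prefix.
hits=$(grep -E "^(TA[AG]|TGA)" "$scratch"/fastq/sample_*corrected.fastq | \
  sed "s/.*sample/sample/" | sed "s/_corrected.fastq//")
echo "$hits"

rm -r "$scratch"
```

Only the first record is reported: its sequence starts with TAA, while the second contains TAA further along, which the ^ anchor rules out.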
Count the number of Gs and Cs within the sequences of file sample_16_AACC.fastq
cat fastq/sample_16_AACC.fastq | grep -B 1 "^+$" | \
grep -v "+\|--" | grep -o "[GC]" | wc -l
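A quick way to sanity-check a base count is to run it on a tiny file where the answer is known. The sketch below uses a made-up two-record fastq and counts G and C with tr -cd, which deletes every character except those listed (newlines included), so that wc -c sees only the bases. It is an alternative way to count, shown under the assumption that each sequence sits on the single line before the "+" separator.

```shell
scratch=$(mktemp -d)
# Made-up fastq with two records; the sequences hold 4 + 2 = 6 G/C bases.
printf '@r1\nGATCGC\n+\nIIIIII\n@r2\nAAGTC\n+\nIIIII\n' > "$scratch/toy.fastq"

# Keep the line before each "+" separator (the sequence), drop the "+" and
# "--" lines, then delete everything that is not G or C and count what is left.
gc_count=$(grep -B 1 "^+$" "$scratch/toy.fastq" | \
  grep -Ev "^(\+|--)$" | tr -cd "GC" | wc -c)
echo "$gc_count"

rm -r "$scratch"
```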
Get the fastq headers of sequences with homopolymers made of As with a length of 5 or greater for the uncorrected fastq files for samples 3, 4, 5, 13, 14 and 15 with one command. Then in the same command make the final output of each line in the format of “Sample_13: Sequence 12”
cat fastq/*[3-5]*[AGCT].fastq | grep -B 2 "^+$" | \
grep -B 1 "AAAAA" | grep "^@" | sed "s/^@s/S/" | \
sed "s/_[AGCT]*_/: Sequence /" | sed "s/ 1:$//"
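The three sed substitutions are easier to follow on a single header line. The sketch below applies them to a hypothetical header; its layout (sample name, barcode, read number, then " 1:") is an assumption inferred from the substitutions themselves, and the real files may differ.

```shell
# Hypothetical header line; the real files may carry different fields.
header="@sample_13_AACC_12 1:"

# 1) "@s" at the start becomes "S"
# 2) the "_barcode_" part becomes ": Sequence "
# 3) the trailing " 1:" is removed
formatted=$(echo "$header" | \
  sed "s/^@s/S/" | \
  sed "s/_[AGCT]*_/: Sequence /" | \
  sed "s/ 1:$//")
echo "$formatted"
```

The second substitution skips "_13_" because [AGCT]* only allows the letters A, G, C and T between the two underscores.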