Advanced practice exercise

  1. Copy the directory ~/Linux/advanced_practice to ~/Linux/advanced_practice_exercise
  2. Move into ~/Linux/advanced_practice_exercise
  3. Make a directory called fastq and one called txt
  4. With one command move all the fastq files into the directory fastq
  5. With one command move all the txt files, excluding metadata.txt and samples.txt, into the directory txt
  6. Create a file in the fastq directory called patient_1_corrected.fastq and put all the corrected fastq data for patient_1 into the file. You can look at the metadata.txt file to see which samples belong to patient_1.
  7. Append the metadata line for sample_1_AAAA to the bottom of the file sample_1_AAAA.txt in the txt directory.
  8. For all the corrected fastq files find the sequences that start with a stop codon in the forward orientation (i.e. TAG, TAA or TGA). Print out to screen the sample name and sequence info separated by a “:” only (e.g. sample_10_AAGT:TAAGAGAACAATGAACAGATATTAATAATTTTGCCGCTTTTCTGCGGGAT)
  9. Count the number of Gs and Cs within file sample_16_AACC.fastq
  10. Get the fastq headers of sequences with homopolymers made of As with a length of 5 or greater for the uncorrected fastq files for samples 3,4,5,13,14 and 15 with one command.

Answers

The solutions below are examples rather than definitive answers. If your method works and you understand why, then you have done it correctly.

Ensure you are in the correct directory before running the commands below.

cd ~/Linux/

Copy the directory ~/Linux/advanced_practice to ~/Linux/advanced_practice_exercise

cp -r ~/Linux/advanced_practice ~/Linux/advanced_practice_exercise

Move into ~/Linux/advanced_practice_exercise

cd ~/Linux/advanced_practice_exercise

Make a directory called fastq and one called txt

mkdir fastq txt

With one command move all the fastq files into the directory fastq

mv *.fastq fastq/

With one command move all the txt files, excluding metadata.txt and samples.txt, into the directory txt.

mv sample_*txt txt/
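To see why this pattern leaves metadata.txt and samples.txt behind, it can help to preview a glob with echo before moving anything. The following is a minimal sketch in a scratch directory; the file names are invented for the demonstration and are not the exercise data.

```shell
# Build a throwaway directory with made-up file names (not the real exercise files).
demo=$(mktemp -d)
touch "$demo/metadata.txt" "$demo/samples.txt" \
      "$demo/sample_1_AAAA.txt" "$demo/sample_2_AAAT.txt"

# Preview what the pattern expands to. Only names that start with "sample_"
# match, so samples.txt (no underscore after "sample") and metadata.txt are skipped.
matched=$(cd "$demo" && echo sample_*txt)
echo "$matched"

# Clean up the scratch directory.
rm -r "$demo"
```

Running echo with the same pattern you intend to give to mv is a cheap way to check a glob before any files are moved.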

Create a file in the fastq directory called patient_1_corrected.fastq and put all the corrected fastq data for patient_1 into the file.

cat fastq/sample_[1-2]_*corrected.fastq > \
fastq/patient_1_corrected.fastq
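The range [1-2] in this solution implies that samples 1 and 2 belong to patient_1. The sketch below shows, under that assumption, how the pattern picks up only those two corrected files and how cat joins them in order. The file names follow the naming seen in the solutions, but the file contents are placeholders made up for the demonstration.

```shell
# Scratch directory with placeholder files; each holds one marker line.
demo=$(mktemp -d)
printf 'read_from_sample_1\n'  > "$demo/sample_1_AAAA_corrected.fastq"
printf 'read_from_sample_2\n'  > "$demo/sample_2_AAAT_corrected.fastq"
printf 'read_from_sample_10\n' > "$demo/sample_10_AAGT_corrected.fastq"

# "[1-2]_" needs an underscore straight after the digit, so sample_10 is not matched.
cat "$demo"/sample_[1-2]_*corrected.fastq > "$demo/patient_1_corrected.fastq"

combined=$(cat "$demo/patient_1_corrected.fastq")
echo "$combined"

rm -r "$demo"
```

The output contains the marker lines from samples 1 and 2 only, which is the behaviour the exercise relies on.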

Append the metadata line for sample_1_AAAA from metadata.txt to the bottom of the file sample_1_AAAA.txt in the txt directory.

cat metadata.txt | grep "sample_1_AAAA" >> txt/sample_1_AAAA.txt
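The difference between > and >> matters here: >> adds to the end of the file, whereas > would overwrite it. A small sketch of the append follows. The metadata layout (sample name, tab, patient) is a hypothetical stand-in, since the real metadata.txt columns are not shown in this exercise.

```shell
demo=$(mktemp -d)
# Hypothetical metadata layout: sample name, then patient, separated by a tab.
printf 'sample_1_AAAA\tpatient_1\nsample_10_AAGT\tpatient_5\n' > "$demo/metadata.txt"
# A file that already has content we want to keep.
printf 'existing line\n' > "$demo/sample_1_AAAA.txt"

# ">>" appends the matching metadata line instead of replacing the file.
grep "sample_1_AAAA" "$demo/metadata.txt" >> "$demo/sample_1_AAAA.txt"

first_line=$(head -n 1 "$demo/sample_1_AAAA.txt")
last_line=$(tail -n 1 "$demo/sample_1_AAAA.txt")
echo "$first_line"
echo "$last_line"

rm -r "$demo"
```

The original first line is still present, and the metadata line for sample_1_AAAA is now the last line of the file.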

For all the corrected fastq files find the sequences that start with a stop codon in the forward orientation (i.e. TAG, TAA or TGA). Print out to screen the sample name and sequence info separated by a “:” only (e.g. sample_10_AAGT:TAAGAGAACAATGAACAGATATTAATAATTTTGCCGCTTTTCTGCGGGAT). The pattern starts with sample_ so that patient_1_corrected.fastq, created in an earlier step, is not searched as well.

grep "^TA[AG]\|^TGA" fastq/sample_*corrected.fastq | \
sed "s/.*sample/sample/" | sed "s/_corrected.fastq//"
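To make the role of each stage clearer, here is a small worked sketch on two made-up single-read FASTQ files. The reads and the sample_2_AAAT name are invented for illustration; only the file naming scheme is taken from the solutions above.

```shell
demo=$(mktemp -d)
mkdir "$demo/fastq"
# Two tiny made-up FASTQ files, one read each. Only the first starts with a stop codon.
printf '@sample_1_AAAA_1 1:\nTAAGCT\n+\nIIIIII\n' > "$demo/fastq/sample_1_AAAA_corrected.fastq"
printf '@sample_2_AAAT_1 1:\nGTAACC\n+\nIIIIII\n' > "$demo/fastq/sample_2_AAAT_corrected.fastq"

# With several input files grep prefixes each hit with "path:". The first sed
# removes everything before "sample", the second removes the "_corrected.fastq" suffix.
hits=$(cd "$demo" && grep "^TA[AG]\|^TGA" fastq/sample_*corrected.fastq | \
       sed "s/.*sample/sample/" | sed "s/_corrected.fastq//")
echo "$hits"

rm -r "$demo"
```

The second read contains TAA but not at the start of the line, so the ^ anchor excludes it.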

Count the number of Gs and Cs within the sequences of file sample_16_AACC.fastq

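# grep -o prints each G or C on its own line, so wc -l counts only those bases
# (deleting A and T and then using wc -c would also count the newline characters)
grep -B 1 "^+$" fastq/sample_16_AACC.fastq | \
grep -v "^+$\|^--$" | grep -o "[GC]" | wc -l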
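As a check on the counting logic, the sketch below runs one possible G and C count on a made-up two-read file where the answer can be verified by eye. It uses grep -o with wc -l so that newline characters are not included in the total; the file toy.fastq and its reads are invented for this example.

```shell
demo=$(mktemp -d)
# Two made-up reads: GGCATC has 2 Gs and 2 Cs, ATTTAG has 1 G, so the total is 5.
printf '@read_1\nGGCATC\n+\nIIIIII\n@read_2\nATTTAG\n+\nIIIIII\n' > "$demo/toy.fastq"

# Keep the line before each "+" separator (the sequence), drop the "+" lines and
# the "--" group separators that grep -B adds, print every G or C on its own
# line, then count those lines.
gc_count=$(grep -B 1 "^+$" "$demo/toy.fastq" | grep -v "^+$\|^--$" | \
           grep -o "[GC]" | wc -l)
echo "$gc_count"

rm -r "$demo"
```

Testing a pipeline on a file small enough to count by hand is a useful habit before trusting the number from a real sample.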

Get the fastq headers of sequences with homopolymers made of As with a length of 5 or greater for the uncorrected fastq files for samples 3, 4, 5, 13, 14 and 15 with one command. Then in the same command make the final output of each line in the format of “Sample_13: Sequence 12”

cat fastq/*[3-5]*[AGCT].fastq | grep -B 2 "^+$" | \
grep -B 1 "AAAAA" | grep "^@" | sed "s/^@s/S/" | \
sed "s/_[AGCT]*_/: Sequence /" | sed "s/ 1:$//"
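The pipeline above can be traced on a small made-up data set. The sketch assumes headers of the form "@sample_13_AAGT_12 1:", which is inferred from the sed substitutions rather than stated in the exercise, and the reads themselves are invented.

```shell
demo=$(mktemp -d)
mkdir "$demo/fastq"
# Sample 13: the first read has a run of five As, the second does not.
printf '@sample_13_AAGT_12 1:\nGGAAAAATC\n+\nIIIIIIIII\n@sample_13_AAGT_13 1:\nGGATATATC\n+\nIIIIIIIII\n' \
  > "$demo/fastq/sample_13_AAGT.fastq"
# Sample 8 also has a homopolymer, but its file name contains no 3, 4 or 5,
# so the glob should not pick it up.
printf '@sample_8_AACA_1 1:\nAAAAAAGTC\n+\nIIIIIIIII\n' > "$demo/fastq/sample_8_AACA.fastq"

# Header, sequence and "+" lines -> sequences with AAAAA plus their headers ->
# headers only -> reformat to "Sample_N: Sequence M".
headers=$(cd "$demo" && cat fastq/*[3-5]*[AGCT].fastq | grep -B 2 "^+$" | \
          grep -B 1 "AAAAA" | grep "^@" | sed "s/^@s/S/" | \
          sed "s/_[AGCT]*_/: Sequence /" | sed "s/ 1:$//")
echo "$headers"

rm -r "$demo"
```

Because the glob matches any file name containing a 3, 4 or 5, it would also pick up samples such as 23 or 30 if they existed, so it is worth previewing it with echo on the real directory.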