Chapter 12 Advanced Linux practice

We have covered a small amount of Linux coding. This should be sufficient to carry out our future workshops but if you were to continue in bioinformatics we would recommend learning more advanced methods.

Below are some short sections to introduce you to some more advanced linux coding techniques. These give you a quick overview and some examples. This will hopefully put you in a good position to allow you to to learn these techniques in more depth outside of this workshop.

The following sections will all be run with the files in the directory "~/Linux/advanced_practice/". Therefore ensure you are in this directory before running the below examples. This contains fastq and txt files for 20 samples. Each sample contains a fastq file and a txt file for uncorrected and corrected reads. These fastq files are single end (i.e. there are no reverse/R2 reads).

12.1 Wildcard characters

These are characters that can be used to represent a variety of other characters. This can be useful for deleting many files, searching for files with specific patterns in their names and more.

Be very careful when using wildcard character with the command rm.

Three basic and useful wildcards include:

* - This represents zero or more characters
? - This represents a single character
[] - This represents a range of characters

Below are various examples you can run to show wildcards in action.

List all the files and directories in the working directory

ls *

List all files ending in “.fastq”

ls *.fastq

List all files ending in “.txt”

ls *.txt

List all files with the string “corrected” somewhere in the file name

ls *corrected*

List all files with the string “corrected” somewhere in the file name and it ends in “fastq”

ls *corrected*fastq

List all files that begin with “sample_2”

ls sample_2*

List all files that begin with “sample_2_”

ls sample_2_*

List the files that begin with “sample_1” and ends with “AAAA.txt”. It may have zero or more characters between these two.

ls sample_1_*AAAA.txt

List the fastq files of samples with a single digit number

ls sample_?_*fastq

List the txt files of samples with a number in the tens

ls sample_1?_*txt

List the txt files for samples 3,4,5,6 & 7 i.e. 3-7

ls sample_[3-7]_*txt

List the txt files for the non corrected information of samples with single digits.

ls sample_?_*[ATGC].txt

List the corrected txt files for samples with numbers divisible by 10.

ls sample_[0-9][0]_*corrected.txt

12.2 Redirection

Redirection allows you to put the output of a command to a file. The redirect symbol is >. Be careful when redirecting as it will overwrite any existing files. To append to the bottom of a file use >>.

Below are various examples of redirecting in action.

Create a file called ecoli.tmp containing the text “I am escherichia coli”

echo “I am escherichia coli” > ecoli.tmp

Create a file called pcryohalolentis.tmp containing the text “I am psychrobacter cryohalolentis”

echo “I am psychrobacter cryohaloloentis” > pcryohalolentis.tmp

Create a new file called bacteria.tmp which will contain the text from ecoli.tmp and pcryohalolentis.tmp

cat *tmp > bacteria.tmp

Create a file called vcholerae.tmp containing the text “I am not ecoli or pcryohalolentis”

echo “I am not ecoli or pcryohalolentis” > vcholerae.tmp

cat the file vcholerae.tmp and redirect it to bacteria.tmp.

cat vcholerae.tmp > bacteria.tmp

Look at the contents of bacteria.tmp

cat bacteria.tmp

This has removed the ecoli and pcryohalolentis lines. Append the contents of ecoli.tmp and pcryohalolentis.tmp to bacteria.tmp and then check the file

cat ecoli.tmp pcryohalolentis.tmp >> bacteria.tmp
cat bacteria.tmp

Put information regarding number of lines of all the fastq files into a new file called fastq_lines.tmp

wc -l *fastq > fastq_lines.tmp

Now delete all the files that were created in the above examples. Again be very careful about using the rm command with wildcards.

rm *tmp

12.3 Pipes

Pipes allow you to put the output of one command to the input of another. For example you could use grep to get all the lines with a certain string and pipe the output to wc to count the number of lines that have the specific string.

The pipe symbol is |. This is normally found on your keyboard directly left of the Z key. Weirdly the symbol is represented by | but split in the middle on some keyboards.

A useful tip when building up longer pipes is to start with a smaller amount of data and check the output of each step as you go. To do this you could use head instead of cat whilst testing.

Below are various examples of piping in action

Print to screen the second last fastq entry of the file sample_20_ATAC_corrected.fastq

cat sample_20_ATAC_corrected.fastq | tail -n 8 | head -n 4

Note: In the above command the tail command is working on the output of the cat command. Therefore this would not work to get the second last fastq entry of multiple files. For example the following command would print the second last fastq entry of the last fastq file (i.e. sample_9_AAGA.fastq due to file ordering)

cat *fastq | tail -n 8 | head -n 4

Count the number of lines within all the fastq files

cat *fastq | wc -l

Count the number of lines which contain the text "TAG" within all the fastq files

cat *fastq | grep “TAG” | wc -l

12.4 Regular expressions

Regular expressions are similar to wildcard characters but more complex and used for commands like grep and sed.

Below are a basic set of regular expressions:

. : A single character
? : The preceding character matches 1 or 0 times
* : The preceding character matches zero or more times
+ : The preceding character matches one or more times
{n} : The preceding character matches exactly n times
{n,m} : The preceding character matches n to m times
[AT] : The character is one of the characters in the brackets
[^CG] : The character is not one of those in the brackets
[1-7] : The character is 1,2,3,4,5,6 or 7. This works with letters too.
() : Group several characters into one
| : Logical OR operator
^ : Matches the beginning of the line
$ : Matches the end of the line

Below are various examples of regular expressions in action.

Look at the contents of metadata.txt

cat metadata.txt

Print out the lines for the Healthy patients

cat metadata.txt | grep “HEALTHY”

Print out the lines for the IBD patients from Craigavon and Belfast. In the below command \ is used to allow | to be used as an or operator instead of acting as a string to match.

cat metadata.txt | grep "IBD" | grep "BELFAST\|CRAIGAVON"

Print out the lines for the Pre information of patients not from Edinburgh or Aberdeen

cat metadata.txt | grep "PRE$" | grep -v "ABERDEEN\|EDINBURGH"

Print out the lines for patients 1,2,3 and 4

cat metadata.txt | grep "Patient_[1-4][^0-9]"

Print to screen every fastq header of file sample_15_AACG_corrected.fastq

cat sample_15_AACG_corrected.fastq | grep “^@sample”

In the piping examples we counted the number of lines with the text “TAG” within the fastq files. However this also counted fastq headers due to the name of the samples. Let us use a regular expression to only count the number of sequences within the fastq files that contain “TAG”.

cat *fastq | grep “^[^@].*TAG”

Print to screen every line within the file sample_3_AAAG_corrected.fastq that has a possible Threonine codon in the forward direction.

grep "^[^@].*AC[ACTG]" sample_3_AAAG_corrected.fastq

In the above example fastq quality lines are also extracted as some of them also contain the pattern we are searching for. To get around this we can pipe. First grep the fastq quality header (i.e. +), as no other line only contains “+”, and the line before it. Then we can remove lines with a plus with an invert grep. Finally we can grep for the threonine pattern using only the sequence lines. Let us build this up step by step.

Print to screen the fastq quality header plus the one line preceding each (i.e. Sequence line) for file sample_3_AAAG_corrected.fastq.

grep -B 1 “^+$” sample_3_AAAG_corrected.fastq

Now pipe this output so it removes the lines with “+” (fastq quality headers) and “--” (separators of each grep match provided by grep because of the -B 1 option).

grep -B 1 "^+$" sample_3_AAAG_corrected.fastq | grep -v "+\|--"

Now from this output, grep for the Threonine pattern plus colour each match within the line with the option “--color”.

grep -B 1 "^+$" sample_3_AAAG_corrected.fastq | grep -v "+\|--" | grep --color "AC[ACTG]"

Let us repeat the above but add the possibility of the threonine being in the reverse direction.

grep -B 1 "^+$" sample_3_AAAG_corrected.fastq | grep -v "+\|--" \
| grep --color "AC[ACTG]\|[ACTG]CA"

Resources to learn more in the future

Rex Egg, A good resource to learn more about regular expressions:

https://www.rexegg.com/

Cheatsheet:

https://www.rexegg.com/regex-quickstart.html

Regex Crossword, A online game like soduku that is useful to practice regular expressions. Best used in conjunction with the above cheat sheet:

https://regexcrossword.com/

12.5 sed

This is a complicated yet powerful command that can be used to edit text files quickly and efficiently. The main use is to substitute text with other text. sed can be used with regular expressions.

The basic outline of a sed substitute command is as below. In the below case s/ signifies that sed will be used for substitution

sed “s/old_text/new_text/” old_file > new_file

Below are some examples of sed in action.

Print out a list of all the sample names using the fastq files

ls -1 *[AGTC].fastq | sed "s/.fastq//"

First print out the contents of the file metadata.txt

cat metadata.txt

Print out metadata.txt and change IBD to DISEASE without altering the file

sed "s/IBD/DISEASE/" metadata.txt

cat metadata.txt | sed "s/IBD/DISEASE/"

sed is case sensitive and will by default only replace the first instance it finds within each line. Print metadata.txt to screen and then change the “P” in “Patient” to “Human_P”

cat metadata.txt | sed "s/P/Human_P/"

To replace every instance of the old pattern within each line g can be added after the last /. This stands for global therefore it changes the command to a global substitute.

Print metadata.txt to screen and change every occurrence of a number to “number”. To get the regular expression meaning of “+” it needs a “\” before the “+”

cat metadata.txt | sed "s/[0-9]\+/number/g"

The file metadata.txt is tab delimited (i.e. there is a tab between each column. Make a comma separated file containing the information of metadata.txt called metadata.csv (csv = comma separated value). \t presents a tab.

cat metadata.txt | sed "s/\t/,/g" > metadata.csv

For a very in depth look into the sed command please look at the following link: http://www.grymoire.com/Unix/Sed.html

12.6 Permissions

All files, directories and programs have permissions. It is important to know about this so you know your read, write and executability permissions for the content within machines.

Below is a useful link to learn about file permissions:
https://www.guru99.com/file-permissions.html

Intro to Linux