Chapter 12 Advanced Linux practice
We have covered a small amount of Linux coding. This should be sufficient to carry out our future workshops but if you were to continue in bioinformatics we would recommend learning more advanced methods.
Below are some short sections to introduce you to some more advanced linux coding techniques. These give you a quick overview and some examples. This will hopefully put you in a good position to allow you to to learn these techniques in more depth outside of this workshop.
The following sections will all be run with the files in the directory "~/Linux/advanced_practice/". Therefore ensure you are in this directory before running the below examples. This contains fastq and txt files for 20 samples. Each sample contains a fastq file and a txt file for uncorrected and corrected reads. These fastq files are single end (i.e. there are no reverse/R2 reads).
12.1 Wildcard characters
These are characters that can be used to represent a variety of other characters. This can be useful for deleting many files, searching for files with specific patterns in their names and more.
Be very careful when using wildcard character with the command rm
.
Three basic and useful wildcards include:
*
- This represents zero or more characters
?
- This represents a single character
[]
- This represents a range of characters
Below are various examples you can run to show wildcards in action.
List all the files and directories in the working directory
List all files ending in “.fastq”
List all files ending in “.txt”
List all files with the string “corrected” somewhere in the file name
List all files with the string “corrected” somewhere in the file name and it ends in “fastq”
List all files that begin with “sample_2”
List all files that begin with “sample_2_”
List the files that begin with “sample_1” and ends with “AAAA.txt”. It may have zero or more characters between these two.
List the fastq files of samples with a single digit number
List the txt files of samples with a number in the tens
List the txt files for samples 3,4,5,6 & 7 i.e. 3-7
List the txt files for the non corrected information of samples with single digits.
List the corrected txt files for samples with numbers divisible by 10.
12.2 Redirection
Redirection allows you to put the output of a command to a file. The redirect symbol is >
. Be careful when redirecting as it will overwrite any existing files. To append to the bottom of a file use >>
.
Below are various examples of redirecting in action.
Create a file called ecoli.tmp containing the text “I am escherichia coli”
Create a file called pcryohalolentis.tmp containing the text “I am psychrobacter cryohalolentis”
Create a new file called bacteria.tmp which will contain the text from ecoli.tmp and pcryohalolentis.tmp
Create a file called vcholerae.tmp containing the text “I am not ecoli or pcryohalolentis”
cat
the file vcholerae.tmp and redirect it to bacteria.tmp.
Look at the contents of bacteria.tmp
This has removed the ecoli and pcryohalolentis lines. Append the contents of ecoli.tmp and pcryohalolentis.tmp to bacteria.tmp and then check the file
Put information regarding number of lines of all the fastq files into a new file called fastq_lines.tmp
Now delete all the files that were created in the above examples. Again be very careful about using the rm
command with wildcards.
12.3 Pipes
Pipes allow you to put the output of one command to the input of another. For example you could use grep
to get all the lines with a certain string and pipe the output to wc
to count the number of lines that have the specific string.
The pipe symbol is |
. This is normally found on your keyboard directly left of the Z key. Weirdly the symbol is represented by |
but split in the middle on some keyboards.
A useful tip when building up longer pipes is to start with a smaller amount of data and check the output of each step as you go. To do this you could use head
instead of cat
whilst testing.
Below are various examples of piping in action
Print to screen the second last fastq entry of the file sample_20_ATAC_corrected.fastq
Note: In the above command the tail
command is working on the output of the cat
command. Therefore this would not work to get the second last fastq entry of multiple files. For example the following command would print the second last fastq entry of the last fastq file (i.e. sample_9_AAGA.fastq due to file ordering)
Count the number of lines within all the fastq files
Count the number of lines which contain the text "TAG" within all the fastq files
12.4 Regular expressions
Regular expressions are similar to wildcard characters but more complex and used for commands like grep
and sed
.
Below are a basic set of regular expressions:
.
: A single character
?
: The preceding character matches 1 or 0 times
*
: The preceding character matches zero or more times
+
: The preceding character matches one or more times
{n}
: The preceding character matches exactly n times
{n,m}
: The preceding character matches n to m times
[AT]
: The character is one of the characters in the brackets
[^CG]
: The character is not one of those in the brackets
[1-7]
: The character is 1,2,3,4,5,6 or 7. This works with letters too.
()
: Group several characters into one
|
: Logical OR operator
^
: Matches the beginning of the line
$
: Matches the end of the line
Below are various examples of regular expressions in action.
Look at the contents of metadata.txt
Print out the lines for the Healthy patients
Print out the lines for the IBD patients from Craigavon and Belfast. In the below command \
is used to allow |
to be used as an or operator instead of acting as a string to match.
Print out the lines for the Pre information of patients not from Edinburgh or Aberdeen
Print out the lines for patients 1,2,3 and 4
Print to screen every fastq header of file sample_15_AACG_corrected.fastq
In the piping examples we counted the number of lines with the text “TAG” within the fastq files. However this also counted fastq headers due to the name of the samples. Let us use a regular expression to only count the number of sequences within the fastq files that contain “TAG”.
Print to screen every line within the file sample_3_AAAG_corrected.fastq that has a possible Threonine codon in the forward direction.
In the above example fastq quality lines are also extracted as some of them also contain the pattern we are searching for. To get around this we can pipe. First grep
the fastq quality header (i.e. +), as no other line only contains “+”, and the line before it. Then we can remove lines with a plus with an invert grep
. Finally we can grep
for the threonine pattern using only the sequence lines. Let us build this up step by step.
Print to screen the fastq quality header plus the one line preceding each (i.e. Sequence line) for file sample_3_AAAG_corrected.fastq.
Now pipe this output so it removes the lines with “+” (fastq quality headers) and “--” (separators of each grep
match provided by grep because of the -B 1 option).
Now from this output, grep
for the Threonine pattern plus colour each match within the line with the option “--color”.
Let us repeat the above but add the possibility of the threonine being in the reverse direction.
grep -B 1 "^+$" sample_3_AAAG_corrected.fastq | grep -v "+\|--" \
| grep --color "AC[ACTG]\|[ACTG]CA"
Resources to learn more in the future
Rex Egg, A good resource to learn more about regular expressions:
Cheatsheet:
https://www.rexegg.com/regex-quickstart.html
Regex Crossword, A online game like soduku that is useful to practice regular expressions. Best used in conjunction with the above cheat sheet:
12.5 sed
This is a complicated yet powerful command that can be used to edit text files quickly and efficiently. The main use is to substitute text with other text. sed
can be used with regular expressions.
The basic outline of a sed
substitute command is as below. In the below case s/
signifies that sed
will be used for substitution
Below are some examples of sed in action.
Print out a list of all the sample names using the fastq files
First print out the contents of the file metadata.txt
Print out metadata.txt and change IBD to DISEASE without altering the file
or
sed
is case sensitive and will by default only replace the first instance it finds within each line.
Print metadata.txt to screen and then change the “P” in “Patient” to “Human_P”
To replace every instance of the old pattern within each line g
can be added after the last /
. This stands for global therefore it changes the command to a global substitute.
Print metadata.txt to screen and change every occurrence of a number to “number”. To get the regular expression meaning of “+” it needs a “\
” before the “+”
The file metadata.txt is tab delimited (i.e. there is a tab between each column. Make a comma separated file containing the information of metadata.txt called metadata.csv (csv = comma separated value). \t
presents a tab.
For a very in depth look into the sed command please look at the following link: http://www.grymoire.com/Unix/Sed.html
12.6 Permissions
All files, directories and programs have permissions. It is important to know about this so you know your read, write and executability permissions for the content within machines.
Below is a useful link to learn about file permissions:
https://www.guru99.com/file-permissions.html
12.7 MCQs: Advanced linux
Please attempt to answer the below Multiple-Choice Questions to reinforce what you have learnt in this chapter.
- Which symbol is used to pipe the output of one command to the input of another?
- What command can be used to substitute text with other text?
- Which symbol is used to redirect the output of a command to a file?
- Which wildcard represents a single character?
- Which wildcard represents zero or more characters?
- Which wildcard represents a range of characters?
- Which regular expression indicates that the preceding character matches 1 or 0 times?
- Which regular expression represents a single character?
- Which regular expression indicates that the preceding character matches zero or more times?
- Which regular expression matches the end of the line?
- Which regular expression is a logical OR operator?
- Which regular expression matches the start of the line?