We have discussed the raw data we will use in the general introduction notebook. Now we will use bash commands to look in these files.
The sequencing reads are already available in the folder 'data/reads'. To keep the practical feasible, we have made a subset of the original data.
In this reads folder, you will find 12 files. L and P stand for the leaf and plant samples. L1 L2 L3 are the biological replicates. R1 and R2 represent the forward and reverse Illumina reads. So 2 samples 3 replicates 2 directions = 12 files Files are stored in a fastq format (.fastq) which are compressed in gzip archives (.gz). The gzip compression format is widely used in the genomics field. Fastq files often compress down to only a quarter of the filesize with gzip! These days, you rarely ever need to extract these; all modern software can read these compressed fastq files.
[DO] Lets first double check if all files are available.
ls
command to see what files are present in the data/reads/
folder.fastq.gz
files
To decompress and print a gzipped file to your screen, you can use the zcat command like this
zcat path/to/file.gz
Note that the content of gzipped files often is quite big.
So big even that you may crash this webpage.
To prevent this from happening, 'pipe' the output of zcat
to head
to display only the first ten lines like so:
zcat path/to/file.gz | head
[DO] Now, in the cell below, check the first ten lines of one of the fastq files like so:
zcat
commandfastq.gz
files we have seen in ls
beforezcat
There is no need to be alarmed by the 'broken pipe' error message. Still, do you understand what the error means?
[Q] Do these look as we would expect?
[A]
Note that the regular cat
command prints regular files to your screen.
The zcat
command is just a variant of this for compressed files.
cat is short for conCATenate, zcat is short for something like gZipped-conCATenate
Now, I want you to find out how many reads there are in a single fastq file. You will achieve this in two steps.
First, using grep, we can select for lines in a file that contain only the word/key specified in the grep command. As you can see in the above command, the headers of the reads always start with the "@", therefor if we only want to see the headers, we can grep for the "@" using the command below:
[DO:] In the cell below, create a command that uses
zcat
to open the fastq.gz file.grep '@'
to print only lines with an '@'. So only the headers of the FastQ file.head
to show you only the first ten lines and thus not crash our webpage.Again, use pipes to feed the output from one programme to the next.
zcat
In the output you just created in step one, every line contains a header representing one sequence. The second step now is to count these lines. To count the number of lines, we will first have to read in the gzip file, then grep on the headers, and then we can use the word count command to count the number of lines.
[DO:] First read the 'help page' of the wc
command by typing in the cell below:
wc --help
What option do we need to provide wc
to count lines? Deduce from the manual page.
[DO:] Now Finally, make a new 'pipeline' using
zcat
to open the fastq.gz filegrep '@'
to filter out only headers wc
+ options to count the lines in the fastq file.zcat
[Q:] How many sequences are in the FastQ file, is this number remarkable?
[A]
So now you have executed some bash commands and investigated the output. Well done!
Calculating a de-novo metagenome assembly is beyond the scope of this practical, so we have done this for you. You will however, analyse this assembly yourself! :)
[DO:] Let's start by looking at the assembly files in the folder data/assembly
.
Now, look at the first ten lines of the assembly with the head
command.
Will you use zcat
to open the assembly?
Only if the assembly is a compressed scaffolds.fasta.gz
file.
If you are dealing with a 'regular' fasta file, use cat
.
Finally, after using head, also type | cut -c 1-500
This cuts out character 1 to 500.
If you skip that last step, you may overload this web page.
[DO:] Use (z)cat
, head
, and cut -c 1-500
to look at the first lines of the assembly'
Next, we will assess the number of sequences in the assembly. Remember how we did this for the FastQ file?
Also, remember that the headers for FastQ files and Fasta files are not the same.
(z)cat
the assembly filegrep
the headerswc
to count the lines.[DO:] Assess the number of scaffolds in the metagenome assembly.
[Q:] Do you remember from the lecture any reasons that assemblies may be split up over this amount of contigs?
[A]
For convenience, we will decompress the assembly.
Compression was achieved with the gzip
programme.
Decompression can be done with the command gunzip path/to/assembly
.
gunzip
only gives you output if something went wrong. So no | head
is needed here.zcat your-file.fastq.gz > yourfile.fastq
if gunzip doesn't work. This is the case for the CoCalc environmentls
command to see if it worked the file is still a .gz
file, or if it was decompressed.head command
to peak into the assembly like so head ./data/assembly/scaffolds.fasta | cut -c 1-500
[DO:] decompress the assembly and check if you were successfull as indicated above
gunzip
ls
head