We have discussed the raw data we will use in the general introduction notebook. Now we will use bash commands to look in these files.
The sequencing reads are already available in the folder 'data/reads'. To keep the practical feasible, we have made a subset of the original data.
In this reads folder, you will find 12 files. L and P stand for the leaf and plant samples. L1 L2 L3 are the biological replicates. R1 and R2 represent the forward and reverse Illumina reads. So 2 samples 3 replicates 2 directions = 12 files Files are stored in a fastq format (.fastq) which are compressed in gzip archives (.gz). The gzip compression format is widely used in the genomics field. Fastq files often compress down to only a quarter of the filesize with gzip! These days, you rarely ever need to extract these; all modern software can read these compressed fastq files.
[DO] Lets first double check if all files are available.
ls
command to see what files are present in the data/reads/
folder.fastq.gz
filesls data/reads
To decompress and print a gzipped file to your screen, you can use the zcat command like this
zcat path/to/file.gz
Note that the content of gzipped files often is quite big.
So big even that you may crash this webpage.
To prevent this from happening, 'pipe' the output of zcat
to head
to display only the first ten lines like so:
zcat path/to/file.gz | head
[DO] Now, in the cell below, check the first ten lines of one of the fastq files like so:
zcat
commandfastq.gz
files we have seen in ls
beforezcat data/reads/L1.R1.fastq.gz | head
There is no need to be alarmed by the 'broken pipe' error message. Still, do you understand what the error means?
zcat
wants to "push"the entireL1.R1.fastq.gz
file through the pipe tohead
. However, thehead
command closes the pipe after 10 lines (like it's supposed to).zcat
cannot do its job, and hence gives you an error message, from the perspective ofzcat
the pipe is broken.
[Q] Do these look as we would expect?
[A] Yes, the fastq file contains reads with information spread over four lines.:
We only see 10 lines because we used the head
command.
Note that the regular cat
command prints regular files to your screen.
The zcat
command is just a variant of this for compressed files.
cat is short for conCATenate, zcat is short for something like gZipped-conCATenate
Now, I want you to find out how many reads there are in a single fastq file. You will achieve this in two steps.
First, using grep, we can select for lines in a file that contain only the word/key specified in the grep command. As you can see in the above command, the headers of the reads always start with the "@", therefor if we only want to see the headers, we can grep for the "@" using the command below:
[DO:] In the cell below, create a command that uses
zcat
to open the fastq.gz file.grep '@'
to print only lines with an '@'. So only the headers of the FastQ file.head
to show you only the first ten lines and thus not crash our webpage.Again, use pipes to feed the output from one programme to the next.
zcat data/reads/L1.R1.fastq.gz | grep '@' | head
In the output you just created in step one, every line contains a header representing one sequence. The second step now is to count these lines. To count the number of lines, we will first have to read in the gzip file, then grep on the headers, and then we can use the word count command to count the number of lines.
[DO:] First read the 'help page' of the wc
command by typing in the cell below:
wc --help
wc --help
What option do we need to provide wc
to count lines? Deduce from the manual page.
[DO:] Now Finally, make a new 'pipeline' using
zcat
to open the fastq.gz filegrep '@'
to filter out only headers wc
+ options to count the lines in the fastq file.zcat data/reads/L1.R1.fastq.gz | grep '@' | wc -l
zcat data/reads/L1.R1.fastq.gz | grep '@' -c
[Q:] How many sequences are in the FastQ file, is this number remarkable?
[A] The FastQ file contains exactly 1 million reads. This is a remarkable round number and does not match with the number from the FastQC report (10613642 for Sample L1 forward). The reads for this practical were subsetted to 1 million, to make the steps a bit more feasible and quick.
So now you have executed some bash commands and investigated the output. Well done!
Calculating a de-novo metagenome assembly is beyond the scope of this practical, so we have done this for you. You will however, analyse this assembly yourself! :)
[DO:] Let's start by looking at the assembly files in the folder data/assembly
.
ls data/assembly
Now, look at the first ten lines of the assembly with the head
command.
Will you use zcat
to open the assembly?
Only if the assembly is a compressed scaffolds.fasta.gz
file.
If you are dealing with a 'regular' fasta file, use cat
.
Finally, after using head, also type | cut -c 1-500
This cuts out character 1 to 500.
If you skip that last step, you may overload this web page.
[DO:] Use (z)cat
, head
, and cut -c 1-500
to look at the first lines of the assembly'
zcat data/assembly/scaffolds.fasta.gz | cut -c 1-500 | head
Next, we will assess the number of sequences in the assembly. Remember how we did this for the FastQ file?
Also, remember that the headers for FastQ files and Fasta files are not the same.
(z)cat
the assembly filegrep
the headerswc
to count the lines.[DO:] Assess the number of scaffolds in the metagenome assembly.
zcat data/assembly/scaffolds.fasta.gz | grep '>' | wc -l
zcat data/assembly/scaffolds.fasta.gz | grep '>' -c
[Q:] Do you remember from the lecture any reasons that assemblies may be split up over this amount of contigs?
[A] In a (de bruijn) graph assembly, ambiguities may exist which results in unsolvable structures in the graph. These are for example:
This is a snapshot of the de bruijn graph that lead to this metagenome assembly. Assembly graphs can be visualised, filtered and searched through with Bandage.
For convenience, we will decompress the assembly.
Compression was achieved with the gzip
programme.
Decompression can be done with the command gunzip path/to/assembly
.
gunzip
only gives you output if something went wrong. So no | head
is needed here.zcat your-file.fastq.gz > yourfile.fastq
if gunzip doesn't work. This is the case for the CoCalc environmentls
command to see if it worked the file is still a .gz
file, or if it was decompressed.head command
to peak into the assembly like so head ./data/assembly/scaffolds.fasta | cut -c 1-500
[DO:] decompress the assembly and check if you were successfull as indicated above
gunzip data/assembly/scaffolds.fasta.gz
zcat data/assembly/scaffolds.fasta.gz > data/assembly/scaffolds.fasta
ls data/assembly
head data/assembly/scaffolds.fasta | cut -c 1-500