We have discussed the raw data we will use in the general introduction notebook. Now we will use bash commands to look in these files.
The sequencing reads are already available in the folder 'data/reads'. To keep the practical feasible, we have made a subset of the original data.
In this reads folder, you will find 12 files. L and P stand for the leaf and plant samples. L1 L2 L3 are the biological replicates. R1 and R2 represent the forward and reverse Illumina reads. So 2 samples 3 replicates 2 directions = 12 files Files are stored in a fastq format (.fastq) which are compressed in gzip archives (.gz). The gzip compression format is widely used in the genomics field. Fastq files often compress down to only a quarter of the filesize with gzip! These days, you rarely ever need to extract these; all modern software can read these compressed fastq files.
[DO] Lets first double check if all files are available.
ls command to see what files are present in the data/reads/ folder.fastq.gz filesls data/reads
To decompress and print a gzipped file to your screen, you can use the zcat command like this
zcat path/to/file.gz
Note that the content of gzipped files often is quite big.
So big even that you may crash this webpage.
To prevent this from happening, 'pipe' the output of zcat to head to display only the first ten lines like so:
zcat path/to/file.gz | head
[DO] Now, in the cell below, check the first ten lines of one of the fastq files like so:
zcat commandfastq.gz files we have seen in ls beforezcat data/reads/L1.R1.fastq.gz | head
There is no need to be alarmed by the 'broken pipe' error message. Still, do you understand what the error means?
zcatwants to "push"the entireL1.R1.fastq.gzfile through the pipe tohead. However, theheadcommand closes the pipe after 10 lines (like it's supposed to).zcatcannot do its job, and hence gives you an error message, from the perspective ofzcatthe pipe is broken.
[Q] Do these look as we would expect?
[A] Yes, the fastq file contains reads with information spread over four lines.:
We only see 10 lines because we used the head command.
Note that the regular cat command prints regular files to your screen.
The zcat command is just a variant of this for compressed files.
cat is short for conCATenate, zcat is short for something like gZipped-conCATenate
Now, I want you to find out how many reads there are in a single fastq file. You will achieve this in two steps.
First, using grep, we can select for lines in a file that contain only the word/key specified in the grep command. As you can see in the above command, the headers of the reads always start with the "@", therefor if we only want to see the headers, we can grep for the "@" using the command below:
[DO:] In the cell below, create a command that uses
zcat to open the fastq.gz file.grep '@' to print only lines with an '@'. So only the headers of the FastQ file.head to show you only the first ten lines and thus not crash our webpage.Again, use pipes to feed the output from one programme to the next.
zcat data/reads/L1.R1.fastq.gz | grep '@' | head
In the output you just created in step one, every line contains a header representing one sequence. The second step now is to count these lines. To count the number of lines, we will first have to read in the gzip file, then grep on the headers, and then we can use the word count command to count the number of lines.
[DO:] First read the 'help page' of the wc command by typing in the cell below:
wc --help
wc --help
What option do we need to provide wc to count lines? Deduce from the manual page.
[DO:] Now Finally, make a new 'pipeline' using
zcat to open the fastq.gz filegrep '@' to filter out only headers wc + options to count the lines in the fastq file.zcat data/reads/L1.R1.fastq.gz | grep '@' | wc -l
zcat data/reads/L1.R1.fastq.gz | grep '@' -c
[Q:] How many sequences are in the FastQ file, is this number remarkable?
[A] The FastQ file contains exactly 1 million reads. This is a remarkable round number and does not match with the number from the FastQC report (10613642 for Sample L1 forward). The reads for this practical were subsetted to 1 million, to make the steps a bit more feasible and quick.
So now you have executed some bash commands and investigated the output. Well done!
Calculating a de-novo metagenome assembly is beyond the scope of this practical, so we have done this for you. You will however, analyse this assembly yourself! :)
[DO:] Let's start by looking at the assembly files in the folder data/assembly.
ls data/assembly
Now, look at the first ten lines of the assembly with the head command.
Will you use zcat to open the assembly?
Only if the assembly is a compressed scaffolds.fasta.gz file.
If you are dealing with a 'regular' fasta file, use cat.
Finally, after using head, also type | cut -c 1-500 This cuts out character 1 to 500.
If you skip that last step, you may overload this web page.
[DO:] Use (z)cat, head, and cut -c 1-500 to look at the first lines of the assembly'
zcat data/assembly/scaffolds.fasta.gz | cut -c 1-500 | head
Next, we will assess the number of sequences in the assembly. Remember how we did this for the FastQ file?
Also, remember that the headers for FastQ files and Fasta files are not the same.
(z)cat the assembly filegrep the headerswc to count the lines.[DO:] Assess the number of scaffolds in the metagenome assembly.
zcat data/assembly/scaffolds.fasta.gz | grep '>' | wc -l
zcat data/assembly/scaffolds.fasta.gz | grep '>' -c
[Q:] Do you remember from the lecture any reasons that assemblies may be split up over this amount of contigs?
[A] In a (de bruijn) graph assembly, ambiguities may exist which results in unsolvable structures in the graph. These are for example:
This is a snapshot of the de bruijn graph that lead to this metagenome assembly. Assembly graphs can be visualised, filtered and searched through with Bandage.
For convenience, we will decompress the assembly.
Compression was achieved with the gzip programme.
Decompression can be done with the command gunzip path/to/assembly.
gunzip only gives you output if something went wrong. So no | head is needed here.zcat your-file.fastq.gz > yourfile.fastq if gunzip doesn't work. This is the case for the CoCalc environmentls command to see if it worked the file is still a .gz file, or if it was decompressed.head command to peak into the assembly like so head ./data/assembly/scaffolds.fasta | cut -c 1-500[DO:] decompress the assembly and check if you were successfull as indicated above
gunzip data/assembly/scaffolds.fasta.gz
zcat data/assembly/scaffolds.fasta.gz > data/assembly/scaffolds.fasta
ls data/assembly
head data/assembly/scaffolds.fasta | cut -c 1-500