We now have seen the reads and the assembly with our own eyes.
In any bio-informatic step, learn yourself always to have a peek into the actual data you are producing.
Since you now have seen that the files look OK, we can start by mapping (aligning) the original reads in .fastq.gz
files back to the scaffolds created with the assembly; the scaffolds.fasta
file.
Doing this will allow us to calculate the depth on the scaffolds; a prerequisite for the binning procedure.
Algning is achieved with bwa.
Per sample, we map the reads against the scaffolds and save the mapping as a .bam
file.
bwa
, we first need to make an index file of the scaffolds. bwa
and immediately pipe the output of bwa
through samtools view
which will output a .bam
file ..bam
files to align to the assembly, we will make a for
loop that will iterate over the different samples we have. Rember that the reads are paired-end Illumina sequences, which means that for each DNA fragment, we have sequence data from both ends.
Therefore, the sequences are stored in two separate files (one for the data from each end).
We'll use bwa
with default settings to align the reads back to the scaffolds, and then samtools
allows us to work with the bwa results.
[Q:] Why this called back-mapping and not just mapping?
[A:]
[Q:] Why do we not use the depth in the scaffold names?
[A:]
The next assignment is a complicated one. Make sure to read the instructions carefully and remember what you learned about bash loops earlier.
When in doubt, don't forget to read the 'help page' of a command like so:
ls --help
or sometimes by executing the command without any arguments like so:
bwa
bwa index
For more elaborate information, read the manual with the man
command.
Unfortunately, this doesn't work well inside Jupyter notebooks.
man bwa
man samtools
[DO:] Create an index of the scaffolds with bwa index
**
Start by reading the bwa index
instructions.
ls
and auto completion to find your way to the right filebwa index
While this runs, think about what is happening.
Mapping always takes a query sequence and a reference sequence.
The reference is our assembly, specifically the scaffolds.fasta
file.
What bwa index
does, is take this reference and 'index' it.
This means as much as it converts this regular text fasta file, into a binary format that computers can efficiently search.
After creating the index, there should be multiple files in the assembly folder. Think about what command you could use to see if that is the case. Use the empty cell below.
[DO:] Check if the index files were created:
Besides the original scaffolds file, we now expect multiple other files.
All of these together comprise the index
we just made.
A bwa index of a fasta file can be seen as a binary representation of the fasta file that bwa can search efficiently.
It is not meant for human eyes, but purely for the computer algorithms to search through.
For this part, we will combine your skills on bash loops and variables to run bwa mem and samtools view on all the reads in data/reads and create BAM output files in a newly made directory.
Then we use a bash for loop to run samtools sort on all the files created by samtools view
Step by step instructions:
mkdir
bwa mem
samtools view
samtools view
, look for the options to convert to a BAM file and write the output to a file.It should look somewhat like this bwa mem argument argument.fastq | samtools view argument argument
If you forgot how a loop works, check the notebook of this morning' m1-jupyter_and_bash_basics.ipynb' .
Mapping each separate sample will take about 5 to 6 minutes.
If you think your loop works (read: it doesn't crash immediately), then check in the next notebook if the files are created and if they increase in size. Use ls -sh
.
[DO:] 1 First make your new directory here:
[DO:] 2 now see how this loop works:
# first I define the samples in a variable called 'samples'
samples=( L1 L2 L3 P1 P2 P3 )
# next I use this variable in my loop
for i in ${samples[@]}
do echo $i
done
[DO:] 3 Read the manual pages of bwa and samtools. Remember to find the usage lines.
bwa mem
samtools view
Now it's up to you! Here you have another variant of the loop I made above.
Substitute the ls
command for a bwa
command.
Also, note that you can use the variable inside a path!
[DO:] Map the paired-end reads of each sample to the index scaffolds and save the output as a .bam
file.
# first I define the samples in a variable called 'samples'
samples=( L1 L2 L3 P1 P2 P3 )
# next I use this variable in my loop
for i in ${samples[@]}
do bwa mem .... # <<<<---- Write your own code here
done
TIP: If you are working on your own computer and not some shared server, you can speed up the process substantially by using more threads (CPUs/cores) to do the mapping.
Find the amount of cores available on your computer with nproc
and read the bwa manual to use more threads.
[DO:] If the loop is running, then proceed to the next notebook.
Check with ls
if your files are being created and perhaps if they increase in size over time.
Then start preparing the next loop: sorting of the bam files.
Some background:
By default, mapping algorithms like bwa spit out a .sam file. The sam format is widely used and accepted by many different downstream programmes. Basically, it's just a big table with a standardised format. Although a sam file is human-readable, it is also rather bulky. The file size of a single SAM file will quickly exceed the disk space you have on this virtual machine. Therefore, we convert it to a BAM file, a Binary sAM file. There is no loss of information; it is just saved much more efficiently. Naturally, a binary file is not human readable. Always store your files as bam files. If you need to have a look in the bam file, you can view them using
samtools view
or any of the many downstream programmes available online.