Mapping reads to scaffolds with BWA (Burrows-Wheeler Aligner) and sorting with Samtools

We now have seen the reads and the assembly with our own eyes. In any bio-informatic step, learn yourself always to have a peek into the actual data you are producing. Since you now have seen that the files look OK, we can start by mapping (aligning) the original reads in .fastq.gz files back to the scaffolds created with the assembly; the scaffolds.fasta file. Doing this will allow us to calculate the depth on the scaffolds; a prerequisite for the binning procedure.

Algning is achieved with bwa. Per sample, we map the reads against the scaffolds and save the mapping as a .bam file.

  1. To run bwa, we first need to make an index file of the scaffolds.
  2. Then we can run bwa and immediately pipe the output of bwa through samtools view which will output a .bam file .
  3. Finally, since we have quite a few .bam files to align to the assembly, we will make a for loop that will iterate over the different samples we have.

Rember that the reads are paired-end Illumina sequences, which means that for each DNA fragment, we have sequence data from both ends. Therefore, the sequences are stored in two separate files (one for the data from each end). We'll use bwa with default settings to align the reads back to the scaffolds, and then samtools allows us to work with the bwa results.

[Q:] Why this called back-mapping and not just mapping?

[A:]

[Q:] Why do we not use the depth in the scaffold names?

⤷ Click here for a hint Remember that in the back-mapping step, we map, store and process all samples individually.

[A:]

The next assignment is a complicated one. Make sure to read the instructions carefully and remember what you learned about bash loops earlier.

When in doubt, don't forget to read the 'help page' of a command like so:

ls --help

or sometimes by executing the command without any arguments like so:

bwa

bwa index

For more elaborate information, read the manual with the man command. Unfortunately, this doesn't work well inside Jupyter notebooks.

man bwa

man samtools

1 Index the assembly with bwa index

[DO:] Create an index of the scaffolds with bwa index** Start by reading the bwa index instructions.

  1. Remember to look for the 'usage' line first.
  2. You won't need any options here.
  3. Remember where the assembly was? We just gUnzipped it in the last notebook.
  4. use ls and auto completion to find your way to the right file
In [ ]:
bwa index

While this runs, think about what is happening. Mapping always takes a query sequence and a reference sequence. The reference is our assembly, specifically the scaffolds.fasta file. What bwa index does, is take this reference and 'index' it. This means as much as it converts this regular text fasta file, into a binary format that computers can efficiently search.

Check the output of bwa

After creating the index, there should be multiple files in the assembly folder. Think about what command you could use to see if that is the case. Use the empty cell below.

[DO:] Check if the index files were created:

In [ ]:

Besides the original scaffolds file, we now expect multiple other files. All of these together comprise the index we just made. A bwa index of a fasta file can be seen as a binary representation of the fasta file that bwa can search efficiently. It is not meant for human eyes, but purely for the computer algorithms to search through.

Run bwa mem for backmapping

For this part, we will combine your skills on bash loops and variables to run bwa mem and samtools view on all the reads in data/reads and create BAM output files in a newly made directory.

Then we use a bash for loop to run samtools sort on all the files created by samtools view

Step by step instructions:

  1. make a new directory for the samtools view results, give this a useful name
    1. I suggest ./data/mapped
    2. to make a directory, use the command mkdir
  2. See how the loop I pre-made for you works, play a bit to see if you understand correctly
  3. Don't forget to read the manual page of both! You can make extra cells to print these manual pages if you want to.
  4. finish the for loop and run both these commands in a pipe
    1. bwa mem
    2. samtools view
  5. bwa mem does not require any options,
  6. For samtools view, look for the options to convert to a BAM file and write the output to a file.

It should look somewhat like this bwa mem argument argument.fastq | samtools view argument argument

If you forgot how a loop works, check the notebook of this morning' m1-jupyter_and_bash_basics.ipynb' .

Mapping each separate sample will take about 5 to 6 minutes. If you think your loop works (read: it doesn't crash immediately), then check in the next notebook if the files are created and if they increase in size. Use ls -sh.

[DO:] 1 First make your new directory here:

In [ ]:

[DO:] 2 now see how this loop works:

In [ ]:
# first I define the samples in a variable called 'samples'
samples=( L1 L2 L3 P1 P2 P3 )
# next I use this variable in my loop
for i in ${samples[@]}
    do echo $i  
done

[DO:] 3 Read the manual pages of bwa and samtools. Remember to find the usage lines.

In [ ]:
bwa mem
In [ ]:
samtools view

Now it's up to you! Here you have another variant of the loop I made above. Substitute the ls command for a bwa command. Also, note that you can use the variable inside a path!

[DO:] Map the paired-end reads of each sample to the index scaffolds and save the output as a .bam file.

In [ ]:
# first I define the samples in a variable called 'samples'
samples=( L1 L2 L3 P1 P2 P3 )
# next I use this variable in my loop
for i in ${samples[@]}
    do bwa mem ....  # <<<<---- Write your own code here
done

TIP: If you are working on your own computer and not some shared server, you can speed up the process substantially by using more threads (CPUs/cores) to do the mapping. Find the amount of cores available on your computer with nproc and read the bwa manual to use more threads.

[DO:] If the loop is running, then proceed to the next notebook. Check with ls if your files are being created and perhaps if they increase in size over time. Then start preparing the next loop: sorting of the bam files.

Some background:

By default, mapping algorithms like bwa spit out a .sam file. The sam format is widely used and accepted by many different downstream programmes. Basically, it's just a big table with a standardised format. Although a sam file is human-readable, it is also rather bulky. The file size of a single SAM file will quickly exceed the disk space you have on this virtual machine. Therefore, we convert it to a BAM file, a Binary sAM file. There is no loss of information; it is just saved much more efficiently. Naturally, a binary file is not human readable. Always store your files as bam files. If you need to have a look in the bam file, you can view them using samtools view or any of the many downstream programmes available online.