Sorting bam files

BAM files are Binary sAM files. SAM files are nothing more than big tables that tell us what read from the FastQ file, mapped where exactly on the scaffolds of the metagenome assembly. The rows of this table that we call a SAM file are not ordered in any logical way. When just mapped, the order resembles the (random) order of the reads in the original fastq file.

For many computational purposes, we want to sort/order these rows according to

  1. the scaffold they mapped on
  2. the position on that scaffold. (position in bp)

We will achieve this with the samtools program. As the name suggests, samtools comprises many tools to deal with SAM (or BAM) files: one of which is the samtools sort tool. This tool we will use to sort our BAM files.

samtools also contains the samtools view tool we used earlier. Samtools view is used to convert SAM to BAM and also BAM to SAM.

Before we proceed, let's have a quick look if the BAM files are created as we expected.

[DO:] check the mapped directory with ls and peek inside the bam files with samtools view.

In [1]:
ls ./data/mapped
L1.bam  L2.bam  L3.bam  P1.bam  P2.bam  P3.bam
In [2]:
samtools view ./data/mapped/L1.bam | head
ERR2114809.1	99	NODE_472_length_18986_cov_293.449_ID_23957348	15238	60	149M	=	15505	418	TNTCTACCTATAACTAAGGCAATCAGCGCAATCAATTCCCAAAATGAACCTGTGGGAAAGAAAAGACGAGAAATTAGTACCGCCGCAATGCCTTTGAAACCTTCAGACAGCACTGCCAAAATACCCACCAATTTGCCGCCGTGATAAAA	A#AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEE<EEEEEEAEEEEEEE/EE<EEEEAEEEEEEEEEAEEEEEEAA<<AAEAEEAA<666<AAAAAAEEE	NM:i:1	MD:Z:1A147	MC:Z:151M	AS:i:147	XS:i:0
ERR2114809.1	147	NODE_472_length_18986_cov_293.449_ID_23957348	15505	60	151M	=	15238	-418	ACTANANNACCCNATNGATGAAACATTTATTGTTTAATGTGAATTTTAGACCTATCTTCTATCTGAAATTGTCAATTAGAAATATAAATTTTGATAAAAATTTTTTTGATAGTTTGATTGCTTGGTTTATGGACACTAAGCAGACACGNNN	66</#/##EEA/#</#6EEEEEEEEEEEAEEEEEEAEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEAA/EEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAA###	NM:i:10	MD:Z:4A0T0A0A4C1G0A132G0A0A0	MC:Z:149M	AS:i:132	XS:i:19
ERR2114809.2	83	NODE_1779_length_7357_cov_353.786_ID_23982227	3055	60	150M	=	2650	-555	TACTCACTTTTAGAAACATGGCCACTATGGGCATAAATAACATGAAAGTTAAGTAGATAATTGTAATTCGCCAAGTCCAGGGTAAATTAATGAGATTATTTATAACTACCTTCCAAATAGAAGGTTTATTCTTAGTGTCTACAGCAGANG	EEEEAAA<EA<EEEEEEEEA<EEEEE<A6A<AAE/AA<EEEAEEEEEEEEEEEEEAEEAAEEEEEEAEEEEEEEEEEEEEAEEEEEAEEEEEEEEE<EEEEEAEEEEEAEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAAA#A	NM:i:1	MD:Z:148A1	MC:Z:151M	AS:i:148	XS:i:0
ERR2114809.2	163	NODE_1779_length_7357_cov_353.786_ID_23982227	2650	60	151M	=	3055	555	NNNAAGCGGCTTCTTCAATTTCTTGTTCCATTTCTTGCAGTACAGGTTGTACAGTTCTCACCACGAAAGGTAAGGAAATAAATATCATTGCTACCCCAACTCCTAAGCGGGTAAATGATACTCTAATTCCCAAAGNTGNTAATNNANAACC	###AAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EA<EEEAAEEEEEAEEEAEEEAEEEEEEEEEEAEEEE//EAE<</AAEEEAAEAEEAEA<</A6AAEEEE<66AA#EE#EEEE##/#6/A<	NM:i:8	MD:Z:0T0C0C132G2C4A0A1G4	MC:Z:150M	AS:i:138	XS:i:32
ERR2114809.3	77	*	0	0	*	*	0	0	CNACCAATCGCAATGGAGCCGCGCGTCGGCTGCTCCAGTCCCGAGATCATTGTCAGGAGCGTCGATTTAAGGCATCCAGAAGGCCCGACAAGAACCGTGAACGCCCCATGCGGCACGTTCAGGGTGAGGGGGGGATGGATCTCTAGCGCTC	A#AAAEEEAEEEEEE/EEAAEEAEE/AAA/EAAAA/EEEA/6A/EEA<A///////E////A/6/////////E///E/EE/////E<///A////A/E///E//////A/AA//</////<///A/<E///<A///////////A/A///	AS:i:0	XS:i:0
ERR2114809.3	141	*	0	0	*	*	0	0	NNNCGGGGACGGAGGTTAGCGGCAGCTGGGCGAGCCGCTTCGCGGAGGGTCAAGCCGCGATGACTGCGCGCAAAGTCGGCAAGGGTAATGTGATCTGCGTGGGCACCTACCTCCGCGGTGCTCGGGTCGAGGTGGNGGNCGATNNANTGTT	###//A/E/EEE/6E//A//AE//E/E/EEEE/A//E///E///E/<EE////AE<//</A//////E/E////E//EE///EEA/A/A//A/<///////AA///A///////E/6//6<<E/AA////EE///#//#/E/6##/#EE</	AS:i:0	XS:i:0
ERR2114809.4	77	*	0	0	*	*	0	0	ANATCATCACATGGTGTGAATCCCAAATGGGTGTGGATGATGTTTTTGGATTGGCGCAAAATAGTCGTTTAATTGGGATGACTAGAAAGACTCAAAGTAGAGCATCCCTTGAGTTTGAGGAAAAACTATCAACAGTAGTTTCATTCTT	A#AAAEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEAEEEEEEEEEEEAEEEA/EEAA<A<AAE<EEEAEEEEEAEE	AS:i:0	XS:i:0
ERR2114809.4	141	*	0	0	*	*	0	0	NNNAAATAAGTAAAAGGCAATTTAACTTCAAATTAAGCAGTATTAGTCAGCATTATTAAGCGTTTATGAGCAGTAGCAAAGATATGTCTGTATGGGCAAGAACTAGTAATCGCAATTAAAATGGGACGTGTACTTNAANTAATNNANTCTC	###AAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEAEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEAEAEE/EAAEEE/<AAA/A</6A/AA/E6AAEEEE<AE#/A#EEEE##/#/<AA	AS:i:0	XS:i:0
ERR2114809.5	99	NODE_1543_length_8081_cov_287.47_ID_23992363	4428	60	151M	=	4793	516	ANGAAATAGAAGCAACATTAGGCTATCCCTATTTTGTTAAACCTGCTAATCTGGGTTCATCGGTAGGTATTGCCAAAGTGCAATCGCGCCAAGAATTAGAAGCAGCTTTAGATAATGCCGCAACCTATGATAGACGCATCATTGTTGAAGC	A#AAAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEA/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEAAEEAAAAE<AEEEEEEEEEEEAA<AAEEEEEEEEEA	NM:i:1	MD:Z:1T149	MC:Z:151M	AS:i:149	XS:i:0
ERR2114809.5	147	NODE_1543_length_8081_cov_287.47_ID_23992363	4793	60	151M	=	4428	-516	GCCGNTNNTTTANCCCGGGTAGACTTTTTCTATGTAGAAGCTACCGGAGAAATCTTAATCAACGAAATTAACACCTTACCAGGATTTACATCCACCAGTATGTATCCTCAACTTTGGGCAAATACTGGTATTTCTTTTCCTAAATTAGNNN	/EAE#E##AEEE#EEAEEA<AEEE<E66EEEEAE<EAAEEEEEEAEEEEEEAEEAAA<EEEEEEEEAEEEEEEEEEAEEEEAEEAEEEE<EEEEEE<EAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAA###	NM:i:7	MD:Z:4C1G0G4G135T0A0G0	MC:Z:151M	AS:i:140	XS:i:0
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1

You should see a lot of tab-delimited names, numbers and sequences.

[DO:] Google how a sam/bam file should look and see if it corresponds to what you get.

the official specifications:"https://samtools.github.io/hts-specs/SAMv1.pdf

wikipedia: https://en.wikipedia.org/wiki/SAM_(file_format)

[Q:] What are the headers of a sam/bam file?

[A:]

Col Field Type Brief description
1 QNAME String Query template NAME
2 FLAG Int bitwise FLAG
3 RNAME String References sequence NAME
4 POS Int 1- based leftmost mapping POSition
5 MAPQ Int MAPping Quality
6 CIGAR String CIGAR string
7 RNEXT String Ref. name of the mate/next read
8 PNEXT Int Position of the mate/next read
9 TLEN Int observed Template LENgth
10 SEQ String segment SEQuence
11 QUAL String ASCII of Phred-scaled base QUALity+33

Another loop

We have seen how to make a loop in the backmapping part of this practical. Now let's do the same and sort the BAM files we created earlier.

Before you start, create a new folder where you store your sorted BAM files. I suggest something like data/sorted

The cell below contains a copy of the loop from the previous notebook. Edit in a way so that the loop sorts your bam files.

  1. Make sure to do this step by step. Test every little thing you change in the loop
  2. Don't forget to very carefully read the help page of samtools sort
    • You want to sort the reads by coordinate (which is the default), not by name.

Make sure you only use one CPU/thread; we have to share this computer with all of us.

[DO:] make a new directory to store your sorted bam files

In [3]:
mkdir data/sorted
In [4]:
samtools sort
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
  --no-PG    do not add a PG line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --verbosity INT
               Set level of verbosity

[DO:] use samtools sort to sort the bamfiles. Make sure to store them in your new directory.

In [6]:
samples=( L1 L2 L3 P1 P2 P3 )
for i in ${samples[@]}
    do samtools sort -o data/sorted/$i.sorted.bam data/mapped/$i.bam
done

check

[DO:] After sorting, check whether your bam files are sorted correctly.

  1. Use ls --size to see if the files exist and have a proportional size
  2. run samtools view to view your BAM files.
In [7]:
ls -s1 ./data/mapped
total 1354552
226972 L1.bam
226456 L2.bam
226860 L3.bam
224096 P1.bam
222840 P2.bam
227328 P3.bam
In [8]:
ls -s1 ./data/sorted
total 944700
147536 L1.sorted.bam
146044 L2.sorted.bam
146656 L3.sorted.bam
169244 P1.sorted.bam
165216 P2.sorted.bam
170004 P3.sorted.bam
In [11]:
samtools view data/sorted/L1.sorted.bam | head
ERR2114809.663052	145	NODE_1_length_1935275_cov_24.6805_ID_23901540	18	60	151M	=	1885325	1885158	CAATGCGGCTTGTTACGCTCAAATTTAGCCTTCGCCATGTCCGTACAATCCTAAAAACCAGAATTGAAATCGTATCTTAACTACTTTGCCGTAAACCGGTCAGGGACCGATCCACCGCGAAACTCTGGAGCGGGTGACGGGAATCGAACCC	6/E/AAEEAEEE/<EE//EEAE//<</<EEEEEEE/EEEAEEE/EEAEAEEA/6/E<<EEEEAE/EEEEEEEAEEAE<A/EE/AAEEE/EEEEA/EEEEEEAEE/EAE<EEEEEEEE/EEEAA/AEEEEEEEEE6EEAEEAE//EEAAAAA	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:38
ERR2114809.878075	163	NODE_1_length_1935275_cov_24.6805_ID_23901540	65	60	151M	=	579	665	ATCCTAAAAACCAGAATTGAAATCGTATCTTAACTACTTTGCCGTAAACCGGTCAGGGACCGATCCACCGCGAAACTCTGCACCGGGTGACGGGAATCGAACCCGCGTAGCCAGCTTGGCAGGCTGGCGCTCTACCAATGAGCTACCCCCG	AAAAAAAEE6EEA/E6EEEEE/EEEAE///A/AAA/EEEEAEEEA/EAEE6E/AEEEE/E/E/EEE/EE/A///EE/EEA/</AA<E//<E</EE/<EAAAEEEEE<6</EE<AE/E//6/AAA<<A//A</</AAA/<AA6A////AAA/	NM:i:6	MD:Z:80G1G36A7A9T8A4	MC:Z:151M	AS:i:121	XS:i:20
ERR2114809.878075	83	NODE_1_length_1935275_cov_24.6805_ID_23901540	579	60	151M	=	65	-665	GCCAAAGACCCCAACCTCCCCAGCTGACACGTTCGAAGGCCGGCGGTCCGGGAAATTCCGTGGGGACGGTCCGCGACAGGGGCGGAATCGTGATGATCATGCGCGCGGGAGTGAAACAAAATTACCGCCTGAGCTGTCCGGGCATGCTGGC	AA<<6/A//<<AAA/E/EE/E///E/AEE/E<EEEEE/A<<<//<</EAEEE<EE/<6A/EA<EE/EAAEE/E/<E<<AA<///AAA/EA/AAEAEE/A/EAA/E<EAEEAEEEEEEE///EE/EEEEE/EEA66EAEEEEE/EEEA/A6A	NM:i:4	MD:Z:14A7T6A51T69	MC:Z:151M	AS:i:131	XS:i:0
ERR2114809.11615	163	NODE_1_length_1935275_cov_24.6805_ID_23901540	658	60	151M	=	1062	555	GGTCGGAATCGTGATGATCATGCGCGCGGGAGTGAAACAAAATTACCGCCTGAGCTGTCCGGGCATGCTGGCGGTGGGTCCGGCAGTTACTGGATTTTCGGGCATCATGCGGTGGAAGAGGCACTTAAGAACCCGAAGCGGCGAATTCATC	AAAAAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEEEEEEEEAEAEEEEEEAA<EE/EEEEE<AEEEEEEEAEE/EE	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:0
ERR2114809.11615	83	NODE_1_length_1935275_cov_24.6805_ID_23901540	1062	60	151M	=	658	-555	GCGCTCGGCGGCGGCCTTCGGGGCGATCGCAGTGATCCTGACTGAGCGCCATGCAGCGCCGGAAAGCGGGATCCTTGCCAAATCGGCCTCCGGCGCCCTGGAACATGTTCCGCTGGTCCGCGTTACCAACCTCGCCCGCGCCATGGAGGAA	<AAA<<<<6<<<<AEEEEAA<AAAAEAAAEEEEEEEEEEEEEEEEAEEEEEEAEEEEAEEAEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:0
ERR2114809.368289	99	NODE_1_length_1935275_cov_24.6805_ID_23901540	2919	60	150M	=	3317	549	AAGGTTTCCGACGTGGTCGCTTCATGGGGCGATTTCAAACGCGACGAAGTCAACGTCTCAGCACTCGGCCACAACAACGCTGCAGCGGTGATGATTATGGATCGCGCCGGCTGGCGCTGATCTCCATTCGCAGAAAAAAGCCGCCGGTGA	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEE/EEEEEEAEEEEEEEEEE<EAEE/EEEEEEEEEEEEEEEEEEEEEEEEEEE<<AA<AEEAEEAAAAEEAEAAEAEEEE<AEEEEEEE<<<AA<<AEEA	NM:i:0	MD:Z:150	MC:Z:151M	AS:i:150	XS:i:0
ERR2114809.368289	147	NODE_1_length_1935275_cov_24.6805_ID_23901540	3317	60	151M	=	2919	-549	CCTGCAGCGACGGATCCAATTTGGGTGATGCTCATCGCACACCCTCCTAGTTCTGGTTTCAGTTAGCTTCAACATTGGACCGATTGGCCCATAGCGCAAGCATTATCGATTACGAATTTAATCGAAGTTTAAGTAATTTCCTTTCGTATCA	<EEEEAAAAAEEEEEEEEEEEEEEE/E<EEEEEEEEEEEEEEEEAEEEEEAEEEEEEEAEEE/EEEEEEEEEEEAE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA	NM:i:0	MD:Z:151	MC:Z:150M	AS:i:151	XS:i:0
ERR2114809.878692	99	NODE_1_length_1935275_cov_24.6805_ID_23901540	7493	60	146M	=	7886	543	CACGAGCGAGGCTCGCGTCAATGAGTACGCCAAGGATATTCACTCCAGCGCGCGGATCCTGTTTCGCTTGATCGAGGAAATCCTCGACTACTCCAAACTCGAATCAGGCAAGGCTCATATCGACGAAAATCAGGTCGATCTTGCCG	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEEEEEEEEEEEEEAEEEEEEAEEEEEE<EEEEEEEEEEE<EEEEAEEE<AEEEEEAEE<EEAE/EEEEAEEAEEE<EE<<EAEEEEE<<AEAEAA<AAA	NM:i:0	MD:Z:146	MC:Z:150M	AS:i:146	XS:i:0
ERR2114809.878692	147	NODE_1_length_1935275_cov_24.6805_ID_23901540	7886	60	150M	=	7493	-543	ACCGGCGAAAGACATCGAACGGGTGTTCGAACCATTCGTCCAGGTCAACCGCTCTGCCTTCCATCAACAGGAAGGAACTGGTCTCGGCTTGGCCATATGCCGCAGCCTTATCGAACTTCATCAGGGGCGAATCGAGATCTCCAGCCGCCC	AEAEEEEAEAEEEAEEEEAEAAEEE/<AEEEEEEEEAEEEEEE<AEEEEEEEEEEEAEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA	NM:i:0	MD:Z:150	MC:Z:146M	AS:i:150	XS:i:0
ERR2114809.651834	163	NODE_1_length_1935275_cov_24.6805_ID_23901540	8365	60	151M	=	8622	408	GTGCCTCGCGCGGACATAGCCTGCAAGATCGCGGCACTATTTAAGAACAATGTCAGTAACGTCAGTTATGGGTGAAGGTTGCGTTGAGGATGAAATAGATCGTATAGCATGATAAATTAGGAATATTTAGTGACAGGGACGATGCGCAATC	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEAAEEAEEEEEEAEE<<EAAEAEEEEEE<<EEEEEEE<EEEAAEE6AEEEA/EA	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:0
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1

clean

Did your bam files sort correctly? Then remove the unsorted bam files. We don't need these anymore and we save some disk space.

[DO:] If you are sure, then remove the mapped directory by adding the appropriate option to the command below

In [13]:
rm ./data/mapped -rf