Binning part 1: calculate depth

Now we have all the ingredients to continue binning: the scaffolds and the bam files containing the reads mapped to those scaffolds. In metagenomics, binning is the process of grouping reads or contigs and assigning them to operational taxonomic units. Binning methods can be based on compositional features, alignment (similarity), or both. metabat2 uses both the contig depth and tetranucleotide frequencies to bin the contigs. Ideally, every bin represents one microbial genome from one particular microbe that was present in the original DNA extraction.

The first step in the binning process is to calculate the contig depths from all the bam files created earlier. All these depths are stored in one big table, which is then passed to metabat2. We achieve this with a script that ships with metabat2: jgi_summarize_bam_contig_depths.

[DO:] See how the jgi_summarize_bam_contig_depths script works:

In [2]:
jgi_summarize_bam_contig_depths -h
jgi_summarize_bam_contig_depths 2.15 (Bioconda) 2020-01-04T21:10:40
Usage: jgi_summarize_bam_contig_depths <options> sortedBam1 [ sortedBam2 ...]
where options include:
	--outputDepth       arg  The file to put the contig by bam depth matrix (default: STDOUT)
	--percentIdentity   arg  The minimum end-to-end % identity of qualifying reads (default: 97)
	--pairedContigs     arg  The file to output the sparse matrix of contigs which paired reads span (default: none)
	--unmappedFastq     arg  The prefix to output unmapped reads from each bam file suffixed by 'bamfile.bam.fastq.gz'
	--noIntraDepthVariance   Do not include variance from mean depth along the contig
	--showDepth              Output a .depth file per bam for each contig base
	--minMapQual        arg  The minimum mapping quality necessary to count the read as mapped (default: 0)
	--weightMapQual     arg  Weight per-base depth based on the MQ of the read (i.e uniqueness) (default: 0.0 (disabled))
	--includeEdgeBases       When calculating depth & variance, include the 1-readlength edges (off by default)
	--maxEdgeBases           When calculating depth & variance, and not --includeEdgeBases, the maximum length (default:75)
	--referenceFasta    arg  The reference file.  (It must be the same fasta that bams used)

Options that require a --referenceFasta
	--outputGC          arg  The file to print the gc coverage histogram
	--gcWindow          arg  The sliding window size for GC calculations
	--outputReadStats   arg  The file to print the per read statistics
	--outputKmers       arg  The file to print the perfect kmer counts

Options to control shredding contigs that are under represented by the reads
	--shredLength       arg  The maximum length of the shreds
	--shredDepth        arg  The depth to generate overlapping shreds
	--minContigLength   arg  The mimimum length of contig to include for mapping and shredding
	--minContigDepth    arg  The minimum depth along contig at which to break the contig

Find the usage line first, then locate the --outputDepth option. Notice that the help page tells you to supply an arg(ument): the path where the depth output file should be stored. Specify a path to a file that this script will create. Something like this:

./script --outputDepth /path/to/depth_matrix.tab

Remember that you can use bash to point to multiple files with a "glob", or "asterisk". A glob looks like directory_name/* and matches all files inside that directory.
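As a small sketch of how globbing works (the glob_demo directory and file names below are made up for illustration): the shell expands the glob into the list of matching file names before the command even runs, so echo simply prints that expansion.

```shell
# Hypothetical demo directory and files, created only for this illustration
mkdir -p glob_demo
touch glob_demo/L1.sorted.bam glob_demo/L2.sorted.bam glob_demo/readme.txt

# The shell expands the glob (in sorted order) before running echo;
# readme.txt is excluded because it does not match *.sorted.bam
echo glob_demo/*.sorted.bam
# prints: glob_demo/L1.sorted.bam glob_demo/L2.sorted.bam
```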

[DO:] Try using the * with ls first. List all sorted bam files you created.

In [3]:
ls ./data/sorted/*.sorted.bam
./data/sorted/L1.sorted.bam  ./data/sorted/P1.sorted.bam
./data/sorted/L2.sorted.bam  ./data/sorted/P2.sorted.bam
./data/sorted/L3.sorted.bam  ./data/sorted/P3.sorted.bam

[DO:] Run the script jgi_summarize_bam_contig_depths to calculate the average depth per contig over all six samples.

In [5]:
jgi_summarize_bam_contig_depths --outputDepth ./data/depth_matrix.tab ./data/sorted/*.sorted.bam
Output depth matrix to ./data/depth_matrix.tab
jgi_summarize_bam_contig_depths 2.15 (Bioconda) 2020-01-04T21:10:40
Output matrix to ./data/depth_matrix.tab
0: Opening bam: ./data/sorted/L1.sorted.bam
1: Opening bam: ./data/sorted/L2.sorted.bam
2: Opening bam: ./data/sorted/L3.sorted.bam
3: Opening bam: ./data/sorted/P1.sorted.bam
1: Opening bam: ./data/sorted/P3.sorted.bam
0: Opening bam: ./data/sorted/P2.sorted.bam
Processing bam files
Thread 1 finished: L2.sorted.bam with 2017636 reads and 1591478 readsWellMapped
Thread 2 finished: L3.sorted.bam with 2019434 reads and 1574839 readsWellMapped
Thread 0 finished: L1.sorted.bam with 2016438 reads and 1580618 readsWellMapped
Thread 3 finished: P1.sorted.bam with 2024982 reads and 1102782 readsWellMapped
Thread 1 finished: P3.sorted.bam with 2023328 reads and 1151952 readsWellMapped
Thread 0 finished: P2.sorted.bam with 2020648 reads and 1194925 readsWellMapped
Creating depth matrix file: ./data/depth_matrix.tab
Closing most bam files
Closing last bam file
Finished

The output of this process is the input for MetaBAT in the next step. After the jgi script finishes, check that the file contains a table. If so, remove all bam files; we don't need them anymore.

[DO:] Check if your depth matrix contains data:

In [6]:
head data/depth_matrix.tab
contigName	contigLen	totalAvgDepth	L1.sorted.bam	L1.sorted.bam-var	L2.sorted.bam	L2.sorted.bam-var	L3.sorted.bam	L3.sorted.bam-var	P1.sorted.bam	P1.sorted.bam-var	P2.sorted.bam	P2.sorted.bam-var	P3.sorted.bam	P3.sorted.bam-var
NODE_1_length_1935275_cov_24.6805_ID_23901540	1.93528e+06	3.81973	0.384547	0.40742	0.559138	0.59541	0.264181	0.276061	1.40386	1.48422	1.13602	1.17447	0.0719824	0.0750967
NODE_2_length_1021023_cov_23.722_ID_28578184	1.02102e+06	3.79263	0.361356	0.370018	0.559212	0.616769	0.260588	0.274248	1.37231	1.4455	1.16247	1.26335	0.0766912	0.0846405
NODE_3_length_872793_cov_26.0458_ID_35712810	872793	4.08657	0.397981	0.416854	0.597986	0.629371	0.276378	0.292554	1.51594	1.60846	1.21	1.22886	0.0882858	0.0974476
NODE_4_length_853167_cov_25.0814_ID_23902252	853167	3.95794	0.384594	0.395542	0.59181	0.659878	0.268649	0.282336	1.45993	1.50359	1.1704	1.21212	0.0825517	0.0909625
NODE_5_length_723368_cov_22.0639_ID_32380821	723368	3.54784	0.576476	0.602873	0.49381	0.535239	0.308742	0.321746	0.65946	0.690835	1.42751	1.54562	0.0818356	0.090467
NODE_6_length_592196_cov_27.8287_ID_32359880	592196	4.46972	1.23031	1.31669	0.910303	0.964396	1.20794	1.25681	0.011945	0.014408	0.355303	0.361317	0.753918	0.804243
NODE_7_length_571573_cov_24.8726_ID_32329659	571573	3.87899	0.88336	0.931592	0.851431	0.898885	0.888109	0.960841	0.450843	0.484171	0.183012	0.183082	0.622238	0.673206
NODE_8_length_486458_cov_22.6157_ID_32183008	486458	3.49845	0.569483	0.601542	0.486603	0.500474	0.316542	0.325059	0.640236	0.666413	1.39795	1.41065	0.087642	0.0974081
NODE_9_length_472787_cov_23.2794_ID_23901976	472787	3.66833	0.343202	0.345515	0.584793	0.636413	0.255867	0.271952	1.33483	1.39462	1.07996	1.14636	0.0696793	0.0723315

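Beyond eyeballing the first rows with head, a quick structural check is possible: every row of a well-formed TAB-delimited matrix has the same number of fields. The demo_matrix.tab file below is a tiny made-up stand-in so the check can be demonstrated; run the awk line on your real data/depth_matrix.tab instead.

```shell
# Made-up stand-in file; substitute data/depth_matrix.tab in practice
printf 'contigName\tcontigLen\ttotalAvgDepth\nNODE_1\t1000\t3.82\nNODE_2\t900\t3.79\n' > demo_matrix.tab

# Print the distinct per-row field counts; a consistent table yields one value
awk -F'\t' '{print NF}' demo_matrix.tab | sort -u
# prints: 3
```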
[DO:] When you're sure, remove the sorted bams:

In [7]:
rm -rf ./data/sorted # double- and triple-check that your depth matrix is OK before removing the sorted bam files.

Depth matrix visualisation

Now that you have your depth matrix, let's take a moment to reflect on what this matrix contains and how it helps in binning the microbial contigs. For this part, we will visualise the depth_matrix file in Excel (or a similar spreadsheet editor) on your computer.

[DO:] Follow these steps:

  1. Download the depth_matrix to your personal computer.
  2. The depth matrix is a big table in which columns are delimited by TABs. Open your data in Excel and make sure all data is displayed as columns.
  3. [Q:] Interpret the table:
    1. What do the columns represent?
    2. What do the rows represent?
  4. [A:]
    1. The columns represent the contig depth per sample, and the variance of that depth for a particular contig in a particular sample.
    2. The rows represent all the contigs in the assembly.

[DO:]

  1. For clarity, remove all columns except those that display the depth data.
  2. Check that you have one column per sample.
  3. Find the option for conditional formatting, which fills the cells with colour depending on their content.
  4. Colour all cells in the sheet according to a three-colour gradient.
  5. [Q:] Interpret the table:
    1. Can you identify two rows with a similar colour pattern?
    2. What does it mean if two rows have a similar colour pattern?
  6. [A:]
    1. Yes, such rows stand out easily.
    2. If two rows (contigs) have a similar colour pattern over the six samples, the contigs are equally abundant in all of these samples. They are likely part of the same genome.
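The spreadsheet clean-up above can also be sketched on the command line. In the real matrix the per-sample depths sit in columns 4, 6, 8, 10, 12 and 14 (each immediately followed by its -var column), and cut keeps only the columns you name. The two-sample demo_depth.tab below is made up for illustration.

```shell
# Made-up two-sample stand-in; on the real file you would run:
#   cut -f 1,4,6,8,10,12,14 data/depth_matrix.tab
printf 'contigName\tcontigLen\ttotalAvgDepth\tL1\tL1-var\tP1\tP1-var\nNODE_1\t1000\t3.82\t0.38\t0.41\t1.40\t1.48\n' > demo_depth.tab

# Keep the contig name plus the two depth columns, dropping the -var columns
cut -f 1,4,6 demo_depth.tab
```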

Now move on to binning part 2!