Now that we have bins made with metabat2, we can check their quality, that is, their completeness and contamination. For this, we will use CheckM.
CheckM is a suite of tools for assessing the quality of bacterial genome assemblies and bins.
It estimates genome completeness and contamination by using Single Copy Marker Genes (SCMGs) of a specific phylogenetic lineage.
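As a rough illustration of how marker gene counts turn into these two numbers, here is a deliberately simplified sketch. The marker names and copy numbers are made up, and the arithmetic ignores the collocated marker sets that CheckM actually uses, so treat it as an intuition aid rather than CheckM's real calculation.

```shell
# Toy input: marker gene name and number of copies found in one bin (made-up values).
marker_counts='markerA 1
markerB 1
markerC 0
markerD 2
markerE 1'

# Simplified estimates (not CheckM's actual formula):
#   completeness  = markers found at least once / total markers
#   contamination = extra copies beyond the first / total markers
estimate=$(printf '%s\n' "$marker_counts" | awk '
  { total++; if ($2 >= 1) found++; if ($2 > 1) extra += $2 - 1 }
  END { printf "completeness=%.1f%% contamination=%.1f%%", 100 * found / total, 100 * extra / total }')
echo "$estimate"
```

In this toy bin, one marker is missing (lowering completeness) and one is present twice (raising contamination).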
As you will be able to see in the checkm help pages, checkm has a workflow (lineage_wf) that will run all necessary steps to assess bin quality.
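The lineage_wf (lineage-specific workflow) runs four steps in order: tree (place the bins in a reference genome tree), lineage_set (infer lineage-specific marker sets), analyze (identify marker genes in the bins) and qa (assess completeness and contamination).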
Unfortunately, the 'tree' step of this workflow is too memory intensive: it needs about 32 GB of RAM (!).
So we will cheat a bit.
Instead of the lineage_wf, we will use the taxonomy_wf.
The taxonomy_wf does not determine the lineage of a bin, but checks SCMGs for a lineage that you provide on the command line.
Hence, we don't load the full tree to find the most appropriate marker set, but assume all bins are bacteria (reasonable assumption in this case) and don't look any deeper than that.
[DO:] Read the help page of checkm taxonomy_wf:
checkm taxonomy_wf -h
The checkm manual may seem somewhat intimidating.
However, remember that the options in square brackets are optional: [optional argument].
Those without brackets are mandatory.
[DO:] Run the checkm taxonomy_wf on the bins you created:
checkm taxonomy_wf domain Bacteria data/bins/ data/checkm_taxonomy -x fa
# Extra: create a summary table
checkm qa data/checkm_taxonomy/Bacteria.ms data/checkm_taxonomy -f ./data/checkm_taxonomy/checkm_taxonomy_summary.txt
# Extra: read the help page of the lineage_wf
checkm lineage_wf --help
# Extra: run the lineage_wf requiring 30+GB of RAM
checkm lineage_wf ./data/bins ./data/checkm_lineage --pplacer_threads 1 -t 4 -x fa
# Extra: create a summary table of the lineage_wf
checkm qa data/checkm_lineage/lineage.ms data/checkm_lineage -f ./data/checkm_lineage/checkm_lineage_summary.txt
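Once a summary table exists, it can help to list only the bins that pass a quality cutoff. The sketch below assumes a tab-separated table, which `checkm qa` writes when given `--tab_table`, with columns named `Bin Id`, `Completeness` and `Contamination`. The helper name `filter_bins`, the `.tsv` file name and the 90/5 thresholds are choices made for this example, not part of CheckM.

```shell
# Print the IDs of bins that pass the given thresholds.
# Usage: filter_bins TABLE MIN_COMPLETENESS MAX_CONTAMINATION
filter_bins() {
  awk -F '\t' -v min_comp="$2" -v max_cont="$3" '
    # Header line: remember the position of every column by its name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data lines: keep bins that are complete enough and clean enough.
    $(col["Completeness"]) + 0 >= min_comp && $(col["Contamination"]) + 0 <= max_cont {
      print $(col["Bin Id"])
    }
  ' "$1"
}

# Only run the CheckM step when checkm and the earlier results are available.
if command -v checkm >/dev/null 2>&1 && [ -f data/checkm_taxonomy/Bacteria.ms ]; then
  checkm qa data/checkm_taxonomy/Bacteria.ms data/checkm_taxonomy \
    --tab_table -f data/checkm_taxonomy/checkm_taxonomy_summary.tsv
  filter_bins data/checkm_taxonomy/checkm_taxonomy_summary.tsv 90 5
fi
```

Looking columns up by header name rather than by position keeps the helper working if the table gains or loses columns.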
If CheckM doesn't work properly, you can see an example output here
[Q:] What can you say about the binning quality?
[A:] Six bins have a completeness of more than 94%, which is a great score! Contamination is also low in all of these bins.
One bin does have 50% strain heterogeneity. Roughly, this means that about half of the marker genes found in more than one copy are nearly identical to each other, so the extra copies probably come from closely related organisms rather than from unrelated contaminants. This bin might be a mix of two very similar strains of the same lineage of bacteria.
A CheckM output of the full lineage_wf is available online here
[Q:] Is the taxonomy of the bins a surprise given the nature of the sample?
[A:] No. We expected one cyanobacterium, and found a second. Cyanobacteria are often found in water, so this may be a cyanobacterium living outside this floating plant.
Burkholderiales (Betaproteobacteria) and Rhizobiales (Alphaproteobacteria) were already thought to be associated with Azolla, based on the paper linked in the introduction.
You can create an extended CheckM table with more information; the help page lists the available output formats:
checkm qa --help
Did you try to vary binning parameters in the previous notebook? If so, run these bins through CheckM as well. Remember to create clear and separate output directories.
Are the bins of similar quality?
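One way to keep the outputs of several binning runs apart is to derive each output directory from the name of the bins directory. The sketch below assumes a hypothetical layout with one directory per binning run named `data/bins_<something>/`. The helper `checkm_out_dir` and its naming scheme are made up for this example.

```shell
# Derive a separate output directory name from a bins directory
# (the naming scheme is an assumption for this example).
checkm_out_dir() {
  printf 'data/checkm_taxonomy_%s\n' "$(basename "$1")"
}

# Hypothetical layout: one directory of bins per binning run, e.g. data/bins_sensitive/.
for bins_dir in data/bins_*/; do
  [ -d "$bins_dir" ] || continue   # skip when the glob matched nothing
  out_dir=$(checkm_out_dir "$bins_dir")
  if command -v checkm >/dev/null 2>&1; then
    checkm taxonomy_wf domain Bacteria "$bins_dir" "$out_dir" -x fa
  else
    echo "checkm not found; would write results for $bins_dir to $out_dir"
  fi
done
```

Each run then gets its own directory, so the resulting summary tables can be compared side by side.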