Checking completeness and contamination using CheckM

Now that we have bins made with metabat2, we can check their quality, that is, their completeness and contamination. For this we will use CheckM, a suite of tools for assessing the quality of bacterial genome assemblies/bins. It estimates genome completeness and contamination using Single Copy Marker Genes (SCMGs) that are specific to a phylogenetic lineage.

As you will be able to see in the checkm help pages, checkm has a workflow (lineage_wf) that will run all necessary steps to assess bin quality.

Lineage_wf (lineage-specific workflow) steps:

  • The tree command places genome bins into a reference genome tree.
  • The lineage_set command creates a marker file indicating lineage-specific marker sets suitable for evaluating each individual bin with the most appropriate reference set of markers.
  • This marker file is passed to the analyze command to identify marker genes and estimate the completeness and contamination of each genome bin.
  • Finally, the qa command can be used to produce different tables summarizing the quality of each genome bin.
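The four steps above can be sketched as individual commands. This is a minimal sketch that only prints the commands instead of running them. The directory names (bins, checkm_lineage) and the marker file name (lineage.ms) are hypothetical placeholders, and the argument order is an assumption that should be checked against the help page of each command (for example checkm tree -h).

```shell
# Print (do not run) the lineage_wf steps as separate CheckM commands.
# All names passed in or derived here are hypothetical placeholders.
lineage_wf_steps() {
  bins_dir=$1                      # directory holding the bin FASTA files
  out_dir=$2                       # directory CheckM writes its results to
  markers="$out_dir/lineage.ms"    # assumed name for the marker file

  echo "checkm tree $bins_dir $out_dir"                # place bins in the reference tree
  echo "checkm lineage_set $out_dir $markers"          # pick lineage-specific marker sets
  echo "checkm analyze $markers $bins_dir $out_dir"    # identify marker genes in each bin
  echo "checkm qa $markers $out_dir"                   # summarize completeness/contamination
}

lineage_wf_steps bins checkm_lineage
```

Printing the commands first makes it easy to see how the output of one step (the marker file) feeds into the next. If your bin files do not use the file extension CheckM expects, the help pages list an option to set it.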

Unfortunately, the 'tree' part of this workflow is too memory intensive: it needs about 32 GB of RAM. So we will cheat a bit. Instead of the lineage_wf, we will use the taxonomy_wf. The taxonomy_wf does not determine the lineage of a bin, but checks SCMGs for a lineage that you provide on the command line. Hence, we don't load the full tree to find the most appropriate marker set for each bin. Instead, we assume all bins are bacteria (a reasonable assumption in this case) and don't look any deeper than that.

In [ ]:
checkm taxonomy_wf -h

The checkm manual may seem somewhat intimidating. However, remember that arguments shown in square brackets, like [optional argument], are optional. Those without brackets are mandatory.

[DO:] Run the checkm taxonomy_wf on the bins you created:

In [ ]:
checkm taxonomy_wf

If CheckM doesn't work properly, you can see an example output here.

[Q:] What can you say about the binning quality?

[A:]

A CheckM output of the full lineage_wf is available online here.

[Q:] Is the taxonomy of the bins a surprise given the nature of the sample?

[A:]

Bonus

You can create an extended checkm table with more information.

  1. Read the CheckM manual, and find out how to do this.
  2. Save the table in tab-delimited format, so you can download it and open it in Excel/Google Sheets/LibreOffice.
  3. Choose what information you find valuable and discard the rest.
  4. Add the mean depth +/- SEM (standard error of the mean) of each bin, per sample type.
    • Sample types are L (Leaf) and P (Plant)
  5. Congratulations, you got your first table for a manuscript/thesis about your metagenome analysis!

In [ ]:
checkm qa --help
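
For step 4, the mean depth and SEM can be computed on the command line. The sketch below assumes a hypothetical tab-delimited table with a header line and three columns (bin, sample_type, depth); your own depth table will likely need reshaping into this layout first. The SEM is taken as the sample standard deviation divided by the square root of the number of samples, and is reported as NA when a group has only one sample. The numbers in the demo are made up purely for illustration.

```shell
# Summarize depth per bin and sample type as: bin, sample_type, n, mean, SEM.
# Assumes a tab-delimited input with a header and columns bin, sample_type, depth.
depth_summary() {
  LC_ALL=C awk -F'\t' '
    NR == 1 { next }                  # skip the header line
    {
      key = $1 "\t" $2                # group by bin and sample type
      n[key]++
      sum[key] += $3
      sumsq[key] += $3 * $3
    }
    END {
      for (key in n) {
        mean = sum[key] / n[key]
        if (n[key] > 1) {
          # sample variance (n - 1 in the denominator)
          var = (sumsq[key] - n[key] * mean * mean) / (n[key] - 1)
          if (var < 0) var = 0        # guard against rounding error
          printf "%s\t%d\t%.2f\t%.2f\n", key, n[key], mean, sqrt(var / n[key])
        } else {
          printf "%s\t%d\t%.2f\tNA\n", key, n[key], mean
        }
      }
    }
  ' "$1" | LC_ALL=C sort
}

# Demo with made-up numbers (not real data).
demo=$(mktemp)
printf 'bin\tsample_type\tdepth\nbin1\tL\t10\nbin1\tL\t12\nbin1\tL\t14\nbin1\tP\t5\n' > "$demo"
depth_summary "$demo"
rm -f "$demo"
```

The resulting rows can be pasted next to the matching bins in the CheckM table, or joined on the bin name in your spreadsheet program.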

Bonus 2

Did you try varying the binning parameters in the previous notebook? If so, run those bins through CheckM as well. Remember to create a clearly named, separate output directory for each run.

Are the bins of similar quality?