Checking completeness and contamination using CheckM

Now that we have bins made with metabat2 we can check them for contamination and completeness (quality); for this, we will use CheckM. CheckM is a suite of tools for assessing the quality of bacterial genomes assemblies/bins. It estimates genome completeness and contamination by using Single Copy Marker Genes (SCMGs) of a specific phylogenetic lineage.

As you will be able to see in the checkm help pages, checkm has a workflow (lineage_wf) that will run all necessary steps to assess bin quality.

Lineage_wf (lineage-specific workflow) steps:

  • The tree command places genome bins into a reference genome tree.
  • The lineage_set command creates a marker file indicating lineage-specific marker sets suitable for evaluating each individual bin with the most appropriate reference set of markers.
  • This marker file is passed to the analyze command to identify marker genes and estimate the completeness and contamination of each genome bin.
  • Finally, the qa command can be used to produce different tables summarizing the quality of each genome bin.

Unfortunately, the 'tree' part of this workflow is too memory intensive (about 32Gbytes of RAM (!) ). So we will cheat a bit. Instead of the lineage_wf, we will use the taxonomy_wf. The taxonomy_wf does not determine the lineage of a bin, but checks SCMGs for a lineage that you provide in the commandline. Hence, we don't load the full tree to find the most appropriate marker set, but assume all bins are bacteria (reasonable assumption in this case) and don't look any deeper than that.

[DO:] Read the help page of checkm taxonomy_wf:

In [1]:
checkm taxonomy_wf -h
usage: checkm taxonomy_wf [-h] [--ali] [--nt] [-g] [--individual_markers]
                          [--skip_adj_correction]
                          [--skip_pseudogene_correction]
                          [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE]
                          [--ignore_thresholds] [-e E_VALUE] [-l LENGTH]
                          [-c COVERAGE_FILE] [-f FILE] [--tab_table]
                          [-x EXTENSION] [-t THREADS] [-q] [--tmpdir TMPDIR]
                          {life,domain,phylum,class,order,family,genus,species}
                          taxon bin_dir output_dir

Runs taxon_set, analyze, qa

positional arguments:
  {life,domain,phylum,class,order,family,genus,species}
                        taxonomic rank
  taxon                 taxon of interest
  bin_dir               directory containing bins (fasta format)
  output_dir            directory to write output files

optional arguments:
  -h, --help            show this help message and exit
  --ali                 generate HMMER alignment file for each bin
  --nt                  generate nucleotide gene sequences for each bin
  -g, --genes           bins contain genes as amino acids instead of nucleotide contigs
  --individual_markers  treat marker as independent (i.e., ignore co-located set structure)
  --skip_adj_correction
                        do not exclude adjacent marker genes when estimating contamination
  --skip_pseudogene_correction
                        skip identification and filtering of pseudogenes
  --aai_strain AAI_STRAIN
                        AAI threshold used to identify strain heterogeneity (default: 0.9)
  -a, --alignment_file ALIGNMENT_FILE
                        produce file showing alignment of multi-copy genes and their AAI identity
  --ignore_thresholds   ignore model-specific score thresholds
  -e, --e_value E_VALUE
                        e-value cut off (default: 1e-10)
  -l, --length LENGTH   percent overlap between target and query (default: 0.7)
  -c, --coverage_file COVERAGE_FILE
                        file containing coverage of each sequence; coverage information added to table type 2 (see coverage command)
  -f, --file FILE       print results to file (default: stdout)
  --tab_table           print tab-separated values table
  -x, --extension EXTENSION
                        extension of bins (other files in directory are ignored) (default: fna)
  -t, --threads THREADS
                        number of threads (default: 1)
  -q, --quiet           suppress console output
  --tmpdir TMPDIR       specify an alternative directory for temporary files

Example: checkm taxonomy_wf domain Bacteria ./bins ./output

The checkm manual may seem somewhat intimidating. However, remember that the options in square brackets are optional [optional argument]. Those without brackets are mandatory.

[DO:] Run the checkm taxonomy_wf on the bins you created:

In [3]:
checkm taxonomy_wf domain Bacteria data/bins/ data/checkm_taxonomy -x fa
[2022-03-22 15:58:15] INFO: CheckM v1.1.3
[2022-03-22 15:58:15] INFO: checkm taxonomy_wf domain Bacteria data/bins/ data/checkm_taxonomy -x fa -t 4
[2022-03-22 15:58:15] INFO: [CheckM - taxon_set] Generate taxonomic-specific marker set.
[2022-03-22 15:58:18] INFO: Marker set for Bacteria contains 104 marker genes arranged in 58 sets.
[2022-03-22 15:58:18] INFO: Marker set inferred from 5449 reference genomes.
[2022-03-22 15:58:18] INFO: Marker set written to: data/checkm_taxonomy/Bacteria.ms
[2022-03-22 15:58:18] INFO: { Current stage: 0:00:03.642 || Total: 0:00:03.642 }
[2022-03-22 15:58:18] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2022-03-22 15:58:19] INFO: Identifying marker genes in 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.
[2022-03-22 16:00:30] INFO: Saving HMM info to file.
[2022-03-22 16:00:30] INFO: { Current stage: 0:02:12.072 || Total: 0:02:15.715 }
[2022-03-22 16:00:30] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.
[2022-03-22 16:00:31] INFO: Aligning marker genes with multiple hits in a single bin:
    Finished processing 6 of 6 (100.00%) bins.
[2022-03-22 16:00:32] INFO: { Current stage: 0:00:01.228 || Total: 0:02:16.944 }
[2022-03-22 16:00:32] INFO: Calculating genome statistics for 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.
[2022-03-22 16:00:32] INFO: { Current stage: 0:00:00.341 || Total: 0:02:17.285 }
[2022-03-22 16:00:32] INFO: [CheckM - qa] Tabulating genome statistics.
[2022-03-22 16:00:32] INFO: Calculating AAI between multi-copy marker genes.
[2022-03-22 16:00:32] INFO: Reading HMM info from file.
[2022-03-22 16:00:32] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.
-----------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id   Marker lineage   # genomes   # markers   # marker sets   0     1    2   3   4   5+   Completeness   Contamination   Strain heterogeneity  
-----------------------------------------------------------------------------------------------------------------------------------------------------
  bin.6       Bacteria         5449        104            58        0    104   0   0   0   0       100.00           0.00               0.00          
  bin.4       Bacteria         5449        104            58        1    103   0   0   0   0       98.28            0.00               0.00          
  bin.2       Bacteria         5449        104            58        2    102   0   0   0   0       96.55            0.00               0.00          
  bin.3       Bacteria         5449        104            58        4    100   0   0   0   0       94.83            0.00               0.00          
  bin.1       Bacteria         5449        104            58        18    85   1   0   0   0       91.38            1.72              100.00         
  bin.5       Bacteria         5449        104            58        94    10   0   0   0   0       15.52            0.00               0.00          
-----------------------------------------------------------------------------------------------------------------------------------------------------
[2022-03-22 16:00:33] INFO: { Current stage: 0:00:01.033 || Total: 0:02:18.319 }
In [7]:
# Extra: create a summary table
checkm qa data/checkm_taxonomy/Bacteria.ms data/checkm_taxonomy -f ./data/checkm_taxonomy/checkm_taxonomy_summary.txt
(metagenomics_practical) [2022-03-23 16:10:09] INFO: CheckM v1.1.3
[2022-03-23 16:10:09] INFO: checkm qa data/checkm_taxonomy/Bacteria.ms data/checkm_taxonomy -f ./data/checkm_taxonomy/checkm_taxonomy_summary.txt
[2022-03-23 16:10:09] INFO: [CheckM - qa] Tabulating genome statistics.
[2022-03-23 16:10:09] INFO: Calculating AAI between multi-copy marker genes.
[2022-03-23 16:10:09] INFO: Reading HMM info from file.
[2022-03-23 16:10:09] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.
[2022-03-23 16:10:10] INFO: QA information written to: ./data/checkm_taxonomy/checkm_taxonomy_summary.txt
[2022-03-23 16:10:10] INFO: { Current stage: 0:00:01.325 || Total: 0:00:01.325 }
(metagenomics_practical) 

In [5]:
# Extra: run the lineage_wf requiring 30+GB of RAM
checkm lineage_wf --help
usage: checkm lineage_wf [-h] [-r] [--ali] [--nt] [-g] [-u UNIQUE] [-m MULTI]
                         [--force_domain] [--no_refinement]
                         [--individual_markers] [--skip_adj_correction]
                         [--skip_pseudogene_correction]
                         [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE]
                         [--ignore_thresholds] [-e E_VALUE] [-l LENGTH]
                         [-f FILE] [--tab_table] [-x EXTENSION] [-t THREADS]
                         [--pplacer_threads PPLACER_THREADS] [-q]
                         [--tmpdir TMPDIR]
                         bin_dir output_dir

Runs tree, lineage_set, analyze, qa

positional arguments:
  bin_dir               directory containing bins (fasta format)
  output_dir            directory to write output files

optional arguments:
  -h, --help            show this help message and exit
  -r, --reduced_tree    use reduced tree (requires <16GB of memory) for determining lineage of each bin
  --ali                 generate HMMER alignment file for each bin
  --nt                  generate nucleotide gene sequences for each bin
  -g, --genes           bins contain genes as amino acids instead of nucleotide contigs
  -u, --unique UNIQUE   minimum number of unique phylogenetic markers required to use lineage-specific marker set (default: 10)
  -m, --multi MULTI     maximum number of multi-copy phylogenetic markers before defaulting to domain-level marker set (default: 10)
  --force_domain        use domain-level sets for all bins
  --no_refinement       do not perform lineage-specific marker set refinement
  --individual_markers  treat marker as independent (i.e., ignore co-located set structure)
  --skip_adj_correction
                        do not exclude adjacent marker genes when estimating contamination
  --skip_pseudogene_correction
                        skip identification and filtering of pseudogenes
  --aai_strain AAI_STRAIN
                        AAI threshold used to identify strain heterogeneity (default: 0.9)
  -a, --alignment_file ALIGNMENT_FILE
                        produce file showing alignment of multi-copy genes and their AAI identity
  --ignore_thresholds   ignore model-specific score thresholds
  -e, --e_value E_VALUE
                        e-value cut off (default: 1e-10)
  -l, --length LENGTH   percent overlap between target and query (default: 0.7)
  -f, --file FILE       print results to file (default: stdout)
  --tab_table           print tab-separated values table
  -x, --extension EXTENSION
                        extension of bins (other files in directory are ignored) (default: fna)
  -t, --threads THREADS
                        number of threads (default: 1)
  --pplacer_threads PPLACER_THREADS
                        number of threads used by pplacer (memory usage increases linearly with additional threads) (default: 1)
  -q, --quiet           suppress console output
  --tmpdir TMPDIR       specify an alternative directory for temporary files

Example: checkm lineage_wf ./bins ./output
In [6]:
# Extra: run the lineage_wf requiring 30+GB of RAM
checkm lineage_wf ./data/bins ./data/checkm_lineage --pplacer_threads 1 -t 4 -x fa
*******************************************************************************
 [CheckM - tree] Placing bins in reference genome tree.
*******************************************************************************

  Identifying marker genes in 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.
  Saving HMM info to file.

  Calculating genome statistics for 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.

  Extracting marker genes to align.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.
  Extracting 43 HMMs with 4 threads:
    Finished extracting 43 of 43 (100.00%) HMMs.
  Aligning 43 marker genes with 4 threads:
    Finished aligning 43 of 43 (100.00%) marker genes.

  Reading marker alignment files.
  Concatenating alignments.
  Placing 6 bins into the genome tree with pplacer (be patient).

  { Current stage: 0:05:57.235 || Total: 0:05:57.235 }

*******************************************************************************
 [CheckM - lineage_set] Inferring lineage-specific marker sets.
*******************************************************************************

  Reading HMM info from file.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.

  Determining marker sets for each genome bin.
    Finished processing 6 of 6 (100.00%) bins (current: bin.6).

  Marker set written to: ./data/checkm_lineage/lineage.ms

  { Current stage: 0:00:02.867 || Total: 0:06:00.103 }

*******************************************************************************
 [CheckM - analyze] Identifying marker genes in bins.
*******************************************************************************

  Identifying marker genes in 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.
  Saving HMM info to file.

  { Current stage: 0:03:12.334 || Total: 0:09:12.438 }

  Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.
  Aligning marker genes with multiple hits in a single bin:
    Finished processing 6 of 6 (100.00%) bins.

  { Current stage: 0:00:03.479 || Total: 0:09:15.918 }

  Calculating genome statistics for 6 bins with 4 threads:
    Finished processing 6 of 6 (100.00%) bins.

  { Current stage: 0:00:00.367 || Total: 0:09:16.285 }

*******************************************************************************
 [CheckM - qa] Tabulating genome statistics.
*******************************************************************************

  Calculating AAI between multi-copy marker genes.

  Reading HMM info from file.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id            Marker lineage            # genomes   # markers   # marker sets   0     1    2   3   4   5+   Completeness   Contamination   Strain heterogeneity  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  bin.4    c__Alphaproteobacteria (UID3305)      564         348           229        1    346   1   0   0   0       99.56            0.44               0.00          
  bin.6      o__Burkholderiales (UID4000)        193         427           214        3    420   4   0   0   0       99.18            0.29               0.00          
  bin.1      o__Burkholderiales (UID4000)        193         427           214        18   403   6   0   0   0       98.91            1.03              50.00          
  bin.2       p__Cyanobacteria (UID2189)          82         579           450        18   561   0   0   0   0       96.56            0.00               0.00          
  bin.3       p__Cyanobacteria (UID2189)          82         579           450        31   547   1   0   0   0       94.44            0.22               0.00          
  bin.5          k__Bacteria (UID203)            5449        104            58        94    10   0   0   0   0       15.52            0.00               0.00          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

  { Current stage: 0:00:03.862 || Total: 0:09:20.147 }
In [8]:
# Extra: create a summary table of the linage_wf
checkm qa data/checkm_lineage/lineage.ms data/checkm_lineage -f ./data/checkm_lineage/checkm_lineage_summary.txt
(metagenomics_practical) [2022-03-23 16:10:58] INFO: CheckM v1.1.3
[2022-03-23 16:10:58] INFO: checkm qa data/checkm_lineage/lineage.ms data/checkm_lineage -f ./data/checkm_lineage/checkm_lineage_summary.txt
[2022-03-23 16:10:58] INFO: [CheckM - qa] Tabulating genome statistics.
[2022-03-23 16:10:58] INFO: Calculating AAI between multi-copy marker genes.
[2022-03-23 16:10:58] INFO: Reading HMM info from file.
[2022-03-23 16:10:58] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 6 of 6 (100.00%) bins.
[2022-03-23 16:11:01] INFO: QA information written to: ./data/checkm_lineage/checkm_lineage_summary.txt
[2022-03-23 16:11:01] INFO: { Current stage: 0:00:02.934 || Total: 0:00:02.934 }
(metagenomics_practical) 

If CheckM doesn't work propperly, you can see an example output here

[Q:] What can you say about the binning quality

[A:] Six bins have a completeness of more than 94%, this is a great score! Contamination is also low in all of these bins.

One bin does have 50% hetrogeneity, this indicates roughly that although all markers are found, and only found once, these markers are not always found in the 'sets' they are expected to. This bin might be a mix of two very similar strains of the same lineage of bacteria.

A CheckM output of the full lineage_wf is available online here

[Q:] Is the taxonomy of the bins a surprise given the nature of the sample?

[A:] No. We expected one cyanobacterium, and found a second. Cyanobacteria are often found in water, so this may be a cyanobacterium living outside this floating plant.

Burkholderiales and Rhizobiales (alphaproteobacteria) were thought to be associated with Azolla based on the paper linked in the introduction.

Bonus

You can create an extended checkm table with more information.

  1. Read the Checkm manual, and find out how to do this.
  2. Save the table in 'tab-delimited format, so you can download it and open it in Excel/google-sheets/libroffie.
  3. Choose what information you find valuable and discard the rest.
  4. Add the mean depth +/- SEM (Standard Error Mean) of each bin, per sample type.
    • Sample types are L (Leaf) and P (Plant)
  5. Congratulations, you got your first table for a manuscript/thesis about your metagenome analysis!
checkm qa --help

Bonus 2

Did you try to vary binning parameters in the previous notebook? If so, run these through Checkm as well. Remember to create clear and separate output directories.

Are the bins of similar quality?