This practical exercise teaches you to assemble, bin and annotate a metagenome of a plant: Azolla filiculoides. It was developed as part of the Introduction To Bioinformatics for Life-Sciences course at Utrecht University. A metagenome is the collection of genomes of micro-organisms associated with a specific organ or organism. Whether you are a plant biologist or a medical doctor, the techniques you learn in this practical apply to metagenomics on all kinds of organisms or organs. In this particular experiment, we focus on bacterial micro-organisms. Intermediate files, answers and extra information can be found on the practical home page
The fern genus Azolla is known for its endophytes: microorganisms that live inside the plant. These microorganisms live specifically inside a leaf cavity in every upper leaf lobe of the plant. One of these endophytes has been known for a long time: Nostoc azollae. They are best known for fixing N^2 from the air which they then provide to the plant as a nutrient. This symbiosis makes the plant independent from nitrogen (fertiliser) in its growth medium. The image included below illustrates the physiology of these leaf cavities.
Image taken from the Azolla foundation website: https://theazollafoundation.org/azolla/the-azolla-superorganism/
While the presence of N. azollae is well studied, the presence of other microbes in the Azolla leaf pocket is not. Here, we will analyse the metagenome of A. filiculoides plants, in search for these other "neglected" microbes. Hence, the open biological question we are answering today is "which species live inside these leaf cavities and what are the metabolic capacities encoded on their genomes?" Based on these genomes, you may be able to infer their function and think of hypotheses to test in further experiments.
Tip: If you stumble upon a step with a long calculation time, first read ahead to see what steps come next in the subsequent notebooks. Alternatively, you can google some information about the plant you are researching. Notice that this paper specifically describes the data that we will be using today. For a more general introduction into Azolla, perhaps have a look at this short movie (English) or this short movie (Dutch) recorded for Dutch local television.
A typical metagenomics workflow starts in the environment or the wet lab with DNA extraction from some biological sample. This is also true for this workflow. We have extracted DNA from Azolla filiculoides from the environment in two ways. Firstly, the whole Plant (P), secondly from the Leaf cavities (L). Both were collected in three biological replicates. Hence the samples we will work with are coded L1 L2 L3 and P1 P2 P3. The reasoning behind these two sampling types is that true endophytes are expected to be more abundant in leaf samples than in plant samples. So we may later use these different sampling types to infer if a microbial genome that we assembled originated from the leaf cavity, or from somewhere else in or on the plant.
Extracted DNA is processed in a sequencing library; it is prepared in a way that an Illumina sequencing_sequencing) machine can process it. The sequencing machine then provides us with FastQ files, you may have handled these before, perhaps not. We will have a look at FastQ files later in this practical. Note that often, Illumina machines sequence the two outer ends of a single DNA molecule. We call these paired-end sequences. For a quick illustration of how Illumina machines work, have a look at this video.
For publication, authors often are required to also upload the raw data on which their publication is based. For sequencing data, the NCBI Short Read Archive (SRA) and the EBI European Nucleotide Archive (ENA) serve this purpose.
[Q:] Find the paper linked above and find the sequencing data it refers to in the EBI ENA database: https://www.ebi.ac.uk/ena/browser/home
[A]
FastQ files contain the output of the Illumina sequencing machine: tens of millions of short DNA sequences. In this context, short is anything between 50 and 250 basepairs. These original short DNA sequences are further referred to as DNA reads, or just reads. FastQ files contain the unprocessed output of the machine, every read consists of four lines:
The quality string has the exact same length as the DNA read, every character in the quality string is a measure of quality corresponding to some number between 0 and 40. This measure of quality is called a phred score. The sequencing machine assigns this phred score based on "how sure it is" that this specific base is correct.
These are two reads encoded in a FastQ file.
@NS500813:28:H3M3VAFXX:1:11101:19270:1015 1:N:0:CGATGT
GNGGTGAAGAAATCAGCCATTCTAAACCACACTAATTTAGATTGCTCTCCAAGGA
+
A#AAAEEEEEEEEEEEEEA#EEEEEEEEEEEEEEEEEEEAEEEEE+/EEEEEEEEEEE
@NS500813:28:H3M3VAFXX:1:11101:24948:1015 1:N:0:CGATGT
ANATTAATTATACAATCACTAATTTACACTAATTTAGGCAGAGCATTTAGAAAGTGC
+
A#AAAAEEEEE6EEEEEEEEEEAEEEEE+/EEEEEEEAEEA#EEEEEEEEEEEEEEEE
FastQ files often contain some reads of poor quality, we definitively want to filter those out before starting our analyses. There are many ways available, for today's practical, this was done for you already with a tool called trimmomatic.
Before and after trimming, general Quality Control (QC( of a FastQ file is assessed with a tool called FastQC. The FastQC reports are rendered as html pages so we can view them in our web browsers. You can find the FastQC reports for this practical here:
[DO:] open these FastQC reports and check the quality
[Q:] Are there any "red flags"? Do you think these are a problem to continue with the experiment?
[A]
Finally, we chose here to filter the sequencing data for plant DNA since we are only interested in microbial DNA. The sequencing data was rid of any plant DNA by mapping (aligning) the reads to a reference plant genome, that of the host plant Azolla filiculoides. Only the sequencing data that did not map anything was kept for further analysis.
Assembly is the process of combining these many short DNA reads in longer contiguous strings of DNA: contigs. For the assembly process, we have used both sample types and all biological replicates as if they were one big sample. This yields the best assembly results. A metagenome assembly is different from a regular genome assembly in that it has DNA of multiple species in one sequencing file. Metagenome assembly was achieved with SPAdes.
Some more elaborate information of Assembly:
The process of assembly takes the millions of short reads in a fastq file, and processes these to thousands of longer nucleotide sequences. Typically the output is a fasta file with contiguous sequences called contigs (contigs.fasta) or a fasta file with multiple contigs scaffolded based on their relative position and orientation to each other called scaffolds, in a file scaffolds.fasta. Ideally, these scaffolds represent the genome of a single species in the metagenome. This is, however, extremely unlikely when no long-read sequencing method was used. In reality, the assembly process will often produce hundreds or thousands of longer scaffolds where a single species' genome is still represented by multiple sequences in the assembly fasta file. (Meta)Genome assembly is a complex process. If this has your interest I recommend watching this video and frankly the entire series that this video is a part of.
In biology, we typically store DNA sequences in fasta files.
Like a fastQ file, it is a text file with DNA (or protein) sequences.
Also fasta files can contain multiple sequences.
Fasta files are simpler than fastq files, they don't contain quality information.
That leaves just two types of lines: a sequence name (header) and the actual sequence.
The header line is preceded by a >
and the lines after containing the DNA sequence. The DNA sequence can continue until the next >
or until the end of the document. We may have a quick look at a fasta file:
>Sequenceheader
ACTCATGAGACTAGACA
The assembly was already done for you. With this practical, you'll find a fasta file with the scaffolds we assembled.
In case you're interested, the assembly was made with SPAdes using code like this
spades -t 12 \
-m 60 \
--pe1-1 data/reads/E1.R1.fastq.gz \
--pe1-2 data/reads/E1.R1.fastq.gz \
--pe2-1 data/reads/E2.R1.fastq.gz \
--pe2-2 data/reads/E2.R1.fastq.gz \
--pe3-1 data/reads/E3.R1.fastq.gz \
--pe3-2 data/reads/E3.R1.fastq.gz \
--pe4-1 data/reads/P1.R1.fastq.gz \
--pe4-2 data/reads/P1.R1.fastq.gz \
--pe5-1 data/reads/P2.R1.fastq.gz \
--pe5-2 data/reads/P2.R1.fastq.gz \
--pe6-1 data/reads/P3.R1.fastq.gz \
--pe6-2 data/reads/P3.R1.fastq.gz \
-o data/assembly/
From here on, you will take over the workflow. Each major step is listed below and shortly explained. All these steps are executed in separate notebooks. Often the last step has a computing time of a few minutes. During this time you may proceed to the next notebook to prepare the subsequent steps. To keep the practical feasible, we have provided you with a subset of the assembly and a subset of the reads. The complete workflow you will follow is seen on this sketch below. Whenever you are lost in the command-line, error messages and output tables, have a look again at this flowchart. In these bioinformatics workflows, it is essential to keep a feeling for where you are, where you are going and where you come from.
The programmes mentioned in the flowchart are preinstalled on this virtual machine you are currently working on:
bwa: http://bio-bwa.sourceforge.net/
samtools: http://samtools.sourceforge.net/
binningmetabat: https://bitbucket.org/berkeleylab/metabat/overview
checkm: https://github.com/Ecogenomics/CheckM/wiki/Installation#how-to-install-checkm
prokka: http://www.vicbioinformatics.com/software.prokka.shtml
Each of these steps is contained in a separate notebook:
m02
¶Before we will get our hands dirty with the 'real data', we will first do some exercises to learn how to work in a notebook like this then we will continue the metagenomics workflow. To keep all information central, I will continue here with the explanation of the workflow. The exercises will be available in another notebook after you have read the full introduction. (In the Introduction to Bioinformatics course, you have already done this exercise on day 1).
m03
& m04
¶After we have learned about bash and Jupyter notebooks, we will practice our bash skills by having a quick look at the data files and assessing the quality. Secondly, we'll use a bit of simple python code to plot the length distribution and abundances of the scaffolds in the original biological samples.
m05
¶After assembly of the scaffolds, we want to know how abundant each scaffold is in the different metagenomic samples.
To do this, we align the Illumina reads (FastQ files) to the scaffolds (fasta) of the assembly in a step called back mapping with a tool called BWA (Burrows-wheeler aligner).
BWA will provide us then with a table of which reads mapped on the metagenome assembly and their specific coördinates: a .bam
file.
m06
¶These tables in the .bam
files must be sorted by coordinate on the genome assembly.
This we achieve with a tool called samtools
.
Strictly speaking, this is still part of the back mapping procedure, but here it is a separate notebook.
m07
& m08
¶A single metagenome assembly contains scaffolds representing DNA of many species in the original sample, but without any information on which scaffold belongs to which species.
To tackle this issue, we will categorise each individual scaffold in a 'bin' which then ideally comprises one species in the metagenome.
In recent literature, these bins are often called Metagenome Assembled Genomes or MAGs.
Despite the name change, the procedure is still often called binning.
The binning procedure needs the bam files we will make in the 'back mapping' step.
The tool we use for binning is called metabat2
.
Metabat2 uses the sorted bamfiles to subdivide the scaffolds into bins.
This will produce six bins or MAGs in this particular experiment.
The binning process is again subdivided over two notebooks.
In part 1 (m07
), we take bam files, then calculate the depth of coverage for each biological sample on each scaffold in the assembly.
In part 2, we take the original scaffolds in fasta format and the depths calculated in part one to create the bins or MAGs.
At this stage, you have succesfully moved from millions of very short sequences (reads) with very little biological information to thousands of longer scaffolds with more biological information, to dozens of long sequences with a lot of biological information.
m09
¶One bin ideally represents one species.
To check whether this is the case, we can use the tool checkm
.
CheckM assesses the completeness and contamination of bins based on single-copy-marker-genes (collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage).
m10
¶Annotation is the process of adding functional information to parts of a DNA sequence, i.e. making a table of genes and locations of these genes. These are often stored in a GFF file, the genomic feature format. Once you have this gff file, and a fasta for your specific bin/microbial species, we will visualise it using KEGG. Again, for more details please google kegg. Here we will use it to load our gff file and then explore the metabolic pathways present.
m11
¶This step is very resource intensive, hence I only show examples here.
m12
¶As a bonus exercise, you can make a simple species phylogeny of the cyanobacteria you retrieved during this practical.