Genome annotation with prokka

Now that we have some extra information about our bins, we can continue to analyse the high-quality bins. The final CheckM results give you a good overview of which bins have low contamination and high completeness, and show the lowest taxonomic rank assigned to each bin. Pick a bin that you think is interesting to study further. Alternatively, you may write a loop to annotate multiple bins.

With the selected bin(s), we are going to do genome annotation. Whole-genome annotation is the process of identifying features of interest in a set of genomic DNA sequences and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

[DO:] make a directory for the prokka output:

In [ ]:
mkdir

[DO:] read the prokka help page. Remember to look for the usage line.

In [ ]:
prokka -h

Use the options --centre X --compliant to stop prokka's complaints about ugly contig names.

[DO:] run prokka: This may take a while.

In [ ]:
prokka
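The loop mentioned at the start of this section could look roughly like the sketch below. It is a dry run: it only prints the prokka commands so you can check them before running anything. The directory names `bins` and `prokka_out`, the `.fa` extension and the function name are assumptions for illustration, not names from this practical. The options `--outdir`, `--prefix`, `--centre` and `--compliant` are the ones listed on the prokka help page.

```shell
# Dry-run sketch: print one prokka command per bin FASTA file.
# Hypothetical layout: bins are *.fa files in "$1", output goes under "$2".
print_prokka_commands() {
    bins_dir=$1
    out_root=$2
    for fasta in "$bins_dir"/*.fa; do
        # Skip the unexpanded pattern when no .fa files are present.
        [ -e "$fasta" ] || continue
        name=$(basename "$fasta" .fa)
        # One output directory per bin; prokka will not overwrite an
        # existing --outdir unless --force is given.
        echo "prokka --outdir $out_root/$name --prefix $name --centre X --compliant $fasta"
    done
}

# Assumed directory names; adjust them to your own setup.
print_prokka_commands bins prokka_out
```

Once the printed commands look right, they can be piped into `sh` to run them one after another. This simple form assumes file names without spaces.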

Investigating prokka output

The prokka output is very elaborate and can be used to many ends. For the purpose of this practical, we will only visualise it quickly. To investigate the prokka output, you can use two web servers that can both place the prokka annotations in KEGG metabolic pathways. First, we'll inspect the contents of a GFF file.

[DO:] View prokka output: (look for the .gff file)

In [ ]:
grep -v '#' ....yourprokkaoutput.gff | head
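To get an overview beyond the first ten lines, one option is to tally the feature types in column 3 of the GFF. The sketch below stops reading at the `##FASTA` line, because prokka appends the contig sequences to the end of its GFF file. The sample lines and the function name are made up for illustration; point the function at your own .gff file instead.

```shell
# Tally feature types (GFF column 3), skipping comment lines and
# everything from the ##FASTA marker onwards.
count_feature_types() {
    awk -F'\t' '/^##FASTA/ { exit } !/^#/ { print $3 }' "$1" | sort | uniq -c
}

# Made-up sample in GFF3 layout, only for demonstration.
gff=$(mktemp)
{
    printf '##gff-version 3\n'
    printf 'contig_1\tProdigal:2.6\tCDS\t10\t400\t.\t+\t0\tID=DEMO_00001\n'
    printf 'contig_1\tProdigal:2.6\tCDS\t450\t900\t.\t-\t0\tID=DEMO_00002\n'
    printf 'contig_1\tAragorn:1.2\ttRNA\t950\t1025\t.\t+\t.\tID=DEMO_00003\n'
    printf '##FASTA\n>contig_1\nACGT\n'
} > "$gff"

count_feature_types "$gff"
```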

Visualisation in IPath3 on pathways.embl.de

Prokka reports UniProt IDs in its GFF files. First we will collect these IDs and discard all other information. However, we need to make sure that IPath3 understands that these are UniProt IDs, so for visualisation on pathways.embl.de we also add the prefix 'UNIPROT:' to every ID in the list. The code below does both.

[DO:] take the uniprot IDs out of the gff file:

In [ ]:
grep -o 'UniProt.*' ....yourprokkaoutput | cut -d';' -f1 | cut -d':' -f2 | sed 's/^/UNIPROT:/g'
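It may help to see what each stage of that pipeline does to a single line. The sketch below wraps the same steps in a function that reads from standard input and annotates each stage. The input line is made up in the style of a prokka attribute column, and the accession `P00000` is only a placeholder. The final `sort -u` is an extra step that is not part of the command above; it removes IDs that occur more than once.

```shell
# Same steps as the command above, reading GFF lines from stdin.
uniprot_ids_for_ipath() {
    grep -o 'UniProt.*' |     # from "UniProt" to the end of the line
        cut -d';' -f1 |       # keep the text before the first ';' -> UniProtKB:P00000
        cut -d':' -f2 |       # keep the part after the ':'        -> P00000
        sed 's/^/UNIPROT:/' | # add the prefix for IPath3          -> UNIPROT:P00000
        sort -u               # extra step: drop duplicate IDs
}

# Made-up attribute column; the accession is a placeholder.
line='ID=DEMO_00001;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P00000;locus_tag=DEMO_00001;product=example protein'

# The same line twice still yields a single ID.
printf '%s\n%s\n' "$line" "$line" | uniprot_ids_for_ipath
```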

[DO:] Go to the website linked above and paste your IDs into the search field. You can now browse through the pathways encoded by those genes.

Visualising in KEGG

Visualising in KEGG (the Kyoto Encyclopedia of Genes and Genomes) allows us to zoom in further on the metabolism. The best way to approach this is to take all protein sequences produced by prokka (the .faa file) and do a protein BLAST search against the KEGG database with BlastKOALA. This takes quite some time and requires an account, so I did this for you already. The files are available in the data/blastKOala directory. Note that the bin numbers may not match up between my example and your own run; the files only illustrate what is possible with the method.

[DO:] Upload and visualise these simple tables with the KEGG Reconstruct Pathway tool

In [ ]:
ls data/blastKOala
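Before uploading, it can be useful to check how many genes in a table actually received a KEGG Orthology (K) number. The sketch below assumes that the tables follow the two-column layout that the KEGG reconstruct tool accepts: a gene ID, optionally followed by a K number, with genes without an assignment listed by ID only. Check one of the files to see whether this holds. The sample rows and the function name are made up for illustration.

```shell
# Count genes with and without a K number in a "gene [KO]" table.
# Assumes one gene per line; a second field, if present, is the K number.
summarise_ko_table() {
    awk 'NF == 0 { next }                  # ignore blank lines
         NF >= 2 { annotated++; next }     # gene ID plus K number
                 { unannotated++ }         # gene ID only
         END { printf "annotated=%d unannotated=%d\n", annotated, unannotated }' "$1"
}

# Made-up sample table, only for demonstration.
ko=$(mktemp)
{
    printf 'DEMO_00001\tK00001\n'
    printf 'DEMO_00002\n'
    printf 'DEMO_00003\tK00002\n'
} > "$ko"

summarise_ko_table "$ko"
```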

You may feel overwhelmed by the number of pathways, modules and genes available to you. For this specific case, we are interested in nitrogen metabolism. Have a look at the Dijkhuizen et al. 2018 paper on Azolla endophytes and find the figure that shows the nitrogen metabolism of multiple microbes. Next, some hypotheses are derived and tested in the wet lab. Does your plot of the nitrogen metabolism overlap with the published one? Or did you maybe discover a new endophyte?