metabat2
Now, we continue the binning procedure with metabat2.
In this notebook, we will create the actual bins!
We will need the scaffolds file, the depth matrix, and a directory to store the bins.
First, remember where the first two items listed above are.
Use ls to confirm in the cell below.
[DO:] Locate the scaffolds file:
ls data/assembly
[DO:] Locate the depth matrix:
ls data/depth_matrix.tab
[DO:] Make a new directory to store your bins.
Make sure this directory is data/bins
mkdir data/bins
[DO:] Read the help page of metabat2
Find out which options you have to use minimally, then make sure you tell MetaBAT to use one thread only!
Supply your depth matrix as the --abdFile (short for abundance file).
metabat2 -h
[DO:] Bin your scaffolds with metabat2:
metabat2 -i ./data/assembly/scaffolds.fasta -o ./data/bins/bin -a ./data/depth_matrix.tab -t 1
[DO:] How many bins were created? Check the directory where you stored your bins.
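One way to count the bins without opening a file browser is to count the `.fa` files and the fasta headers inside each of them. This is a minimal sketch, assuming the bins were written as `.fa` files into a single directory as in the command above; `count_bins` is a helper name invented for this example.

```shell
# count_bins DIR: print one line per bin with its number of contigs,
# followed by the total number of bins.
# (Hypothetical helper, not part of metabat2.)
count_bins() {
    dir=$1
    n=0
    for f in "$dir"/*.fa
    do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matched
        # Every fasta record starts with '>', so counting those lines counts contigs.
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
        n=$((n + 1))
    done
    printf 'total bins\t%s\n' "$n"
}
```

Calling `count_bins data/bins` should then list each bin file with its contig count.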
[Q:] Did you get more or fewer bins than expected from the length/depth plot you made earlier?
[A:]
The depth/length plot we made in Python showed three main clusters, so we get more bins than expected from that plot.
However, the advanced plot with taxonomy added indicated at least 4 abundant bins, and possibly 4 or more low-abundance bins. Compared with that plot, we get fewer bins than expected.
See the answers of notebook m04 for more details.
Now we have our metagenome binned! Congratulations. Let's try to visualise this in a similar way to Binning part 1. We will use some command line tricks to get all data into a similar sheet. In this case, these are given to you already. If you feel up to the challenge, try to reverse engineer the code and understand what is happening.
[DO:] Run the code below to make data/binlist.txt
# First, we make a new file with a header, in which we will build our table.
echo -e 'bin\tcontigName' > data/binlist.txt
# Then we move to the folder in which we made the bins.
cd ./data/bins/
# Now, we start a loop over each file that ends with `.fa`.
for f in *.fa
do
    # For each `.fa` file, we extract the bin number and store it in a variable called `name`.
    name=$(echo "$f" | cut -d '.' -f 2)
    # Still within this iteration, we keep only the fasta headers and
    # replace the fasta header sign '>' with the `name` variable plus a tab.
    grep '>' "$f" | sed "s/^>/$name\t/g"
# We end the loop and sort all resulting rows at once on the second column;
# after sorting, we append them to the 'binlist.txt' file we made earlier.
done | sort -k2 -V >> ../binlist.txt
# Finally, we move back to the directory where we started.
cd ../../
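If the loop feels opaque, it can help to run its two key steps on a single made-up bin. The sketch below builds a toy fasta file in a temporary directory (the file name, contig names and sequences are all invented) and shows what `cut` and `sed` each contribute. It relies on GNU sed understanding `\t`, just like the loop above.

```shell
# Build a toy bin file in a scratch directory; names and sequences are invented.
scratch=$(mktemp -d)
printf '>NODE_1\nACGT\n>NODE_2\nGGCC\n' > "$scratch/bin.7.fa"

f=bin.7.fa
# Split the file name on '.' and keep the second field: the bin number.
name=$(echo "$f" | cut -d '.' -f 2)
echo "bin number: $name"

# Keep only the header lines, then swap the leading '>' for "<bin number><tab>".
rows=$(grep '>' "$scratch/$f" | sed "s/^>/$name\t/")
echo "$rows"

# Clean up the scratch directory.
rm -r "$scratch"
```

Each fasta header becomes one table row: the bin number, a tab, and the contig name.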
[DO:] Check what the file looks like:
head data/binlist.txt
[DO:] Compare these with the names we have in our depth_matrix.tab. They must be exactly the same to join the two tables into one.
cut -f 1 data/depth_matrix.tab | head
Now, we use the join command to join the two tables.
There must be a shared field in both tables.
In the first table, this is the second column (-1 2), and in the second file, this is the first column (-2 1).
We then take both files and give them to the join command.
However, since join is very picky about how files are sorted, we re-sort them on the fly like so: <(sort -k2d ./somefile.txt) (second column, sort as dictionary).
Lastly, since both files have headers, we supply the --header option and save the result as a new file.
Notice that I use the \ character to spread this very long join command line over several lines.
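To see what the join described above does before running it on real data, it can be tried on two tiny tables. This is a sketch under stated assumptions: all contig names and numbers are invented, and the sorted copies are written to temporary files instead of `<( )` so that each intermediate can be inspected. It also sorts only the table bodies and keeps the headers on top, which is a slightly more defensive variant than sorting the whole file.

```shell
# Two tiny made-up tables; every name and number here is invented.
scratch=$(mktemp -d)
printf 'bin\tcontigName\n2\tNODE_3\n1\tNODE_1\n1\tNODE_2\n' > "$scratch/binlist.txt"
printf 'contigName\tcontigLen\tdepth\nNODE_2\t500\t4.0\nNODE_1\t1000\t200.5\nNODE_3\t800\t3.2\nNODE_4\t300\t1.0\n' > "$scratch/depth.tab"

# Sort only the body of each table on its join column, keeping the header on top.
# LC_ALL=C makes sort and join agree on one byte-wise order.
{ head -n 1 "$scratch/binlist.txt"; tail -n +2 "$scratch/binlist.txt" | LC_ALL=C sort -k2,2; } > "$scratch/binlist.sorted"
{ head -n 1 "$scratch/depth.tab"; tail -n +2 "$scratch/depth.tab" | LC_ALL=C sort -k1,1; } > "$scratch/depth.sorted"

# Join column 2 of the first table to column 1 of the second,
# then turn join's space-separated output back into tabs.
joined=$(LC_ALL=C join --header -1 2 -2 1 "$scratch/binlist.sorted" "$scratch/depth.sorted" | tr ' ' '\t')
echo "$joined"

# Clean up the scratch directory.
rm -r "$scratch"
```

The shared column comes first in the output, followed by the remaining columns of the first table and then those of the second. `NODE_4`, which is in no bin, is absent from the result: `join` only keeps rows that occur in both tables.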
[DO:] Fill in the path to your depth matrix below and run the code.
join -1 2 -2 1 \
<(sort -k2d ./data/binlist.txt) \
<(sort -k1d <<...your original depth matrix...>>) \
--header \
| tr ' ' "\t" \
> ./data/binned_depth_matrix.tab
[DO:] Visualise binned_depth_matrix.tab: open it in Excel.
[Q:] Does this colour pattern make more sense than it did before?
[A:] Yes, when sorting on bin number rows with similar colour patterns are grouped together.
[Q:] Are there any outliers or mistakes you can spot?
[A:] Yes, some rows don't share the colour pattern of the majority of rows in a certain bin. These might be wrongly clustered in that bin. Perhaps the algorithm clustered these together based on k-mer profiles.
[Q:] Can you determine the depth of the six bins from this table? Why is this not simply the mean of all depths? Think about the difference between depth and coverage. You will need to make a pivot table in Excel/LibreOffice/Google Drive.
[A:] Making sure to correct for contig length, the depth per bin looks like the table below. Note that bin numbers may not coincide for your particular binning! This is normal behaviour.
bin | Bin Depth |
---|---|
1 | 205.48 |
2 | 5.00 |
3 | 3.53 |
4 | 3.87 |
5 | 4.28 |
6 | 4.62 |
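A length-weighted mean like this can also be computed on the command line instead of in a pivot table. The following is a sketch, not the method behind the numbers above. It assumes the joined table is tab-separated with one header line and has the contig name in column 1, the bin in column 2, the contig length in column 3 and the total average depth in column 4; check your own header before relying on it. `bin_depth` is a name made up for this example.

```shell
# bin_depth FILE: length-weighted mean depth per bin.
# Assumed columns (tab-separated, one header line):
#   1 contigName, 2 bin, 3 contigLen, 4 totalAvgDepth
# (Hypothetical helper; adjust the column numbers to your own table.)
bin_depth() {
    awk -F '\t' '
        NR > 1 {
            bases[$2] += $3 * $4   # length * depth = bases mapped to this contig
            len[$2]   += $3        # running total length of the bin
        }
        END {
            for (b in len) printf "%s\t%.2f\n", b, bases[b] / len[b]
        }
    ' "$1" | sort -n
}
```

Running `bin_depth data/binned_depth_matrix.tab` prints one line per bin. A long contig counts for more than a short one, which is why the result differs from the plain mean of the depth column.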
[Q:] Can you determine the depth of each bin per sample (type)? In other words, which bin is abundant in the L samples, and which bin is abundant in the P samples?
[A:] This takes some Excel or R magic, but it is certainly possible. See the example sheet linked above to see how I approached this. The final table looks like this:
In this particular order of bins, bin 1 is clearly the most abundant and is enriched inside the leaf (L) samples of the plant.
Bins 2 to 4 are substantially less abundant and seem to be less abundant in the L samples than in the P samples. The bacteria corresponding to these genomes likely live outside the plant. Bins 5 and 6 are also low in abundance, but they are enriched inside the leaf samples. The bacteria corresponding to these genomes likely live inside the plant leaves.
[Q:] The research this practical is based on focusses on microbes inside the leaves (L samples). Which bins would you advise me to study further?
[A:] Bins 5 and 6; these are bacteria enriched inside the leaves of the plant.
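For those who prefer the command line over a spreadsheet, the per-sample depths per bin can be approximated with `awk` as well. This is only a sketch under assumptions that you should verify against your own header: the joined table is tab-separated, columns 1 to 4 are contig name, bin, contig length and total average depth, the per-sample depth columns start at column 5, and variance columns have a header ending in `-var`. `bin_depth_per_sample` is an invented name.

```shell
# bin_depth_per_sample FILE: length-weighted mean depth per bin and per sample.
# Assumed layout (tab-separated, one header line): 1 contigName, 2 bin,
# 3 contigLen, 4 totalAvgDepth, then per-sample depth columns; columns whose
# header ends in "-var" are treated as variances and skipped.
# (Hypothetical helper; adjust to your own table.)
bin_depth_per_sample() {
    awk -F '\t' '
        NR == 1 {
            # Remember which columns hold sample depths, and their names.
            for (i = 5; i <= NF; i++)
                if ($i !~ /-var$/) { keep[i] = 1; sample[i] = $i }
            next
        }
        {
            len[$2] += $3
            for (i in keep) bases[$2, i] += $3 * $i
        }
        END {
            for (b in len)
                for (i in keep)
                    printf "%s\t%s\t%.2f\n", b, sample[i], bases[b, i] / len[b]
        }
    ' "$1" | sort -t "$(printf '\t')" -k1,1n -k2,2
}
```

The output is in long format (bin, sample, depth), which is easy to turn into a bin-by-sample pivot table or to compare between L and P samples.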
If you'd like, you can try to vary input signals. For each variation, make sure you save the bins in a separate directory with a clear name.
To modify your depth matrix, have a look at the columns present:
head -n 3 <<your depth matrix>>
You can select certain columns with the cut command. This example shows you how to select only one replicate of one sample type and save this as a separate depth matrix.
cut -f 1-3,10,11 data/depth_matrix.tab > data/depth_matrix_P1.tab
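Counting column positions by hand is error-prone, so as an alternative to `cut` the columns can be selected by header name. This is a minimal sketch, assuming a tab-separated depth matrix whose first three columns should always be kept; `select_samples` and the pattern in the usage note are made up for illustration, so substitute the sample names you saw in your own header.

```shell
# select_samples PATTERN FILE: keep the first three columns plus every column
# whose header matches PATTERN (an awk regular expression).
# (Hypothetical helper, not part of any tool used in this practical.)
select_samples() {
    awk -F '\t' -v OFS='\t' -v pat="$1" '
        NR == 1 {
            # Decide once, from the header, which column numbers to keep.
            n = 0
            for (i = 1; i <= NF; i++)
                if (i <= 3 || $i ~ pat) cols[++n] = i
        }
        {
            line = $(cols[1])
            for (j = 2; j <= n; j++) line = line OFS $(cols[j])
            print line
        }
    ' "$2"
}
```

For example, `select_samples '^P1' data/depth_matrix.tab > data/depth_matrix_P1.tab` would keep the columns whose header starts with `P1`, if such columns exist in your matrix. The third column is copied unchanged, so if it holds a total over all samples it no longer matches the selected columns.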