Binning part 2: create bins with metabat2

Now, we continue the binning procedure with metabat2. In this notebook, we will create the actual bins! We will need:

  • The scaffolds of the assembly, to bin.
  • The depth matrix we made with the jgi script in Binning part 1.
  • A new folder, to store the newly created bins.

First, remember where the first two items listed above are. Use ls in the cells below to confirm.

[DO:] Locate the scaffolds file:

In [ ]:
ls

[DO:] Locate the depth matrix:

In [ ]:
ls

[DO:] Make a new directory to store your bins. Make sure this directory is data/bins.

In [ ]:
mkdir

[DO:] Read the help page of metabat2

Find out which options you minimally have to use, then make sure you tell MetaBAT to use one thread only!

Supply your depth matrix as the --abdFile (short for abundance file).

In [ ]:
metabat2 -h
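Once you have read the help page, your command will need at least an input assembly, an abundance file, an output basename, and a thread count. As a sketch only, the file paths below are assumptions; substitute your own files:

```shell
# Minimal MetaBAT2 invocation (all paths are assumed; adjust to your own files):
metabat2 -i data/assembly/scaffolds.fasta \
         -a data/depth_matrix.txt         \
         -o data/bins/bin                 \
         -t 1
```

Note that -o takes a basename, not a directory: MetaBAT2 writes the bins next to it as bin.1.fa, bin.2.fa, and so on.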

[DO:] Bin your scaffolds with metabat2 :

In [ ]:
metabat2

[DO:] How many bins were created? Check the directory where you stored your bins.

In [ ]:

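One way to count the bins, assuming they were written as .fa files into data/bins/:

```shell
# Count the fasta files metabat2 wrote (directory assumed to be data/bins/):
ls data/bins/*.fa 2>/dev/null | wc -l
```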
[Q:] Did you get more or fewer bins than expected from the length/depth plot you made earlier?

[A:]

Visualisation

Now we have our metagenome binned! Congratulations. Let's try to visualise this similarly to what we did in Binning part 1. We will use some command line tricks to get all the data into a similar sheet. In this case, the commands are given to you already. If you feel up to the challenge, try to reverse engineer the code and understand what is happening.

[DO:] Run the code below to make data/binlist.txt

In [ ]:
# First, we make data/binlist.txt, containing only a header line for our table. 
echo -e 'bin\tcontigName' > data/binlist.txt
# Then we move to the folder in which we made the bins. 
cd ./data/bins/
# Now, we start a loop over each file that ends with `.fa`
for f in *.fa
do  # For each `.fa` file, we extract the bin number and store it in a variable that we call `name`
    name=$(echo $f | cut -d '.' -f 2)
    # Continuing in this iteration of the loop, we filter out all fasta headers and,
    # directly after filtering, replace the fasta header sign '>' with the `name`
    # variable defined earlier, followed by a tab.
    grep '>' $f | sed "s/^>/$name\t/g"
# We end the loop, sort all resulting lines at once on the second column,
# and append the sorted table to the 'binlist.txt' file we made earlier.
done | sort -k2 -V >> ../binlist.txt
cd ../../

[DO:] Check what the file looks like:

In [ ]:
head data/binlist.txt

[DO:] Compare these with the names we have in our depth_matrix.txt. They must be exactly the same to join these two different tables into one.

In [ ]:
cut -f 1  <<your original depth matrix>> | head

Now, we use the join command to join the two tables. There must be a shared field in both tables. In the first table, this is the second column -1 2, and in the second file, this is the first column -2 1.

We then take both files and give them to the join command. However, since join is very picky in how files are sorted, we re-sort them on-the-fly like so <(sort -k2d ./somefile.txt) (second column, sort as dictionary).

Lastly, since both files have headers, we supply the --header option and save the result as a new file.

Notice that I use the \ character to spread this very long join command line over several lines.
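If join is new to you, here is a minimal toy demonstration on two made-up two-column tables (the node and bin names are invented for illustration):

```shell
# Two tiny made-up tables that share their first column
printf 'NODE_1\tbinA\nNODE_2\tbinB\n' > t1.txt
printf 'NODE_1\t500\nNODE_2\t800\n'   > t2.txt
# Join them on the first column of each file
join -1 1 -2 1 t1.txt t2.txt
```

join prints the shared field once, followed by the remaining fields of both files, separated by single spaces; that is why the real command below pipes through tr to restore the tabs.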

[DO:] Fill in the path to your depth matrix below and run the code

In [ ]:
join -1 2 -2 1                                         \
     <(sort -k2d ./data/binlist.txt)                   \
     <(sort -k1d <<...your original depth matrix...>>) \
     --header                                          \
     | tr ' ' "\t"                                     \
     > ./data/binned_depth_matrix.tab

[DO:] Visualise the binned_depth_matrix.tab in Excel

  1. Like we did before, download the resulting table binned_depth_matrix.tab and open it in Excel.
  2. Sort the file by the bin number.
  3. Erase columns you do not want to visualise.
  4. Use conditional formatting to visualise depth profiles over the different samples per bin.

[Q:] Does this colour pattern make more sense than it did before?

[A:]

[Q:] Are there any outliers or mistakes you can spot?

[A:]

Bin depth

[Q:] Can you determine the depth of the six bins from this table? Why is this not simply a matter of taking the mean of all depths? Think about the difference between depth and coverage. You will need to make a pivot table in Excel/LibreOffice/Google Sheets.

[A:]
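As a sanity check for your pivot table: the depth of a bin is the length-weighted mean of its contig depths, not the plain mean. A sketch on a tiny invented table (the column layout contigName/bin/contigLen/totalAvgDepth is an assumption; check your own header):

```shell
# Tiny invented table: contigName, bin, contigLen, totalAvgDepth
printf 'contigName\tbin\tcontigLen\ttotalAvgDepth\n' >  toy_depth.tab
printf 'NODE_1\t1\t1000\t10\nNODE_2\t1\t3000\t30\n'  >> toy_depth.tab
# Length-weighted mean depth per bin: sum(len*depth)/sum(len)
awk -F'\t' 'NR>1 { len[$2]+=$3; dep[$2]+=$3*$4 }
            END  { for (b in len) printf "%s\t%.1f\n", b, dep[b]/len[b] }' toy_depth.tab
```

Here the length-weighted depth of bin 1 comes out as 25.0, whereas a plain mean of the two depth values would give 20.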

[Q:] Can you determine the depth of each bin per sample (type)? In other words, which bin is abundant in the L samples, and which bin is abundant in the P samples?

[A:]

[Q:] The research this practical is based on focusses on microbes inside the leaves (L samples). Which bins would you advise me to study further?

[A:]

Bonus: vary binning signals

If you'd like, you can try to vary input signals. For each variation, make sure you save the bins in a separate directory with a clear name.

  • First, you can try to run metabat2 without the depth matrix. How many bins do you get then?
  • You can edit the depth matrix to contain only samples from one type (L or P).
  • You can edit the depth matrix to contain only one or two replicates per sample type.

To modify your depth matrix, have a look at the columns present:

head -n 3 <<your depth matrix>>

You can select certain columns with the cut command. This example shows you how to select only one replicate of one sample type and save this as a separate depth matrix.

cut -f 1-3,10,11 data/depth_matrix.txt > data/depth_matrix_P1.txt
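To see which number belongs to which column before cutting, you can print each header field on its own numbered line. A sketch on an invented three-line header, standing in for your real depth matrix:

```shell
# Invented three-column header, standing in for your real depth matrix
printf 'contigName\tcontigLen\ttotalAvgDepth\n' > toy_matrix.txt
# Number the header fields so you know which numbers to pass to cut -f
head -n 1 toy_matrix.txt | tr '\t' '\n' | cat -n
```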