How to work in a Jupyter notebook with the Bash language

The webpage you are looking at is called a Jupyter notebook. It is a webpage on which you can write text (like this text) and also code. You can execute the code in the webpage, and the output returns to you within the same notebook! This may sound trivial, but it's "really cool" to put it in non-scientific terms. Code and text are entered in individual cells, a code cell or a text cell. This cell you are reading now is a text cell. Next, let's look at a code cell and execute it. There are two ways of executing a cell. First you select a cell, either with your mouse or with the up and down keys on your keyboard. Then, you execute it by hitting the 'run' button in the toolbar above, or you hit CTRL+RETURN.

Jupyter

First we'll learn how to work with Jupyter notebooks, then we'll move on to learn the basics of Bash.

Working with Cells in Jupyter is quite straightforward. You learn best by doing, so do all of the things listed below:

  • You can select a cell with your mouse or the arrow keys on your keyboard.
  • You can edit a cell by hitting RETURN or by double-clicking it.
  • A new cell is a code cell by default; turn it into a markdown (text) cell by hitting 'm'.
  • A code cell can be executed by hitting CTRL+RETURN.
  • A markdown cell can be rendered by hitting CTRL+RETURN.
  • Add an additional cell by hitting the '+' button in the toolbar.
  • Add an additional cell by clicking between two cells
  • Add an additional cell by using the keyboard
    • add a cell below by hitting the 'b' key
    • and above by hitting the 'a' key
  • Whenever your notebook becomes unresponsive, you can interrupt the kernel (the underlying programme that runs your code) by
    • hitting the square stop button in the toolbar
    • clicking 'restart' or 'interrupt' in the 'Kernel' menu in your menu bar.
    • clicking 'close and halt' in the File menu.

Try creating a new cell below this cell. Make at least a text cell and a code cell.

Jupyter kernels

In the metagenomics practical, we will be working mostly in Linux's mother tongue: Bash. However, juPYter notebooks work natively in PYthon. Still, we can work in Bash. Often, this works by itself. Sometimes, you will need to either type %%bash at the beginning of a code cell or precede every Bash command with an exclamation mark. Now, let's get to work and learn some Bash!

bash basics

Execute the code cell below:

In [1]:
echo "hello world"
hello world

In the bash language, the first word you type is always the command. So, in this case, that was:

echo

This command 'echoes' whatever you give it in the terminal. After you 'call' the command, you give it an argument; in this case, that is:

"hello world"

This basic structure of a 'command' followed by its 'arguments' recurs throughout the metagenomics practical.
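To see where one argument ends and the next begins, here is a small sketch that compares echo with and without quotes. The extra spaces are only there for illustration.

```bash
# Without quotes, Bash splits the text on spaces: echo receives two
# arguments (hello and world) and prints them with one space in between.
echo hello      world

# With quotes, everything between them is a single argument,
# so the extra spaces are kept.
echo "hello      world"
```

This is why text or file names that contain spaces are usually wrapped in quotes.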

do: now try to change "hello world" to something else in the cell below:

In [2]:
echo "hello world"
hello world

Often, an argument is a path to a file. We have the ls command to see what files we have.

In [3]:
ls
data
docs
environment.yml
LICENSE
m00-prepare_download_and_subset_reads.ipynb
m01-introduction.ipynb
m02-jupyter_and_bash_basics.ipynb
m03-assess_raw_data.ipynb
m04-plot_assembly_length.ipynb
m05-backmapping.ipynb
m06-sorting_bamfiles.ipynb
m07-binning_part1.ipynb
m08-binning-part2.ipynb
m09-QC_checkm.ipynb
m10-annotation.ipynb
m11-bonus_exercise_bin_taxonomy.ipynb
m12-bonus_exercise_phylogeny_of_bins
README.md

We are learning fast. So we now know what a command is, and we know what an argument is. Finally, we also need to know what options are. Options are optional extra information that we pass to the command. Options are often provided in between the command and the argument. They look either like this

ls --size --human-readable

or in shortened versions like this

ls -sh

Note that the two commands above are equivalent. Try them out in the cells below:

In [4]:
ls --size --human-readable data/
total 636K
   0 assembly     0 mapped     0 reads  636K workflowsketch.png
In [5]:
ls -sh ./data/
total 636K
   0 assembly     0 mapped     0 reads  636K workflowsketch.png

Commands, options and arguments are separated by spaces. Also note that options can have their own arguments. If this is the case, the manual or help page will specify this. We will get to the manual and help pages later.
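As a rough illustration of an option that takes its own argument, the sketch below uses the head command on a small throwaway file. The file, its contents and the variable names are made up for this example, and it borrows a few things that are not explained at this point (mktemp, printf, > and $(...)), so read it mainly for the two head lines.

```bash
# Create a throwaway file with three lines (illustration only).
tmpfile=$(mktemp)
printf 'L1\nL2\nL3\n' > "$tmpfile"

# -n is an option of head, and 2 is the argument that belongs to -n.
# The file path is the argument of the head command itself.
short_form=$(head -n 2 "$tmpfile")
echo "$short_form"

# The long form of the same option attaches its argument with '='.
long_form=$(head --lines=2 "$tmpfile")
echo "$long_form"

# Clean up the throwaway file.
rm "$tmpfile"
```

Both calls should print the same two lines, since -n and --lines are the short and long spelling of one option.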

auto-complete

Auto-complete is one of the best features of the bash language and your greatest friend during this practical. Let's say we want to list (ls) the contents of the data/ folder but are too lazy to type the whole word 'data/'. Then we can type

ls da

and then hit the TAB button on your keyboard. Bash should either automatically complete the path to

ls data/

or if there are multiple options to auto-complete, bash will give you a little menu with these options.

Using auto-complete not only makes your life a lot easier, it also prevents you from making typos! If auto-complete doesn't work, odds are something in your command or argument is wrong. Best to check before you proceed!

Try out auto-complete below

In [6]:
ls data
assembly  mapped  reads  workflowsketch.png

pipes

Bash can hand the output of one program to another. This is called piping. If you pipe the output of multiple programs to each other, you have made a 'pipeline'. Pipelines look somewhat like this

command1 | command2 | command3

One trick with pipes that we will often use is the | head pipe. This pipe shows you only the first ten lines of the output of some command. | head -n 1 changes this number to 1. See for yourself below.

In [7]:
ls -1 data/reads/
L1.R1.fastq.gz
L1.R2.fastq.gz
L2.R1.fastq.gz
L2.R2.fastq.gz
L3.R1.fastq.gz
L3.R2.fastq.gz
P1.R1.fastq.gz
P1.R2.fastq.gz
P2.R1.fastq.gz
P2.R2.fastq.gz
P3.R1.fastq.gz
P3.R2.fastq.gz
In [8]:
ls -1 data/reads/ | head -n 1
L1.R1.fastq.gz
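A pipeline with more than one pipe could look like the sketch below. It does not use the practical's data: printf simply prints three made-up sample names, one per line, and sort is a standard command that was not introduced above.

```bash
# printf prints three sample names, deliberately out of order.
# sort puts the lines in alphabetical order,
# and head -n 2 keeps only the first two lines of the sorted output.
printf 'P1\nL2\nL1\n' | sort | head -n 2
```

Each command only sees the output of the command directly to its left.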

loops

Loops are a useful feature you'll find in most (if not all) programming languages. They are also quite intuitive to use. A loop simply is a series of commands that does the same thing multiple times, but with some small adaptation. Let's make a loop together. First, we need to have two concepts clear

  • variable
  • array

A variable is a specific word that means something else; this something may vary. Hence the name. We can specify a variable like this:

variable1=coffee

To refer to the content of a variable, we use a $ sign. So this looks like so

echo $variable1

Now enter the cell below and try for yourself. You can name a variable anything you want.

In [9]:
variable1=coffee
echo $variable1
coffee

If you are working in a Bash kernel, your variables will be remembered throughout the entire notebook. If you work with a Python kernel, your variables are only remembered within one cell. Check which kernel this notebook is running in the top right of this page.

An array is a list of values stored under one name; it's that simple. To make an array, we type something like this

samples=(L1 L2 L3)

To refer back to an array, we type this

echo ${samples[@]}

This looks a bit more complicated. The [@] part means: 'all contents in the array'. A number between the brackets picks a single element instead, and Bash starts counting at 0. Hence, if you type echo ${samples[0]}, you will only get the first element in the array. Again, try for yourself below in a new cell.

In [10]:
samples=(L1 L2 L3)
In [11]:
echo ${samples[@]}
L1 L2 L3
In [12]:
echo ${samples[0]}
L1
In [13]:
echo ${samples[2]}
L3
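Two more array operations that may come in handy are counting the elements and appending new ones. Neither is used in the text above, so take this as an optional sketch; the extra sample names P1 to P3 are borrowed from the file listing earlier in this notebook.

```bash
samples=(L1 L2 L3)

# ${#samples[@]} gives the number of elements in the array.
echo "${#samples[@]}"

# += appends new elements to the end of an existing array.
samples+=(P1 P2 P3)
echo "${samples[@]}"
echo "${#samples[@]}"
```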

Now we get to loops. Let's keep it simple: I will define a loop for you, and you can see how it works.

In [1]:
break=(coffee tea cookies)
In [2]:
for   i in ${break[@]}
do    echo $i
done
coffee
tea
cookies

Do you get the loop? Make sure you do. You will write your own loops in the following parts of the practical.
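As a sketch of how a loop, a variable and an array can work together, the example below builds one file path per sample. It only prints text and does not touch any files. The path pattern is an assumption based on the listing of data/reads shown earlier.

```bash
samples=(L1 L2 L3)

# In each round of the loop, $sample holds the next element of the array.
for sample in "${samples[@]}"
do
    # ${sample} is the same as $sample; the braces mark where the name ends.
    echo "data/reads/${sample}.R1.fastq.gz"
done
```

The double quotes around "${samples[@]}" keep each element intact even if it contains spaces. Without them the result is the same for simple names like these.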

Have you completed all exercises above? Then move on to this:

Bash basics extra

  • wildcards*
  • base filenames
  • paths
  • manual/help pages

wildcards

Wildcards can be used on the command line. The * wildcard matches any sequence of characters. For example, list every folder and file inside the ./data/ folder:

In [3]:
ls ./data/*
./data/workflowsketch.png

./data/assembly:
scaffolds.fasta      scaffolds.fasta.bwt  scaffolds.fasta.sa
scaffolds.fasta.amb  scaffolds.fasta.gz
scaffolds.fasta.ann  scaffolds.fasta.pac

./data/mapped:
E1.mapped.bam  E3.mapped.bam  P2.mapped.bam
E2.mapped.bam  P1.mapped.bam  P3.mapped.bam

./data/reads:
L1.R1.fastq.gz  L2.R2.fastq.gz  P1.R1.fastq.gz  P2.R2.fastq.gz
L1.R2.fastq.gz  L3.R1.fastq.gz  P1.R2.fastq.gz  P3.R1.fastq.gz
L2.R1.fastq.gz  L3.R2.fastq.gz  P2.R1.fastq.gz  P3.R2.fastq.gz

Or list every file in ./data/reads that ends in .gz:

In [4]:
ls ./data/reads/*.gz
./data/reads/L1.R1.fastq.gz  ./data/reads/P1.R1.fastq.gz
./data/reads/L1.R2.fastq.gz  ./data/reads/P1.R2.fastq.gz
./data/reads/L2.R1.fastq.gz  ./data/reads/P2.R1.fastq.gz
./data/reads/L2.R2.fastq.gz  ./data/reads/P2.R2.fastq.gz
./data/reads/L3.R1.fastq.gz  ./data/reads/P3.R1.fastq.gz
./data/reads/L3.R2.fastq.gz  ./data/reads/P3.R2.fastq.gz

Now, list every file in ./data/reads that starts with L and ends in .gz

In [5]:
ls data/reads/L*.gz
data/reads/L1.R1.fastq.gz  data/reads/L2.R1.fastq.gz  data/reads/L3.R1.fastq.gz
data/reads/L1.R2.fastq.gz  data/reads/L2.R2.fastq.gz  data/reads/L3.R2.fastq.gz
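The sketch below shows two more wildcard patterns. To stay independent of the ./data folder it works in a throwaway folder with empty, made-up files that only mimic the read names. The ? wildcard and the array assignments are additions that the text above does not cover.

```bash
# Make a throwaway folder with four empty example files.
tmpdir=$(mktemp -d)
touch "$tmpdir"/L1.R1.fastq.gz "$tmpdir"/L1.R2.fastq.gz \
      "$tmpdir"/L2.R1.fastq.gz "$tmpdir"/P1.R1.fastq.gz

# * matches any sequence of characters, so this lists all R1 files.
ls "$tmpdir"/*.R1.fastq.gz

# ? matches exactly one character, so this lists both read files of L1.
ls "$tmpdir"/L1.R?.fastq.gz

# A wildcard can also fill an array: one element per matching file.
forward=("$tmpdir"/*.R1.fastq.gz)
pair=("$tmpdir"/L1.R?.fastq.gz)
echo "${#forward[@]} forward files, ${#pair[@]} files for L1"

# Clean up the throwaway folder.
rm -r "$tmpdir"
```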

base filenames

The base of a filename is the part before the extension or extensions; you will need this later in the practical.
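There are several ways to get a base filename in Bash. The sketch below shows two common ones, the basename command and the ${variable%%pattern} expansion. Which part of the name the later notebooks treat as the 'base' is an assumption here, so both variants are shown. The path is taken from the data/reads listing, and the file does not need to exist for this to work.

```bash
path=data/reads/L1.R1.fastq.gz

# basename strips the leading folders from a path.
filename=$(basename "$path")
echo "$filename"

# With a suffix as second argument, basename strips that suffix too.
base=$(basename "$path" .fastq.gz)
echo "$base"

# ${filename%%.*} removes everything from the first dot onward.
sample=${filename%%.*}
echo "$sample"
```

The three echo lines print L1.R1.fastq.gz, L1.R1 and L1, respectively.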

paths

As we have seen now, you can specify folders with a /. You can move from folder to folder. If you ever wonder what folder you are in now, you can 'print working directory' with pwd.

In [6]:
pwd
/home/laura/gitprojects/metagenomicspractical

The current folder you are in is denoted as a dot: .
Hence, if you type a path like ./data/reads you tell the computer explicitly to start in the current folder, then move to the data folder, and then move to the reads folder. This dot is not required; data/reads means exactly the same. Try in the two cells below:

In [1]:
ls ./data/reads
L1.R1.fastq.gz  L2.R2.fastq.gz  P1.R1.fastq.gz  P2.R2.fastq.gz
L1.R2.fastq.gz  L3.R1.fastq.gz  P1.R2.fastq.gz  P3.R1.fastq.gz
L2.R1.fastq.gz  L3.R2.fastq.gz  P2.R1.fastq.gz  P3.R2.fastq.gz
In [2]:
ls data/reads
L1.R1.fastq.gz  L2.R2.fastq.gz  P1.R1.fastq.gz  P2.R2.fastq.gz
L1.R2.fastq.gz  L3.R1.fastq.gz  P1.R2.fastq.gz  P3.R1.fastq.gz
L2.R1.fastq.gz  L3.R2.fastq.gz  P2.R1.fastq.gz  P3.R2.fastq.gz

If you type ls /, then you ask the computer to list the root of the filesystem, the highest level on the hard drive. This is somewhat like C:\ on Windows computers.

Whenever you see a manual page or a prewritten command with code like this

somecommand /path/to/file

Then it is implied that you substitute the /path/to/file with a path to a file you want to use or create.

If you get errors like file not found, perhaps check where you are with pwd and ls to see if you accidentally moved somewhere you did not mean to.

change your working directory

Although you won't need to in this practical, you can change working directories. I designed this practical not to bother you with this, but in the long run it is important to be aware of where you are in a directory structure. You can Change Directory with the command cd. For example, if you want to move into a data folder, you may type cd ./data. If you want to move back up again, you type cd .. (here .. means one folder up). You can also use this with ls: ls .. lists the contents of the folder above. If you are completely lost, just type cd without any argument; this will take you back to your home directory. If you inadvertently move directories and you need to get back, you now know how to do so.
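If you want to try this safely, a sketch like the one below moves into a throwaway folder and back again. The folder comes from mktemp, and the variable names are chosen for this example only.

```bash
# Remember where we started, so we can return at the end.
start=$(pwd)

# Make a throwaway folder that contains a data folder.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/data"

# Move into the data folder and check where we are.
cd "$tmpdir/data"
inside=$(pwd)
echo "$inside"

# .. means one folder up, so this takes us to the throwaway folder itself.
cd ..
parent=$(pwd)
echo "$parent"

# Go back to where we started and clean up.
cd "$start"
rm -r "$tmpdir"
```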

help and manual pages

Whenever you are asked to use some command or programme and you don't know exactly how it works, you can ask the computer for help in several ways:

  • type the command without any argument or options
  • type the command with option --help
  • type the command with option -h
  • get the manual page with man some-command

Not all of these work for every command, but usually at least one of them does; it is a matter of trial and error.

On these webpages, the man command doesn't work too well. Better to stick to the --help pages.

In [3]:
head --help
Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.

With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -c, --bytes=[-]NUM       print the first NUM bytes of each file;
                             with the leading '-', print all but the last
                             NUM bytes of each file
  -n, --lines=[-]NUM       print the first NUM lines instead of the first 10;
                             with the leading '-', print all but the last
                             NUM lines of each file
  -q, --quiet, --silent    never print headers giving file names
  -v, --verbose            always print headers giving file names
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

NUM may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation at: <https://www.gnu.org/software/coreutils/head>
or available locally via: info '(coreutils) head invocation'

Quite often, you'll find a 'usage' line at the top of the help page. This tells you how to use the command. In the example of head, it tells you first to type head, then any options, and then any file. Entries in [square brackets] are optional, and three dots (...) after an entry mean you may give more than one. Entries without any brackets, or with <angle brackets>, are required.

That's it!

You are now ready to work with Bash in Jupyter notebooks! Congratulations. Whenever you get stuck in the subsequent notebooks that use Bash code, maybe come back here for advice.