Get the read data from the EBI Sequence Read Archive and make subsets of it

Make a temporary directory

In [1]:
mkdir -p ./data/fullreads

Download the files. This may take a while! Check in your file browser whether the files are downloading.

In [2]:
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2114812/ERR2114812_1.fastq.gz -O ./data/fullreads/P1.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2114812/ERR2114812_2.fastq.gz -O ./data/fullreads/P1.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2114811/ERR2114811_1.fastq.gz -O ./data/fullreads/P2.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2114811/ERR2114811_2.fastq.gz -O ./data/fullreads/P2.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2114810/ERR2114810_1.fastq.gz -O ./data/fullreads/P3.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2114810/ERR2114810_2.fastq.gz -O ./data/fullreads/P3.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2114809/ERR2114809_1.fastq.gz -O ./data/fullreads/L1.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2114809/ERR2114809_2.fastq.gz -O ./data/fullreads/L1.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2114808/ERR2114808_1.fastq.gz -O ./data/fullreads/L2.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2114808/ERR2114808_2.fastq.gz -O ./data/fullreads/L2.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2114807/ERR2114807_1.fastq.gz -O ./data/fullreads/L3.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2114807/ERR2114807_2.fastq.gz -O ./data/fullreads/L3.R2.fastq.gz
--2022-03-22 11:46:52--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2114812/ERR2114812_1.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 963597040 (919M) [application/octet-stream]
Saving to: ‘./data/fullreads/P1.R1.fastq.gz’

./data/fullreads/P1  21%[===>                ] 197,39M  --.-KB/s    in 53s     

2022-03-22 11:47:47 (3,70 MB/s) - Read error at byte 206975832/963597040 (Connection reset by peer). Retrying.

--2022-03-22 11:47:48--  (try: 2)  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2114812/ERR2114812_1.fastq.gz
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 963597040 (919M), 756621208 (722M) remaining [application/octet-stream]
Saving to: ‘./data/fullreads/P1.R1.fastq.gz’

./data/fullreads/P1 100%[++++===============>] 918,96M  5,71MB/s    in 2m 9s   

2022-03-22 11:49:57 (5,58 MB/s) - ‘./data/fullreads/P1.R1.fastq.gz’ saved [963597040/963597040]

--2022-03-22 11:49:57--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2114812/ERR2114812_2.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 970082343 (925M) [application/octet-stream]
Saving to: ‘./data/fullreads/P1.R2.fastq.gz’

./data/fullreads/P1  23%[===>                ] 219,98M   139KB/s    in 2m 6s   

2022-03-22 11:52:04 (1,74 MB/s) - Connection closed at byte 230667436. Retrying.

--2022-03-22 11:52:05--  (try: 2)  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2114812/ERR2114812_2.fastq.gz
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 970082343 (925M), 739414907 (705M) remaining [application/octet-stream]
Saving to: ‘./data/fullreads/P1.R2.fastq.gz’

./data/fullreads/P1 100%[++++===============>] 925,14M  8,83MB/s    in 86s     

2022-03-22 11:53:31 (8,24 MB/s) - ‘./data/fullreads/P1.R2.fastq.gz’ saved [970082343/970082343]

--2022-03-22 11:53:31--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2114811/ERR2114811_1.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1054107626 (1005M) [application/octet-stream]
Saving to: ‘./data/fullreads/P2.R1.fastq.gz’

./data/fullreads/P2 100%[===================>]   1005M  9,41MB/s    in 1m 57s  

2022-03-22 11:55:28 (8,62 MB/s) - ‘./data/fullreads/P2.R1.fastq.gz’ saved [1054107626/1054107626]

--2022-03-22 11:55:28--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2114811/ERR2114811_2.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1061632388 (1012M) [application/octet-stream]
Saving to: ‘./data/fullreads/P2.R2.fastq.gz’

./data/fullreads/P2 100%[===================>]   1012M  5,72MB/s    in 3m 6s   

2022-03-22 11:58:35 (5,44 MB/s) - ‘./data/fullreads/P2.R2.fastq.gz’ saved [1061632388/1061632388]

--2022-03-22 11:58:35--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2114810/ERR2114810_1.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 800418381 (763M) [application/octet-stream]
Saving to: ‘./data/fullreads/P3.R1.fastq.gz’

./data/fullreads/P3 100%[===================>] 763,34M  5,71MB/s    in 2m 18s  

2022-03-22 12:00:53 (5,54 MB/s) - ‘./data/fullreads/P3.R1.fastq.gz’ saved [800418381/800418381]

--2022-03-22 12:00:53--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2114810/ERR2114810_2.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 808762868 (771M) [application/octet-stream]
Saving to: ‘./data/fullreads/P3.R2.fastq.gz’

./data/fullreads/P3 100%[===================>] 771,30M  9,49MB/s    in 98s     

2022-03-22 12:02:32 (7,84 MB/s) - ‘./data/fullreads/P3.R2.fastq.gz’ saved [808762868/808762868]

--2022-03-22 12:02:32--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2114809/ERR2114809_1.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1036588497 (989M) [application/octet-stream]
Saving to: ‘./data/fullreads/L1.R1.fastq.gz’

./data/fullreads/L1 100%[===================>] 988,57M  9,48MB/s    in 1m 54s  

2022-03-22 12:04:27 (8,65 MB/s) - ‘./data/fullreads/L1.R1.fastq.gz’ saved [1036588497/1036588497]

--2022-03-22 12:04:27--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2114809/ERR2114809_2.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1048892630 (1000M) [application/octet-stream]
Saving to: ‘./data/fullreads/L1.R2.fastq.gz’

./data/fullreads/L1 100%[===================>]   1000M  5,72MB/s    in 3m 1s   

2022-03-22 12:07:29 (5,53 MB/s) - ‘./data/fullreads/L1.R2.fastq.gz’ saved [1048892630/1048892630]

--2022-03-22 12:07:29--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2114808/ERR2114808_1.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 882809536 (842M) [application/octet-stream]
Saving to: ‘./data/fullreads/L2.R1.fastq.gz’

./data/fullreads/L2 100%[===================>] 841,91M  5,70MB/s    in 2m 34s  

2022-03-22 12:10:03 (5,47 MB/s) - ‘./data/fullreads/L2.R1.fastq.gz’ saved [882809536/882809536]

--2022-03-22 12:10:03--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2114808/ERR2114808_2.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 890205497 (849M) [application/octet-stream]
Saving to: ‘./data/fullreads/L2.R2.fastq.gz’

./data/fullreads/L2 100%[===================>] 848,97M  5,71MB/s    in 2m 39s  

2022-03-22 12:12:43 (5,32 MB/s) - ‘./data/fullreads/L2.R2.fastq.gz’ saved [890205497/890205497]

--2022-03-22 12:12:43--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2114807/ERR2114807_1.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 955266242 (911M) [application/octet-stream]
Saving to: ‘./data/fullreads/L3.R1.fastq.gz’

./data/fullreads/L3 100%[===================>] 911,01M  4,05MB/s    in 2m 49s  

2022-03-22 12:15:33 (5,39 MB/s) - ‘./data/fullreads/L3.R1.fastq.gz’ saved [955266242/955266242]

--2022-03-22 12:15:33--  http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2114807/ERR2114807_2.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 964610851 (920M) [application/octet-stream]
Saving to: ‘./data/fullreads/L3.R2.fastq.gz’

./data/fullreads/L3 100%[===================>] 919,92M  9,41MB/s    in 2m 0s   

2022-03-22 12:17:33 (7,66 MB/s) - ‘./data/fullreads/L3.R2.fastq.gz’ saved [964610851/964610851]
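
The twelve download commands above all follow one pattern, so they can also be generated from a small table of accessions and sample names. The sketch below is only an illustration: the helper name `ena_url` is made up here, and the directory layout it assumes (first six characters of the accession, then `00` plus its last digit) is inferred from the URLs in this notebook, so it may not hold for other accessions. The loop prints the `wget` commands instead of running them; remove the `echo` to actually download.

```shell
# Build the download URL for one accession and read number (1 or 2).
# Assumption: layout inferred from the URLs used above; only checked
# against 10-character accessions such as ERR2114812.
ena_url() {
    acc=$1
    read_nr=$2
    prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. ERR211
    last=${acc#"${acc%?}"}                     # last character, e.g. 2
    printf 'ftp.sra.ebi.ac.uk/vol1/fastq/%s/00%s/%s/%s_%s.fastq.gz\n' \
        "$prefix" "$last" "$acc" "$acc" "$read_nr"
}

# Accession / sample name pairs, taken from the commands above.
while read -r acc sample; do
    for r in 1 2; do
        # Dry run: print the command. Drop "echo" to download for real.
        echo wget "$(ena_url "$acc" "$r")" -O "./data/fullreads/$sample.R$r.fastq.gz"
    done
done <<'EOF'
ERR2114812 P1
ERR2114811 P2
ERR2114810 P3
ERR2114809 L1
ERR2114808 L2
ERR2114807 L3
EOF
```

Keeping the accession-to-sample mapping in one place makes it easier to spot a mistyped accession than scanning twelve long URLs.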

If you want the calculation steps to be fast (recommended), execute the cells below to make subsets of 1 million reads per file (4 million FASTQ lines, since each read takes four lines).

In [3]:
mkdir -p ./data/reads
for f in ./data/fullreads/*.fastq.gz
do  name=$(basename "$f" .fastq.gz)   # strip folder and extension, e.g. P1.R1
    echo "subsetting $name"
    # keep the first 4,000,000 lines = 1 million reads (4 lines per FASTQ record)
    zcat "$f" 2> /dev/null | head -n 4000000 | gzip -c > "./data/reads/$name.fastq.gz"
done
subsetting L1.R1
subsetting L1.R2
subsetting L2.R1
subsetting L2.R2
subsetting L3.R1
subsetting L3.R2
subsetting P1.R1
subsetting P1.R2
subsetting P2.R1
subsetting P2.R2
subsetting P3.R1
subsetting P3.R2

Check that the files were subsetted correctly, then remove the ./data/fullreads folder.

In [4]:
ls -sh ./data/reads
total 1,1G
87M L1.R1.fastq.gz  87M L2.R2.fastq.gz  87M P1.R1.fastq.gz  87M P2.R2.fastq.gz
87M L1.R2.fastq.gz  87M L3.R1.fastq.gz  88M P1.R2.fastq.gz  88M P3.R1.fastq.gz
87M L2.R1.fastq.gz  87M L3.R2.fastq.gz  87M P2.R1.fastq.gz  89M P3.R2.fastq.gz
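
File sizes alone do not show whether each subset is intact. As an optional extra check, the sketch below tests every subset with `gzip -t` and counts its reads, which should come out at 1000000 per file if the subsetting worked. The helper name `count_reads` is made up for this example, and it assumes plain four-line FASTQ records.

```shell
# Count reads in a gzipped FASTQ file.
# Assumption: every record takes exactly four lines.
count_reads() {
    lines=$(gzip -cd "$1" | wc -l)
    echo $(( lines / 4 ))
}

for f in ./data/reads/*.fastq.gz; do
    [ -e "$f" ] || continue   # skip if the folder is empty or missing
    # gzip -t fails on a truncated or corrupt archive
    gzip -t "$f" && echo "$(basename "$f"): $(count_reads "$f") reads"
done
```

Only remove the full-size files once every subset is listed with the expected read count.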
In [5]:
rm -rf ./data/fullreads

If you don't mind slower steps and want to work with the full-size files (go you!), don't execute the subsetting cells above, or at least don't remove the ./data/fullreads folder. Instead, use the FASTQ files in the ./data/fullreads folder whenever the practical points you to the ./data/reads folder. That's all.
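
One way to avoid editing paths by hand is to keep the chosen folder in a variable. This is only a sketch: the names `pick_reads_dir` and `READS_DIR` are made up here and are not used elsewhere in the practical, so you would still have to substitute `$READS_DIR` for `./data/reads` in later commands yourself.

```shell
# Prefer the full-size files when they are still present,
# otherwise fall back to the 1-million-read subsets.
pick_reads_dir() {
    if ls ./data/fullreads/*.fastq.gz > /dev/null 2>&1; then
        echo ./data/fullreads
    else
        echo ./data/reads
    fi
}

READS_DIR=$(pick_reads_dir)
echo "using reads from $READS_DIR"
```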