Get the read data from the EBI Sequence Read Archive and make subsets of them

Make a temporary directory

In [ ]:
mkdir -p ./data/fullreads

Download the files. This may take a while! Check in your file browser whether the files are downloading.

In [ ]:
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2114812/ERR2114812_1.fastq.gz -O ./data/fullreads/P1.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/002/ERR2114812/ERR2114812_2.fastq.gz -O ./data/fullreads/P1.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2114811/ERR2114811_1.fastq.gz -O ./data/fullreads/P2.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/001/ERR2114811/ERR2114811_2.fastq.gz -O ./data/fullreads/P2.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2114810/ERR2114810_1.fastq.gz -O ./data/fullreads/P3.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/000/ERR2114810/ERR2114810_2.fastq.gz -O ./data/fullreads/P3.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2114809/ERR2114809_1.fastq.gz -O ./data/fullreads/E1.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/009/ERR2114809/ERR2114809_2.fastq.gz -O ./data/fullreads/E1.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2114808/ERR2114808_1.fastq.gz -O ./data/fullreads/E2.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/008/ERR2114808/ERR2114808_2.fastq.gz -O ./data/fullreads/E2.R2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2114807/ERR2114807_1.fastq.gz -O ./data/fullreads/E3.R1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/ERR211/007/ERR2114807/ERR2114807_2.fastq.gz -O ./data/fullreads/E3.R2.fastq.gz
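An interrupted download can leave a truncated file behind. As an optional check, the sketch below runs `gzip -t` on every archive in a folder and reports the ones that fail. The function name `check_gzip_files` is made up for this example and is not part of the practical.

```shell
# Hypothetical helper: test every *.fastq.gz file in the given folder.
# gzip -t exits non-zero when an archive is truncated or corrupt.
check_gzip_files() {
    dir="$1"
    status=0
    for f in "$dir"/*.fastq.gz
    do
        [ -e "$f" ] || continue   # no matching files: skip the literal pattern
        if gzip -t "$f" 2> /dev/null
        then echo "ok      $f"
        else echo "BROKEN  $f"; status=1
        fi
    done
    return $status
}

check_gzip_files ./data/fullreads
```

If a file is reported as BROKEN, rerunning the matching wget line should replace it.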

If you want the calculation steps to be fast (recommended), execute the cells below to make subsets of 1 million reads per file. Each read takes up four lines in a FASTQ file, so this is 4 million FASTQ lines.

In [ ]:
mkdir -p ./data/reads/
for f in ./data/fullreads/*.fastq.gz
do  # strip the folder and the .fastq.gz suffix, e.g. P1.R1
    name=$(basename "$f" .fastq.gz)
    echo "subsetting $name"
    # keep the first 4,000,000 lines = 1 million reads
    zcat "$f" 2> /dev/null | head -n 4000000 | gzip -c > "./data/reads/$name.fastq.gz"
done

Check that the subset files were created correctly, then remove the ./data/fullreads folder

In [ ]:
ls -sh ./data/reads
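File sizes only give a rough impression. For a stricter check before removing anything, the sketch below counts the reads in each subset by dividing the number of lines by four. It assumes every read takes up exactly four lines (no wrapped sequences); `count_reads` is a name made up for this example.

```shell
# Hypothetical helper: number of reads in a gzipped FASTQ file,
# assuming exactly four lines per read.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

for f in ./data/reads/*.fastq.gz
do
    [ -e "$f" ] || continue   # no matching files: skip the literal pattern
    echo "$f: $(count_reads "$f") reads"
done
```

Each subset should report 1000000 reads, or fewer if the original file held less than a million. The R1 and R2 files of the same sample should report the same number.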
In [ ]:
rm -rf ./data/fullreads

If speed is not a concern and you would rather work with the full-size files (go you!), don't execute the cells above, or at least don't remove anything. Instead, use the FASTQ files in the ./data/fullreads folder whenever the practical points you to the ./data/reads folder. That's all.
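One possible shortcut, not part of the original practical, is to make ./data/reads a symbolic link to the full-size folder, so that paths used later resolve to the full files without editing them. The sketch below only creates the link when ./data/reads does not exist yet; `link_full_reads` is a name made up for this example.

```shell
# Hypothetical helper: make <data_dir>/reads point at <data_dir>/fullreads.
# Does nothing if fullreads is missing or if reads already exists.
link_full_reads() {
    data_dir="$1"
    if [ -d "$data_dir/fullreads" ] && [ ! -e "$data_dir/reads" ]
    then
        # the link target is relative to the folder that holds the link
        ln -s fullreads "$data_dir/reads"
    fi
}

link_full_reads ./data
```

If you already made the subsets, ./data/reads exists and the function leaves it alone.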