genomewalker
7/8/2016 - 8:46 AM

meta-sourcetracker

# First, download the files.
# As an example we use ERP002469 in /bioinf/projects/megx/meta_sourcetrack/
# and the file ERP002469.txt downloaded from ENA. The script assumes that the file
# is tab-delimited (I think this is the default) and has the following fields. Example:
#PRJEB1786       ERP002469       SAMEA1906452    ERS235496       ERX234720       ERR260132       9606    Homo sapiens    Illumina HiSeq 2000     PAIRED  ftp.sra.ebi.ac.uk/vol1/fastq/ERR260/ERR260132/ERR260132_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR260/ERR260132/ERR260132_2.fastq.gz   ftp.sra.ebi.ac.uk/vol1/fastq/ERR260/ERR260132/ERR260132_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR260/ERR260132/ERR260132_2.fastq.gz  ftp.sra.ebi.ac.uk/vol1/ERA206/ERA206883/fastq/NG-5636_304_1_sequence.fastq.gz;ftp.sra.ebi.ac.uk/vol1/ERA206/ERA206883/fastq/NG-5636_304_2_sequence.fastq.gz     ftp.sra.ebi.ac.uk/vol1/ERA206/ERA206883/fastq/NG-5636_304_1_sequence.fastq.gz;ftp.sra.ebi.ac.uk/vol1/ERA206/ERA206883/fastq/NG-5636_304_2_sequence.fastq.gz

# We want field number 11, which holds the ;-separated FASTQ links
tail -n+2 ERP002469.txt | cut -f 11 | tr ';' $'\n' | awk '{print "http://"$0}' | grep _ > links_ena.txt

# The file links_ena.txt then contains one link per line:
#http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR260/ERR260132/ERR260132_1.fastq.gz
#http://ftp.sra.ebi.ac.uk/vol1/fastq/ERR260/ERR260132/ERR260132_2.fastq.gz
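
# A possible safeguard, as a sketch only: instead of trusting that the links
# sit in field 11, look the column up by its header name. This assumes that the
# first line of ERP002469.txt is a header row and that the column is called
# fastq_ftp (an assumption, check your own report). extract_links is a
# hypothetical helper, not part of the meta_sourcetracker scripts.

```shell
# extract_links FILE [COLUMN]: print one link per line from a tab-delimited report.
# COLUMN defaults to fastq_ftp, which is an assumed header name.
extract_links() {
    awk -F '\t' -v col="${2:-fastq_ftp}" '
        NR == 1 {
            # Find the index of the requested column in the header row
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column " col " not found" > "/dev/stderr"; exit 1 }
            next
        }
        {
            # Paired runs list both files separated by ";"
            n = split($idx, links, ";")
            for (j = 1; j <= n; j++) if (links[j] != "") print "http://" links[j]
        }
    ' "$1"
}
```

# Possible usage: extract_links ERP002469.txt | grep _ > links_ena.txt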

# Then we can use wget, curl or aria2 to download the files listed in links_ena.txt
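# For instance, a minimal sketch with wget driven by xargs. The 4 parallel
# downloads and the target folder are assumptions, and download_links is a
# hypothetical helper; the WGET variable is only a hook to swap in another binary.

```shell
# download_links FILE DIR: fetch every URL listed in FILE into DIR.
download_links() {
    mkdir -p "$2"
    # -c resumes partial files, -nv keeps the log short, -P sets the target folder.
    # xargs starts one wget per URL and runs up to 4 of them at once.
    xargs -n 1 -P 4 "${WGET:-wget}" -c -nv -P "$2" < "$1"
}
```

# Possible usage: download_links links_ena.txt ERP002469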

# Once we have the files, we execute ./scripts/meta_sourcetracker/sge_runner.sh
# from the folder /bioinf/projects/megx/meta_sourcetrack, where ERP002469 is the folder
# we want to crunch

./scripts/meta_sourcetracker/sge_runner.sh ERP002469

# The script distributes the jobs across the cluster, for now 5 simultaneous jobs due to our space restrictions

# Interesting results are in:

# ERR260132_kaiju_report.txt, which reports at genus level
#     %       reads       genus
# ---------------------------------
# 15.79     1038842       Bacteroides
#  7.66      504018       Eubacterium
#  4.73      311178       Roseburia
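
# To pull the dominant genera out of that report, a small sketch. It assumes
# the layout shown above (one header line, one dashed line, then percent, reads
# and genus); top_genera and the default 5% cut-off are hypothetical.

```shell
# top_genera FILE [MIN_PERCENT]: print "genus<TAB>percent" for rows at or above the cut-off.
top_genera() {
    awk -v min="${2:-5}" '
        NR <= 2 { next }    # skip the header and the dashed line
        $1 + 0 >= min {
            pct = $1
            # Blank the first two columns so that multi-word names stay intact
            $1 = ""; $2 = ""
            sub(/^ +/, "")
            print $0 "\t" pct
        }
    ' "$1"
}
```

# Possible usage: top_genera ERR260132_kaiju_report.txt 5
# With the three rows above this keeps Bacteroides and Eubacterium.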

# and ERR260132_kaiju_domain.txt, which summarises the classified/unclassified read counts and the counts at domain level
# Classified:5633122
# UnClassified:944306

# Viruses:4266
# Archaea:5799
# Bacteria:5590819
# Eukaryota:16645
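
# A quick sanity check on those numbers, as a sketch: the share of reads that
# were classified. It assumes the Name:count layout shown above;
# classified_fraction is a hypothetical helper.

```shell
# classified_fraction FILE: print the percentage of classified reads.
classified_fraction() {
    awk -F ':' '
        $1 == "Classified"   { c = $2 }
        $1 == "UnClassified" { u = $2 }
        END {
            if (c + u == 0) { print "no read counts found" > "/dev/stderr"; exit 1 }
            printf "%.2f\n", 100 * c / (c + u)
        }
    ' "$1"
}
```

# Possible usage: classified_fraction ERR260132_kaiju_domain.txt
# For the counts above this gives 85.64 (5633122 of 6577428 reads).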