# A sourmash tutorial This tutorial should run without modification on Ubuntu 15.10 ("wily"); it was developed on AWS ami-05384865 (us-west region). You'll need about 15 GB of free disk space to download the database, and about 1-2 GB of RAM. The tutorial should take about 20 minutes total to run. ## Installing sourmash To install sourmash, run: ``` sudo apt-get -y update && \ sudo apt-get install -y python3.5-dev python3.5-venv make \ libc6-dev g++ zlib1g-dev ``` this installs Python 3.5. Now, create a local software install and populate it with Jupyter and other dependencies: ``` python3.5 -m venv ~/py3 . ~/py3/bin/activate pip install -U pip pip install -U Cython pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn pip install https://github.com/dib-lab/khmer/archive/master.zip pip install https://github.com/dib-lab/sourmash/archive/spacegraphcats.zip ``` ## Generate a signature for Illumina reads Download some reads and a reference genome: ``` mkdir ~/data cd ~/data wget http://public.ged.msu.edu.s3.amazonaws.com/ecoli_ref-5m-trim.pe.fq.gz wget https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz ``` Compute a scaled MinHash signature from our reads: ``` mkdir ~/sourmash cd ~/sourmash sourmash compute --scaled 10000 ~/data/ecoli_ref*pe*.fq.gz -o ecoli-reads.sig -k 31 ``` ## Compare reads to assemblies Use case: how much of the read content is contained in the reference genome? Build a signature for an E. coli genome: ``` sourmash compute --scaled 10000 -k 31 ~/data/ecoliMG1655.fa.gz -o ecoli-genome.sig ``` and now evaluate *containment*, that is, what fraction of the read content is contained in the genome: ``` sourmash search -k 31 ecoli-reads.sig ecoli-genome.sig --containment ``` and you should see: ``` # running sourmash subcommand: search loaded query: /home/ubuntu/data/ecoli_ref-5m... (k=31, DNA) loading db of signatures from 1 files loaded 1 signatures total. 1 matches: /home/ubuntu/data/ecoliMG1655.fa.gz 0.466 ecoli-genome.sig ``` Try the reverse - why is it bigger? ``` sourmash search -k 31 ecoli-genome.sig ecoli-reads.sig --containment ``` ## Make and search a database quickly. Suppose that we have a collection of signatures (made with `sourmash compute` as above) and we want to search it with our newly assembled genome (or the reads, even!). How would we do that? Let's grab a sample collection of 50 E. coli genomes and unpack it -- ``` mkdir ecoli_many_sigs cd ecoli_many_sigs curl -O -L https://github.com/dib-lab/sourmash/raw/update/doc_sbts/data/eschericia-sigs.tar.gz tar xzf eschericia-sigs.tar.gz rm eschericia-sigs.tar.gz cd ../ ``` This will produce 50 files named `ecoli-N.sig` in the `ecoli_many_sigs` -- ``` ls ecoli_many_sigs ``` Let's turn this into an easily-searchable database with `sourmash sbt_index` -- ``` sourmash sbt_index -k 31 ecolidb ecoli_many_sigs/*.sig ``` and now we can search! ``` sourmash sbt_search ecolidb.sbt.json ecoli-genome.sig | head ``` You should see output like this: ``` # running sourmash subcommand: sbt_search loaded query: /home/ubuntu/data/ecoliMG1655.... (k=31, DNA) similarity match ---------- ----- 88.4% NZ_GG774190.1 Escherichia coli MS 196-1 Scfld2538, whole genome shotgun sequence 87.8% NZ_JMGW01000001.1 Escherichia coli 1-176-05_S4_C2 e117605S4C2.contig.0_1, whole genome shotgun sequence 86.6% NZ_JMGU01000001.1 Escherichia coli 2-011-08_S3_C2 e201108S3C2.contig.0_1, whole genome shotgun sequence ``` ## Compare many signatures and build a tree. Adjust plotting (this is a bug in sourmash :) -- ``` echo 'backend : Agg' > matplotlibrc ``` Compare all the things: ``` sourmash compare ecoli_many_sigs/* -o ecoli_cmp ``` and then plot: ``` sourmash plot --pdf --labels ecoli_cmp ``` which will produce a file `ecoli_cmp.matrix.pdf` and `ecoli_cmp.dendro.pdf` which you can then download via your file browser and view on your local computer. Here's a PNG version: ![E. coli comparison plot](_static/ecoli_cmp.matrix.png) ## What's in my metagenome? Download and unpack a newer version of the k=31 RefSeq index described in [CTB's blog post](http://ivory.idyll.org/blog/2016-sourmash-sbt-more.html) -- this one contains sketches of all 100k Genbank microbes. (See [available databases](databases.html) for more information.) ``` curl -O https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz tar xzf microbe-genbank-sbt-k31-2017.05.09.tar.gz ``` This produces a file `genbank-k31.sbt.json` and a whole bunch of hidden files in the directory `.sbt.genbank-k31`. Next, run the 'gather' command to see what's in your ecoli genome -- ``` sourmash sbt_gather -k 31 genbank-k31.sbt.json ecoli-genome.sig ``` and you should get: ``` # running sourmash subcommand: sbt_gather loaded query: /home/ubuntu/data/ecoliMG1655.... (k=31, DNA) overlap p_query p_match --------- ------- -------- 4.9 Mbp 100.0% 99.8% CP011320.1 Escherichia coli strain SQ37, found 1 matches total; the recovered matches hit 100.0% of the query ``` In this case, the output is kind of boring because this is a single genome. But! You can use this on metagenomes (assembled and unassembled) as well; you've just got to make the signature files. To see this in action, here is gather running on a signature generated from some sequences that assemble (but don't align to known genomes) from the [Shakya et al. 2013 mock metagenome paper](https://www.ncbi.nlm.nih.gov/pubmed/23387867). ``` wget https://github.com/dib-lab/sourmash/raw/update/doc_sbts/doc/_static/shakya-unaligned-contigs.sig sourmash sbt_gather -k 31 genbank-k31.sbt.json shakya-unaligned-contigs.sig ``` This should yield: ``` # running sourmash subcommand: sbt_gather loaded query: mqc500.QC.AMBIGUOUS.99.unalign... (k=31, DNA) overlap p_query p_match --------- ------- -------- 1.4 Mbp 11.0% 58.0% JANA01000001.1 Fusobacterium sp. OBRC1 c 1.0 Mbp 7.7% 25.9% CP001957.1 Haloferax volcanii DS2 plasmi 0.9 Mbp 7.5% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, comp 0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum ruminis 0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris subsp. 0.6 Mbp 4.9% 11.1% CP001252.1 Shewanella baltica OS223, com 0.6 Mbp 4.8% 27.3% AP008226.1 Thermus thermophilus HB8 geno 0.6 Mbp 4.4% 11.2% CP000031.2 Ruegeria pomeroyi DSS-3, comp 480.0 kbp 3.8% 7.6% CP000875.1 Herpetosiphon aurantiacus DSM 410.0 kbp 3.3% 10.5% CH959317.1 Sulfitobacter sp. NAS-14.1 sc 1.4 Mbp 10.9% 11.8% LN831027.1 Fusobacterium nucleatum subsp 0.5 Mbp 4.1% 5.3% CP000753.1 Shewanella baltica OS185, com 420.0 kbp 3.3% 7.7% FNDZ01000023.1 Proteiniclasticum ruminis 150.0 kbp 1.2% 4.5% CP015081.1 Deinococcus radiodurans R1 ch 150.0 kbp 1.2% 8.2% CP000969.1 Thermotoga sp. RQ2, complete 290.0 kbp 2.3% 4.1% CH959311.1 Sulfitobacter sp. EE-36 scf_1 1.2 Mbp 9.4% 5.0% CP013328.1 Fusobacterium nucleatum subsp 110.0 kbp 0.9% 3.5% FREL01000833.1 Enterococcus faecalis iso 0.6 Mbp 5.0% 2.8% CP000527.1 Desulfovibrio vulgaris DP4, c 340.0 kbp 2.7% 3.3% KQ235732.1 Fusobacterium nucleatum subsp 70.0 kbp 0.6% 1.2% CP000850.1 Salinispora arenicola CNS-205 60.0 kbp 0.5% 0.7% CP000270.1 Burkholderia xenovorans LB400 50.0 kbp 0.4% 2.6% CP001080.1 Sulfurihydrogenibium sp. YO3A 50.0 kbp 0.4% 3.2% L77117.1 Methanocaldococcus jannaschii D found less than 40.0 kbp in common. => exiting found 24 matches total; the recovered matches hit 73.4% of the query ``` It is straightforward to build your own databases for use with `sbt_search` and `sbt_gather`; ping us if you want us to write that up. [Return to index](index.html)