sourmash: working with private collections of signatures

Running this notebook.

You can run this notebook interactively via mybinder by clicking the Binder button.

A rendered version of this notebook is available at sourmash.readthedocs.io under “Tutorials and notebooks”.

You can also get this notebook from the doc/ subdirectory of the sourmash GitHub repository. See binder/environment.yaml for installation dependencies.

What is this?

This is a Jupyter Notebook using Python 3. If you are running this via binder, you can use Shift-ENTER to run cells, and double-click on code cells to edit them.

Contact: C. Titus Brown, ctbrown@ucdavis.edu. Please file issues on GitHub if you have any questions or comments!

download a bunch of genomes

[1]:
!mkdir -p big_genomes
!curl -L https://osf.io/8uxj9/?action=download | (cd big_genomes && tar xzf -)
/Users/t/dev/sourmash/doc/big_genomes
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   459  100   459    0     0    750      0 --:--:-- --:--:-- --:--:--   750
100 61.1M  100 61.1M    0     0  2966k      0  0:00:21  0:00:21 --:--:-- 3496k

compute signatures for each file

[2]:
!cd big_genomes/ && sourmash compute -k 31 --scaled=1000 --name-from-first *.fa
/Users/t/dev/sourmash/doc/big_genomes
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: 0.fa, 1.fa, 10.fa, 11.fa, 12.fa, 13.fa, 14.fa, 15.fa, 16.fa, 17.fa, 18.fa, 19.fa, 2.fa, 20.fa, 21.fa, 22.fa, 23.fa, 24.fa, 25.fa, 26.fa, 27.fa, 28.fa, 29.fa, 3.fa, 30.fa, 31.fa, 32.fa, 33.fa, 34.fa, 35.fa, 36.fa, 37.fa, 38.fa, 39.fa, 4.fa, 40.fa, 41.fa, 42.fa, 43.fa, 44.fa, 45.fa, 46.fa, 47.fa, 48.fa, 49.fa, 5.fa, 50.fa, 51.fa, 52.fa, 53.fa, 54.fa, 55.fa, 56.fa, 57.fa, 58.fa, 59.fa, 6.fa, 60.fa, 61.fa, 62.fa, 63.fa, 7.fa, 8.fa, 9.fa
Computing signature for ksizes: [31]
Computing only nucleotide (and not protein) signatures.
Computing a total of 1 signature(s).
... reading sequences from 0.fa
calculated 1 signatures for 1 sequences in 0.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 1.fa
calculated 1 signatures for 1 sequences in 1.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 10.fa
calculated 1 signatures for 1 sequences in 10.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 11.fa
calculated 1 signatures for 1 sequences in 11.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 12.fa
calculated 1 signatures for 1 sequences in 12.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 13.fa
calculated 1 signatures for 1 sequences in 13.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 14.fa
calculated 1 signatures for 1 sequences in 14.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 15.fa
calculated 1 signatures for 1 sequences in 15.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 16.fa
calculated 1 signatures for 4 sequences in 16.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 17.fa
calculated 1 signatures for 2 sequences in 17.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 18.fa
calculated 1 signatures for 1 sequences in 18.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 19.fa
calculated 1 signatures for 9 sequences in 19.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 2.fa
calculated 1 signatures for 1 sequences in 2.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 20.fa
calculated 1 signatures for 1 sequences in 20.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 21.fa
calculated 1 signatures for 1 sequences in 21.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 22.fa
calculated 1 signatures for 1 sequences in 22.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 23.fa
calculated 1 signatures for 5 sequences in 23.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 24.fa
calculated 1 signatures for 3 sequences in 24.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 25.fa
calculated 1 signatures for 1 sequences in 25.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 26.fa
calculated 1 signatures for 1 sequences in 26.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 27.fa
calculated 1 signatures for 1 sequences in 27.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 28.fa
calculated 1 signatures for 3 sequences in 28.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 29.fa
calculated 1 signatures for 1 sequences in 29.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 3.fa
calculated 1 signatures for 1 sequences in 3.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 30.fa
calculated 1 signatures for 1 sequences in 30.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 31.fa
calculated 1 signatures for 1 sequences in 31.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 32.fa
calculated 1 signatures for 1 sequences in 32.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 33.fa
calculated 1 signatures for 1 sequences in 33.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 34.fa
calculated 1 signatures for 1 sequences in 34.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 35.fa
calculated 1 signatures for 7 sequences in 35.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 36.fa
calculated 1 signatures for 1 sequences in 36.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 37.fa
calculated 1 signatures for 1 sequences in 37.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 38.fa
calculated 1 signatures for 1 sequences in 38.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 39.fa
calculated 1 signatures for 1 sequences in 39.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 4.fa
calculated 1 signatures for 1 sequences in 4.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 40.fa
calculated 1 signatures for 1 sequences in 40.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 41.fa
calculated 1 signatures for 1 sequences in 41.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 42.fa
calculated 1 signatures for 1 sequences in 42.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 43.fa
calculated 1 signatures for 1 sequences in 43.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 44.fa
calculated 1 signatures for 2 sequences in 44.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 45.fa
calculated 1 signatures for 1 sequences in 45.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 46.fa
calculated 1 signatures for 1 sequences in 46.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 47.fa
calculated 1 signatures for 2 sequences in 47.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 48.fa
calculated 1 signatures for 1 sequences in 48.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 49.fa
calculated 1 signatures for 228 sequences in 49.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 5.fa
calculated 1 signatures for 1 sequences in 5.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 50.fa
calculated 1 signatures for 1 sequences in 50.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 51.fa
calculated 1 signatures for 1 sequences in 51.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 52.fa
calculated 1 signatures for 1 sequences in 52.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 53.fa
calculated 1 signatures for 1 sequences in 53.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 54.fa
calculated 1 signatures for 1 sequences in 54.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 55.fa
calculated 1 signatures for 1 sequences in 55.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 56.fa
calculated 1 signatures for 1 sequences in 56.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 57.fa
calculated 1 signatures for 1 sequences in 57.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 58.fa
calculated 1 signatures for 30 sequences in 58.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 59.fa
calculated 1 signatures for 5 sequences in 59.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 6.fa
calculated 1 signatures for 76 sequences in 6.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 60.fa
calculated 1 signatures for 11 sequences in 60.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 61.fa
calculated 1 signatures for 47 sequences in 61.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 62.fa
calculated 1 signatures for 1 sequences in 62.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 63.fa
calculated 1 signatures for 4 sequences in 63.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 7.fa
calculated 1 signatures for 3 sequences in 7.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 8.fa
calculated 1 signatures for 1 sequences in 8.fa
saved 1 signature(s). Note: signature license is CC0.
... reading sequences from 9.fa
calculated 1 signatures for 3 sequences in 9.fa
saved 1 signature(s). Note: signature license is CC0.
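
The --scaled=1000 option asks sourmash to keep roughly 1 in 1000 of the distinct k-mer hashes instead of a fixed number of hashes, which is why the log reports "setting num_hashes to 0". Here is a minimal sketch of that idea. It is not sourmash's implementation: sourmash hashes k-mers with MurmurHash3, while this illustration substitutes hashlib.blake2b, and the function names are hypothetical.

```python
import hashlib

# translation table used to build reverse complements
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer[::-1].translate(COMPLEMENT))

def hash64(kmer):
    """64-bit hash of a k-mer (a stand-in for the MurmurHash3 used by sourmash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def scaled_sketch(seq, ksize=31, scaled=1000):
    """Keep only the k-mer hashes that fall below 2**64 / scaled."""
    max_hash = 2**64 // scaled
    kept = set()
    for i in range(len(seq) - ksize + 1):
        h = hash64(canonical(seq[i:i + ksize]))
        if h < max_hash:
            kept.add(h)
    return kept

# toy sequence and a small k, only to show the call
seq = "ACGTTGCAAGGCTTAACCGGTTAACGTAGCTAGCTAGGATCCGATCGATTACG"
print(len(scaled_sketch(seq, ksize=5, scaled=1)), "hashes kept with scaled=1")
```

Because the cutoff depends only on the hash value, two sketches built with the same scaled value keep the same hashes for shared k-mers, so they can be compared directly.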

Compare them all

[3]:
!sourmash compare big_genomes/*.sig -o compare_all.mat
!sourmash plot compare_all.mat
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 64 signatures total.
downsampling to scaled value of 1000

min similarity in matrix: 0.000
saving labels to: compare_all.mat.labels.txt
saving distance matrix to: compare_all.mat
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading comparison matrix from compare_all.mat...
...got 64 x 64 matrix.
loading labels from compare_all.mat.labels.txt
saving histogram of matrix values => compare_all.mat.hist.png
wrote dendrogram to: compare_all.mat.dendro.png
wrote numpy distance matrix to: compare_all.mat.matrix.png
[4]:
from IPython.display import Image
Image(filename='compare_all.mat.matrix.png')
[4]:
[Figure: similarity matrix plot, compare_all.mat.matrix.png]
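
The compare step fills a square matrix (64 x 64 here) with pairwise similarities between signatures. Assuming each signature is treated as a plain set of hashes and the similarity is Jaccard (intersection over union), the computation can be sketched as follows. The function names and the toy sets are illustrative only.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def compare_all(sketches):
    """Return a symmetric matrix (list of lists) of pairwise Jaccard similarities."""
    n = len(sketches)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = jaccard(sketches[i], sketches[j])
    return matrix

# three toy hash sets standing in for signatures
toy = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]
for row in compare_all(toy):
    print(["%.3f" % value for value in row])
```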

make a fast(er) search database for all of them

[5]:
!sourmash index -k 31 all-genomes big_genomes/*.sig
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 64 files into SBT
reading from big_genomes/9.fa.sig (63 signatures so far))
loaded 64 sigs; saving SBT under "all-genomes"
127 of 127 nodes saved
Finished saving nodes, now saving SBT json file.
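
The log reports 127 nodes for 64 signatures, which is what a full binary tree with 64 leaves contains (64 leaves plus 63 internal nodes). The sketch below shows why such a tree makes searching faster. It is a simplification: plain Python sets stand in for the Bloom filters of a Sequence Bloom Tree, and the class and function names are hypothetical.

```python
class Node:
    def __init__(self, hashes, name=None, children=()):
        self.hashes = hashes      # a set standing in for a Bloom filter
        self.name = name          # set only on leaves
        self.children = children

def build_tree(leaves):
    """Pair nodes level by level until a single root remains."""
    level = [Node(set(hashes), name=name) for name, hashes in leaves]
    count = len(level)
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            group = level[i:i + 2]
            union = set().union(*(node.hashes for node in group))
            paired.append(Node(union, children=tuple(group)))
        count += len(paired)
        level = paired
    return level[0], count

def search(node, query, threshold):
    """Yield (name, containment) for leaves sharing >= threshold of the query."""
    shared = len(query & node.hashes) / len(query)
    if shared < threshold:
        return  # no leaf below can share more than the union does
    if not node.children:
        yield node.name, shared
    for child in node.children:
        yield from search(child, query, threshold)

# 64 toy "genomes" with 10 made-up hashes each
leaves = [("genome%d" % i, range(i * 10, i * 10 + 10)) for i in range(64)]
root, n_nodes = build_tree(leaves)
print(n_nodes)
print(list(search(root, {30, 31, 32, 45}, 0.5)))
```

Each internal node holds the union of everything beneath it, so a query that shares too little with a node can skip that whole subtree instead of checking every signature.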

You can now use this database with the search and gather commands.

[6]:
!sourmash search shew_os185.fa.sig all-genomes --threshold=0.001
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: NC_009665.1 Shewanella baltica... (k=31, DNA)
loaded 1 databases.

2 matches:
similarity   match
----------   -----
  9.5%       NC_009665.1 Shewanella baltica OS185, complete genome
  4.4%       NC_011663.1 Shewanella baltica OS223, complete genome
[7]:
# (make fake metagenome again, just in case)
!cat genomes/*.fa > fake-metagenome.fa
!sourmash compute -k 31 --scaled=1000 fake-metagenome.fa
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: fake-metagenome.fa
Computing signature for ksizes: [31]
Computing only nucleotide (and not protein) signatures.
Computing a total of 1 signature(s).
skipping fake-metagenome.fa - already done
[8]:
!sourmash gather fake-metagenome.fa.sig all-genomes
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: fake-metagenome.fa... (k=31, DNA)
loaded 1 databases.


overlap     p_query p_match
---------   ------- -------
0.5 Mbp       42.2%   10.5%    NC_011663.1 Shewanella baltica OS223,...
499.0 kbp     38.4%   18.5%    CP001071.1 Akkermansia muciniphila AT...
0.5 Mbp       19.4%    4.9%    NC_009665.1 Shewanella baltica OS185,...

found 3 matches total;
the recovered matches hit 100.0% of the query
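
The gather output above reads like a greedy decomposition: take the database entry that overlaps the query most, set the shared hashes aside, and repeat until nothing is left to explain. A minimal sketch under that assumption follows. The column meanings are my reading of the output rather than the exact sourmash definitions, and the names and toy data are hypothetical.

```python
def gather(query, database, scaled=1000):
    """Greedy decomposition of a query hash set.

    Yields (name, overlap_bp, p_query, p_match) for each match in turn.
    """
    remaining = set(query)
    while remaining and database:
        # pick the entry sharing the most not-yet-explained hashes
        name, hashes = max(database.items(),
                           key=lambda item: len(remaining & item[1]))
        found = remaining & hashes
        if not found:
            break
        yield (name,
               len(found) * scaled,       # rough overlap in base pairs
               len(found) / len(query),   # share of the query explained here
               len(found) / len(hashes))  # share of the match that was found
        remaining -= found

# toy data: the query is fully covered by A and B; C is unrelated
query = set(range(100))
database = {
    "A": set(range(0, 70)) | set(range(1000, 1030)),
    "B": set(range(40, 100)),
    "C": set(range(200, 300)),
}
for result in gather(query, database):
    print(result)
```

Since each round removes the hashes it explained, the p_query values of successive matches never double-count and add up to the fraction of the query recovered.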

build a database with taxonomic information

For this, we need to provide a metadata file that maps each accession to its taxonomic lineage.

[9]:
import pandas
df = pandas.read_csv('podar-lineage.csv')
df
[9]:
accession taxid superkingdom phylum class order family genus species strain
0 AE000782 224325 Archaea Euryarchaeota Archaeoglobi Archaeoglobales Archaeoglobaceae Archaeoglobus Archaeoglobus fulgidus Archaeoglobus fulgidus DSM 4304
1 NC_000909 243232 Archaea Euryarchaeota Methanococci Methanococcales Methanocaldococcaceae Methanocaldococcus Methanocaldococcus jannaschii Methanocaldococcus jannaschii DSM 2661
2 NC_003272 103690 Bacteria Cyanobacteria NaN Nostocales Nostocaceae Nostoc Nostoc sp. PCC 7120 NaN
3 AE009441 178306 Archaea Crenarchaeota Thermoprotei Thermoproteales Thermoproteaceae Pyrobaculum Pyrobaculum aerophilum Pyrobaculum aerophilum str. IM2
4 AE009950 186497 Archaea Euryarchaeota Thermococci Thermococcales Thermococcaceae Pyrococcus Pyrococcus furiosus Pyrococcus furiosus DSM 3638
5 AE009951 190304 Bacteria Fusobacteria Fusobacteriia Fusobacteriales Fusobacteriaceae Fusobacterium Fusobacterium nucleatum NaN
6 AE010299 188937 Archaea Euryarchaeota Methanomicrobia Methanosarcinales Methanosarcinaceae Methanosarcina Methanosarcina acetivorans Methanosarcina acetivorans C2A
7 AE009439 190192 Archaea Euryarchaeota Methanopyri Methanopyrales Methanopyraceae Methanopyrus Methanopyrus kandleri Methanopyrus kandleri AV19
8 NC_003911 246200 Bacteria Proteobacteria Alphaproteobacteria Rhodobacterales Rhodobacteraceae Ruegeria Ruegeria pomeroyi Ruegeria pomeroyi DSS-3
9 AE006470 194439 Bacteria Chlorobi Chlorobia Chlorobiales Chlorobiaceae Chlorobaculum Chlorobaculum tepidum Chlorobaculum tepidum TLS
10 AE015928 226186 Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides Bacteroides thetaiotaomicron Bacteroides thetaiotaomicron VPI-5482
11 AL954747 228410 Bacteria Proteobacteria Betaproteobacteria Nitrosomonadales Nitrosomonadaceae Nitrosomonas Nitrosomonas europaea Nitrosomonas europaea ATCC 19718
12 BX119912 243090 Bacteria Planctomycetes Planctomycetia Planctomycetales Planctomycetaceae Rhodopirellula Rhodopirellula baltica Rhodopirellula baltica SH 1
13 BX571656 273121 Bacteria Proteobacteria Epsilonproteobacteria Campylobacterales Helicobacteraceae Wolinella Wolinella succinogenes Wolinella succinogenes DSM 1740
14 AE017180 243231 Bacteria Proteobacteria Deltaproteobacteria Desulfuromonadales Geobacteraceae Geobacter Geobacter sulfurreducens Geobacter sulfurreducens PCA
15 AE017226 243275 Bacteria Spirochaetes Spirochaetia Spirochaetales Spirochaetaceae Treponema Treponema denticola Treponema denticola ATCC 35405
16 BX950229 267377 Archaea Euryarchaeota Methanococci Methanococcales Methanococcaceae Methanococcus Methanococcus maripaludis Methanococcus maripaludis S2
17 AE017221 262724 Bacteria Deinococcus-Thermus Deinococci Thermales Thermaceae Thermus Thermus thermophilus Thermus thermophilus HB27
18 BA000001 70601 Archaea Euryarchaeota Thermococci Thermococcales Thermococcaceae Pyrococcus Pyrococcus horikoshii Pyrococcus horikoshii OT3
19 BA000023 273063 Archaea Crenarchaeota Thermoprotei Sulfolobales Sulfolobaceae Sulfolobus Sulfolobus tokodaii Sulfolobus tokodaii str. 7
20 NC_007951 266265 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Paraburkholderia Paraburkholderia xenovorans Paraburkholderia xenovorans LB400
21 CP000492 290317 Bacteria Chlorobi Chlorobia Chlorobiales Chlorobiaceae Chlorobium Chlorobium phaeobacteroides Chlorobium phaeobacteroides DSM 266
22 NC_008751 391774 Bacteria Proteobacteria Deltaproteobacteria Desulfovibrionales Desulfovibrionaceae Desulfovibrio Desulfovibrio vulgaris Desulfovibrio vulgaris DP4
23 CP000568 203119 Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae Ruminiclostridium Ruminiclostridium thermocellum Ruminiclostridium thermocellum ATCC 27405
24 CP000561 410359 Archaea Crenarchaeota Thermoprotei Thermoproteales Thermoproteaceae Pyrobaculum Pyrobaculum calidifontis Pyrobaculum calidifontis JCM 11548
25 CP000609 402880 Archaea Euryarchaeota Methanococci Methanococcales Methanococcaceae Methanococcus Methanococcus maripaludis Methanococcus maripaludis C5
26 CP000607 290318 Bacteria Chlorobi Chlorobia Chlorobiales Chlorobiaceae Chlorobium Chlorobium phaeovibrioides Chlorobium phaeovibrioides DSM 265
27 CP000660 340102 Archaea Crenarchaeota Thermoprotei Thermoproteales Thermoproteaceae Pyrobaculum Pyrobaculum arsenaticum Pyrobaculum arsenaticum DSM 13514
28 CP000667 369723 Bacteria Actinobacteria Actinobacteria Micromonosporales Micromonosporaceae Salinispora Salinispora tropica Salinispora tropica CNB-440
29 CP000679 351627 Bacteria Firmicutes Clostridia Thermoanaerobacterales Thermoanaerobacterales Family III. Incertae Sedis Caldicellulosiruptor Caldicellulosiruptor saccharolyticus Caldicellulosiruptor saccharolyticus DSM 8903
... ... ... ... ... ... ... ... ... ... ...
34 CP000850 391037 Bacteria Actinobacteria Actinobacteria Micromonosporales Micromonosporaceae Salinispora Salinispora arenicola Salinispora arenicola CNS-205
35 CP000909 324602 Bacteria Chloroflexi Chloroflexia Chloroflexales Chloroflexaceae Chloroflexus Chloroflexus aurantiacus Chloroflexus aurantiacus J-10-fl
36 CP000924 340099 Bacteria Firmicutes Clostridia Thermoanaerobacterales Thermoanaerobacteraceae Thermoanaerobacter Thermoanaerobacter pseudethanolicus Thermoanaerobacter pseudethanolicus ATCC 33223
37 CP000969 126740 Bacteria Thermotogae Thermotogae Thermotogales Thermotogaceae Thermotoga Thermotoga sp. RQ2 NaN
38 CP001013 395495 Bacteria Proteobacteria Betaproteobacteria Burkholderiales NaN Leptothrix Leptothrix cholodnii Leptothrix cholodnii SP-6
39 CP001071 349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835
40 AP009380 431947 Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae Porphyromonas Porphyromonas gingivalis Porphyromonas gingivalis ATCC 33277
41 NC_010730 436114 Bacteria Aquificae Aquificae Aquificales Hydrogenothermaceae Sulfurihydrogenibium Sulfurihydrogenibium sp. YO3AOP1 NaN
42 CP001097 290315 Bacteria Chlorobi Chlorobia Chlorobiales Chlorobiaceae Chlorobium Chlorobium limicola Chlorobium limicola DSM 245
43 CP001110 324925 Bacteria Chlorobi Chlorobia Chlorobiales Chlorobiaceae Pelodictyon Pelodictyon phaeoclathratiforme Pelodictyon phaeoclathratiforme BU-1
44 CP001130 380749 Bacteria Aquificae Aquificae Aquificales Aquificaceae Hydrogenobaculum Hydrogenobaculum sp. Y04AAS1 NaN
45 NZ_CH959311 52598 Bacteria Proteobacteria Alphaproteobacteria Rhodobacterales Rhodobacteraceae Sulfitobacter Sulfitobacter sp. EE-36 NaN
46 NZ_CH959317 314267 Bacteria Proteobacteria Alphaproteobacteria Rhodobacterales Rhodobacteraceae Sulfitobacter Sulfitobacter sp. NAS-14.1 NaN
47 CP001251 515635 Bacteria Dictyoglomi Dictyoglomia Dictyoglomales Dictyoglomaceae Dictyoglomus Dictyoglomus turgidum Dictyoglomus turgidum DSM 6724
48 NC_011663 407976 Bacteria Proteobacteria Gammaproteobacteria Alteromonadales Shewanellaceae Shewanella Shewanella baltica Shewanella baltica OS223
49 CP000916 309803 Bacteria Thermotogae Thermotogae Thermotogales Thermotogaceae Thermotoga Thermotoga neapolitana Thermotoga neapolitana DSM 4359
50 NZ_DS996397 411464 Bacteria Proteobacteria Deltaproteobacteria Desulfovibrionales Desulfovibrionaceae Desulfovibrio Desulfovibrio piger Desulfovibrio piger ATCC 29098
51 CP001230 123214 Bacteria Aquificae Aquificae Aquificales Hydrogenothermaceae Persephonella Persephonella marina Persephonella marina EX-H1
52 CP001472 240015 Bacteria Acidobacteria Acidobacteriia Acidobacteriales Acidobacteriaceae Acidobacterium Acidobacterium capsulatum Acidobacterium capsulatum ATCC 51196
53 AP009153 379066 Bacteria Gemmatimonadetes Gemmatimonadetes Gemmatimonadales Gemmatimonadaceae Gemmatimonas Gemmatimonas aurantiaca Gemmatimonas aurantiaca T-27
54 CP001941 439481 Archaea Euryarchaeota NaN NaN NaN Aciduliprofundum Aciduliprofundum boonei Aciduliprofundum boonei T469
55 NC_013968 309800 Archaea Euryarchaeota Halobacteria Haloferacales Haloferacaceae Haloferax Haloferax volcanii Haloferax volcanii DS2
56 NZ_KE136524 226185 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis Enterococcus faecalis V583
57 NZ_KQ961402 542 Bacteria Proteobacteria Alphaproteobacteria Sphingomonadales Sphingomonadaceae Zymomonas Zymomonas mobilis NaN
58 NZ_CP015081 243230 Bacteria Deinococcus-Thermus Deinococci Deinococcales Deinococcaceae Deinococcus Deinococcus radiodurans Deinococcus radiodurans R1
59 NZ_ABZS01000228 432331 Bacteria Aquificae Aquificae Aquificales Hydrogenothermaceae Sulfurihydrogenibium Sulfurihydrogenibium yellowstonense Sulfurihydrogenibium yellowstonense SS-5
60 NZ_JGWU01000001 1458259 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Alcaligenaceae Bordetella Bordetella bronchiseptica Bordetella bronchiseptica D989
61 NZ_FWDH01000003 31899 Bacteria Firmicutes Clostridia Thermoanaerobacterales Thermoanaerobacterales Family III. Incertae Sedis Caldicellulosiruptor Caldicellulosiruptor bescii NaN
62 NC_009972 316274 Bacteria Chloroflexi Chloroflexia Herpetosiphonales Herpetosiphonaceae Herpetosiphon Herpetosiphon aurantiacus Herpetosiphon aurantiacus DSM 785
63 NC_005213 228908 Archaea Nanoarchaeota NaN Nanoarchaeales Nanoarchaeaceae Nanoarchaeum Nanoarchaeum equitans Nanoarchaeum equitans Kin4-M

64 rows × 10 columns
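
The accession-to-lineage mapping carried by this file can be sketched with the standard csv module. The two rows below are copied from the table above. The comma-separated layout and the helper name load_lineages are assumptions made for illustration.

```python
import csv
import io

RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species", "strain"]

# two rows copied from the table above; a real file would be opened with open()
sample = io.StringIO(
    "accession,taxid,superkingdom,phylum,class,order,family,genus,species,strain\n"
    "AE000782,224325,Archaea,Euryarchaeota,Archaeoglobi,Archaeoglobales,"
    "Archaeoglobaceae,Archaeoglobus,Archaeoglobus fulgidus,"
    "Archaeoglobus fulgidus DSM 4304\n"
    "NC_011663,407976,Bacteria,Proteobacteria,Gammaproteobacteria,"
    "Alteromonadales,Shewanellaceae,Shewanella,Shewanella baltica,"
    "Shewanella baltica OS223\n"
)

def load_lineages(handle):
    """Map each accession to a tuple of (rank, name) pairs, skipping blank ranks."""
    lineages = {}
    for row in csv.DictReader(handle):
        lineages[row["accession"]] = tuple(
            (rank, row[rank]) for rank in RANKS if row.get(rank))
    return lineages

lineages = load_lineages(sample)
print(lineages["NC_011663"][-1])
```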

[10]:
!sourmash lca index podar-lineage.csv taxdb big_genomes/*.sig  -C 3 --split-identifiers
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

examining spreadsheet headers...
** assuming column 'accession' is identifiers in spreadsheet
64 distinct identities in spreadsheet out of 64 rows.
64 distinct lineages in spreadsheet out of 64 rows.
64 assigned lineages out of 64 distinct lineages in spreadsheet. 64)
64 identifiers used out of 64 distinct identifiers in spreadsheet.
saving to LCA DB: taxdb.lca.json
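
The signature names here look like "NC_011663.1 Shewanella baltica OS223, complete genome", while the spreadsheet keys look like "NC_011663". As I understand it, --split-identifiers bridges the two by keeping only the first word of the name and dropping the version suffix. The helper below is a hypothetical illustration of that matching rule, not sourmash code.

```python
def split_identifier(name, keep_version=False):
    """Reduce a signature name to the accession used as the spreadsheet key."""
    ident = name.split(" ")[0]          # keep the first whitespace-separated word
    if not keep_version:
        ident = ident.split(".")[0]     # drop a trailing version such as ".1"
    return ident

print(split_identifier("NC_011663.1 Shewanella baltica OS223, complete genome"))
```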

This database ‘taxdb.lca.json’ can be used for search and gather as above:

[11]:
!sourmash gather fake-metagenome.fa.sig taxdb.lca.json
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: fake-metagenome.fa... (k=31, DNA)
loaded 1 databases.


overlap     p_query p_match
---------   ------- -------
0.6 Mbp       46.7%   11.6%    NC_011663.1 Shewanella baltica OS223,...
0.5 Mbp       38.7%   19.3%    CP001071.1 Akkermansia muciniphila AT...
0.5 Mbp       14.6%    3.9%    NC_009665.1 Shewanella baltica OS185,...

found 3 matches total;
the recovered matches hit 100.0% of the query

…but can also be used for taxonomic summarization:

[12]:
!sourmash lca summarize --query fake-metagenome.fa.sig --db taxdb.lca.json
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
loaded 1 signatures from 1 files total.of 1)
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae
38.7%    53   Bacteria;Verrucomicrobia
100.0%   137   Bacteria
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria
61.3%    84   Bacteria;Proteobacteria
22.6%    31   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223
14.6%    20   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185
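
The counts above are consistent with a lowest-common-ancestor scheme: each query hash is assigned to the deepest lineage shared by every genome that contains it, and is then counted at that rank and every rank above it. A minimal sketch of that bookkeeping follows. The lineages are abbreviated to four ranks, and the toy hash assignment is made up so that it reproduces the 137, 84, 53, 31 and 20 counts shown above.

```python
from collections import Counter

def lowest_common_ancestor(lineages):
    """Longest shared prefix of several lineage tuples."""
    common = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        common.append(names[0])
    return tuple(common)

def summarize(hash_to_lineages):
    """Count hashes at the LCA of each hash and at every rank above it."""
    counts = Counter()
    for lineages in hash_to_lineages.values():
        lca = lowest_common_ancestor(lineages)
        for depth in range(1, len(lca) + 1):
            counts[lca[:depth]] += 1
    return counts

# abbreviated lineages (superkingdom, phylum, species, strain)
baltica = ("Bacteria", "Proteobacteria", "Shewanella baltica")
os223 = baltica + ("Shewanella baltica OS223",)
os185 = baltica + ("Shewanella baltica OS185",)
akk = ("Bacteria", "Verrucomicrobia", "Akkermansia muciniphila",
       "Akkermansia muciniphila ATCC BAA-835")

# toy assignment: 31 hashes unique to OS223, 20 unique to OS185,
# 33 found in both strains, and 53 unique to Akkermansia
toy = {}
toy.update({h: [os223] for h in range(0, 31)})
toy.update({h: [os185] for h in range(31, 51)})
toy.update({h: [os223, os185] for h in range(51, 84)})
toy.update({h: [akk] for h in range(84, 137)})

counts = summarize(toy)
total = len(toy)
for lineage, n in counts.most_common():
    print("%5.1f%% %5d   %s" % (100 * n / total, n, ";".join(lineage)))
```

Hashes shared by both Shewanella strains stop at the species level, which is why the species line counts more hashes than the two strain lines combined.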
