Welcome to sourmash!
sourmash is a command-line tool and Python library for computing hash sketches from DNA sequences, comparing them to each other, and plotting the results. This allows you to estimate sequence similarity between even very large data sets quickly and accurately.
sourmash can be used to quickly search large databases of genomes for matches to query genomes and metagenomes; see our list of available databases.
sourmash also includes k-mer based taxonomic exploration and classification routines for genome and metagenome analysis. These routines can use the NCBI taxonomy but do not depend on it in any way.
The paper Large-scale sequence comparisons with sourmash (Pierce et al., 2019) gives an overview of how sourmash works and what its major use cases are. Please also see the mash software and paper (Ondov et al., 2016) for background information on how and why MinHash works.
Questions? Thoughts? Ask us on the sourmash issue tracker!
To use sourmash, you must be comfortable with the UNIX command line; programmers may find the Python library and API useful as well.
If you use sourmash, please cite us!
Brown and Irber (2016), sourmash: a library for MinHash sketching of DNA. Journal of Open Source Software, 1(5), 27, doi:10.21105/joss.00027
sourmash in brief
sourmash uses MinHash-style sketching to create “signatures”, compressed representations of DNA/RNA sequence. These signatures can then be stored, searched, explored, and taxonomically annotated.
sourmash provides command line utilities for creating, comparing, and searching signatures, as well as plotting and clustering signatures by similarity (see the command-line docs).
sourmash can search very large collections of signatures to find matches to a query.
sourmash can also identify parts of metagenomes that match known genomes, and can taxonomically classify genomes and metagenomes against databases of known species.
sourmash can be used to search databases of public sequences (e.g. all of GenBank) and can also be used to create and search databases of private sequencing data.
sourmash supports saving, loading, and communication of signatures via JSON, a ~human-readable and editable format.
sourmash also has a simple Python API for interacting with signatures, including support for online updating and querying of signatures (see the API docs).
sourmash relies on an underlying Rust core for performance.
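To make the idea of a signature more concrete, here is a minimal pure-Python sketch of bottom-k MinHash sketching and Jaccard estimation. It does not use the sourmash API and is not how sourmash is implemented: the helper names (`kmers`, `hash_kmer`, `sketch`, `jaccard_estimate`), the BLAKE2b hash, and the parameter defaults are all assumptions chosen for illustration, and reverse-complement handling is left out for brevity.

```python
import hashlib
import random


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (illustrative hash choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, num=500):
    """Bottom-k MinHash sketch: the `num` smallest distinct k-mer hashes."""
    hashes = {hash_kmer(km) for km in kmers(seq.upper(), k)}
    return sorted(hashes)[:num]


def jaccard_estimate(a, b, num=500):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    merged = sorted(set(a) | set(b))[:num]  # bottom-k of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Toy data: a random "genome" and a copy with every 100th base changed.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
mutated = "".join(swap[b] if i % 100 == 0 else b for i, b in enumerate(genome))

similarity = jaccard_estimate(sketch(genome), sketch(mutated))
print(f"estimated Jaccard similarity: {similarity:.2f}")
```

Each sketch holds at most `num` integers regardless of sequence length, which is what makes comparing very large data sets cheap. For real work, use the sourmash command line or Python API rather than this toy.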
To install sourmash, you can use pip:
$ pip install sourmash
or conda:
$ conda install -c conda-forge -c bioconda sourmash
Please see the README file in github.com/dib-lab/sourmash for more information.
Memory and speed
sourmash has relatively small disk and memory requirements compared to many other software programs used for genome search and taxonomic classification.
sourmash search and sourmash gather can be used to search all GenBank microbial genomes (using our prepared databases) with about 20 GB of disk and in under 1 GB of RAM. Typically a search for a single genome takes about 30 seconds on a laptop.
sourmash lca can be used to search/classify against all GenBank microbial genomes with about 200 MB of disk space and about 10 GB of RAM. Typically a metagenome classification takes about 1 minute.
sourmash cannot find matches across large evolutionary distances.
sourmash seems to work well to search and compare data sets for matches at the species and genus level, but does not have much sensitivity beyond that. (It seems to be particularly good at strain-level analysis.) You should use protein-based analyses to do searches across larger evolutionary distances.
sourmash signatures can be very large.
We use a modification of the MinHash sketch approach that allows us to search the contents of metagenomes and large genomes with no loss of sensitivity, but there is a tradeoff: there is no guaranteed limit to signature size when using ‘scaled’ signatures.
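The tradeoff can be illustrated with a small pure-Python sketch. It assumes that a 'scaled' signature keeps every k-mer hash falling below a fixed fraction (1/scaled) of the hash space, so the number of retained hashes grows with the number of distinct k-mers instead of being capped. The function names, the BLAKE2b hash, and the parameter values are hypothetical and chosen for illustration; this is not the sourmash API or its implementation.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (illustrative hash choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k=21, scaled=100):
    """Keep every distinct k-mer hash below MAX_HASH / scaled."""
    threshold = MAX_HASH // scaled
    seq = seq.upper()
    hashes = (hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def containment(query, subject):
    """Fraction of the query's hashes that also occur in the subject."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)


# Toy data: two random "genomes" and a "metagenome" that contains both.
rng = random.Random(0)
genome_a = "".join(rng.choice("ACGT") for _ in range(20000))
genome_b = "".join(rng.choice("ACGT") for _ in range(20000))
metagenome = genome_a + genome_b

sk_a = scaled_sketch(genome_a)
sk_meta = scaled_sketch(metagenome)

print("hashes kept for genome_a:  ", len(sk_a))
print("hashes kept for metagenome:", len(sk_meta))
print("containment of genome_a in metagenome:", containment(sk_a, sk_meta))
```

Because both sketches use the same threshold, every hash kept for `genome_a` is also kept for the metagenome, so the genome can be found inside the larger data set. The cost is that the sketch for the metagenome is larger, and nothing bounds its size as the input grows.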
- Using sourmash from the command line
- sourmash tutorials and notebooks
- Using sourmash: a practical guide
- Classifying signatures:
- Prepared search databases
- Additional information on sourmash
- Developer information