Welcome to sourmash!
sourmash is a command-line tool and Python library for computing hash sketches from DNA sequences, comparing them to each other, and plotting the results. This allows you to estimate sequence similarity between even very large data sets quickly and accurately.
sourmash can be used to quickly search large databases of genomes for matches to query genomes and metagenomes; see our list of available databases.
sourmash also includes k-mer based taxonomic exploration and classification routines for genome and metagenome analysis. These routines can use the NCBI and GTDB taxonomies but do not depend on them specifically.
The paper Large-scale sequence comparisons with sourmash (Pierce et al., 2019) gives an overview of how sourmash works and what its major use cases are. Please also see the mash software and paper (Ondov et al., 2016) for background information on how and why MinHash works.
Questions? Thoughts? Ask us on the sourmash issue tracker!
Want to migrate to sourmash v4? sourmash v4 is now available, and has a number of incompatibilities with v2 and v3. Please see our migration guide!
To use sourmash, you must be comfortable with the UNIX command line; programmers may find the Python library and API useful as well.
If you use sourmash, please cite us!
Brown and Irber (2016), sourmash: a library for MinHash sketching of DNA. Journal of Open Source Software, 1(5), 27, doi:10.21105/joss.00027
sourmash in brief
sourmash uses MinHash-style sketching to create “signatures”, compressed representations of DNA/RNA sequence. These signatures can then be stored, searched, explored, and taxonomically annotated.
sourmash provides command line utilities for creating, comparing, and searching signatures, as well as plotting and clustering signatures by similarity (see the command-line docs).
sourmash can search very large collections of signatures to find matches to a query.
sourmash can also identify parts of metagenomes that match known genomes, and can taxonomically classify genomes and metagenomes against databases of known species.
sourmash can be used to search databases of public sequences (e.g. all of GenBank) and can also be used to create and search databases of private sequencing data.
sourmash supports saving, loading, and communication of signatures via JSON, a ~human-readable and editable format.
sourmash also has a simple Python API for interacting with signatures, including support for online updating and querying of signatures (see the API docs).
sourmash relies on an underlying Rust core for performance.
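To make the idea of MinHash-style sketching more concrete, here is a minimal pure-Python sketch. It is an illustration only and does not use the sourmash API: the function names (`kmers`, `hash_kmer`, `minhash_sketch`, `estimate_jaccard`), the choice of BLAKE2b as the hash, and the tiny `k` and `num` values are all assumptions made for this example. It also ignores reverse complements, so it should not be expected to reproduce sourmash's own signatures.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash chosen for this example)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=5, num=20):
    """Keep the `num` smallest distinct hash values as the sketch."""
    hashes = {hash_kmer(km) for km in kmers(seq.upper(), k)}
    return sorted(hashes)[:num]


def estimate_jaccard(sketch_a, sketch_b, num=20):
    """Estimate Jaccard similarity from two bottom-`num` sketches."""
    # Sketch of the union: the `num` smallest hashes seen in either sketch.
    merged = sorted(set(sketch_a) | set(sketch_b))[:num]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    # Fraction of the union sketch that is present in both inputs.
    return sum(1 for h in merged if h in shared) / len(merged)
```

Two sequences are compared by sketching each one and passing both sketches to `estimate_jaccard`; only the small sketches, not the full sequences, are needed for the comparison.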
You can use pip:
$ pip install sourmash
or conda:
$ conda install -c conda-forge -c bioconda sourmash
Please see the README file in github.com/sourmash-bio/sourmash for more information.
Memory and speed
sourmash has relatively small disk and memory requirements compared to many other software programs used for genome search and taxonomic classification.
sourmash search and sourmash gather can be used to search 100k GenBank microbial genomes (using our prepared databases) with about 20 GB of disk and in under 1 GB of RAM. Typically a search for a single genome takes about 30 seconds on a laptop.
sourmash lca can be used to search/classify against all GenBank microbial genomes with about 200 MB of disk space and about 10 GB of RAM. Typically a metagenome classification takes about 1 minute on a
We support the use of sourmash in pipelines and applications by communicating clearly about bug fixes, feature additions, and feature changes. We use version numbers as follows:
Major releases, like v4.0.0, may break backwards compatibility at the command line as well as top-level Python/Rust APIs.
Minor releases, like v4.1.0, will remain backwards compatible but may introduce significant new features.
Patch releases, like v4.1.1, are for minor bug fixes; full backwards compatibility is retained.
If you are relying on sourmash in a pipeline or application, we suggest specifying your version requirements at the major release, e.g. in conda you would specify
See the Versioning docs for more information on what our versioning policy means in detail, and how to migrate between major versions!
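As a rough illustration of the major/minor/patch scheme described above, the following sketch classifies an upgrade between two version strings and checks a major-release pin. The helper names (`parse_version`, `upgrade_kind`, `same_major`) are hypothetical and are not part of sourmash; the sketch assumes plain three-part versions such as `v4.1.1`.

```python
def parse_version(text):
    """Turn 'v4.1.1' or '4.1.1' into a (major, minor, patch) tuple of ints."""
    major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
    return major, minor, patch


def upgrade_kind(old, new):
    """Name the first version component that differs between old and new."""
    old_v, new_v = parse_version(old), parse_version(new)
    for name, a, b in zip(("major", "minor", "patch"), old_v, new_v):
        if a != b:
            return name
    return "none"


def same_major(installed, required_major):
    """True if the installed version satisfies a pin on the major release."""
    return parse_version(installed)[0] == required_major
```

Under the policy above, only an upgrade that `upgrade_kind` reports as `"major"` may break backwards compatibility, which is why pinning at the major release is the suggested granularity.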
sourmash cannot find matches across large evolutionary distances.
sourmash seems to work well to search and compare data sets for nucleotide matches at the species and genus level, but does not have much sensitivity beyond that. (It seems to be particularly good at strain-level analysis.) You should use protein-based analyses to do searches across larger evolutionary distances.
sourmash signatures can be very large.
We use a modification of the MinHash sketch approach that allows us to search the contents of metagenomes and large genomes with no loss of sensitivity, but there is a tradeoff: there is no guaranteed limit to signature size when using ‘scaled’ signatures.
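The tradeoff can be seen in a small pure-Python sketch. This is an illustration under stated assumptions, not sourmash's implementation: the names `scaled_sketch` and `containment`, the BLAKE2b hash, and the small `k` are chosen for the example. The idea shown is that a 'scaled' sketch keeps every hash that falls below a fixed threshold, so its size grows with the number of distinct k-mers instead of being capped at a fixed count.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the 64-bit hash space used in this example


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash chosen for this example)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k=5, scaled=4):
    """Keep every k-mer hash below MAX_HASH // scaled.

    On average about 1/scaled of the distinct k-mers are retained,
    so there is no fixed upper bound on the sketch size.
    """
    threshold = MAX_HASH // scaled
    hashes = (hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def containment(query, subject):
    """Fraction of the query sketch's hashes that also occur in the subject sketch."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)
```

Because the same threshold is applied to every sequence, the sketch of a genome that is fully present in a metagenome is a subset of the metagenome's sketch, which is what makes containment queries possible; the cost is that a larger or more diverse input yields a proportionally larger sketch.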
The sourmash logo was designed by Stéfanie Fares Sabbag, with feedback from Clara Barcelos, Taylor Reiter and Luiz Irber.
The logo is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
- Using sourmash from the command line
- sourmash tutorials and notebooks
- Using sourmash: a practical guide
- Classifying signatures:
- Prepared databases
- Types of databases
- Taxonomic Information (for non-LCA databases)
- Downloading and using the databases
- GTDB R08-RS214 - DNA databases
- Genbank genomes from March 2022
- GTDB R07-RS207 - DNA databases
- GTDB R06-RS202 - DNA databases
- Appendix: database use and construction details
- Appendix: Memory and time requirements
- Appendix: legacy databases
- Additional information on sourmash
- Support, Versioning, and Migration
- Developer information