Welcome to sourmash!

sourmash is a command-line tool and Python library for computing MinHash sketches from DNA sequences, comparing them to each other, and plotting the results. This allows you to estimate sequence similarity between even very large data sets quickly and accurately.

sourmash can also be used to quickly search large databases of genomes for matches to query genomes and metagenomes; see our list of available databases.

Please see the mash software and the mash paper (Ondov et al., 2016) for background information on how and why MinHash sketches work.
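For intuition before reading the paper, here is a minimal pure-Python sketch of the bottom-k MinHash idea: hash every k-mer, keep only the smallest hash values, and estimate Jaccard similarity from the overlap of two such sketches. This is not sourmash's implementation. The function names (`kmers`, `hash_kmer`, `sketch`, `estimate_jaccard`), the use of MD5 as the hash function, and the default parameters are all assumptions made for illustration, and the example ignores reverse-complement strands.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of a DNA sequence."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (MD5 is an arbitrary illustrative choice)."""
    digest = hashlib.md5(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq, k=21, num=500):
    """Bottom-k sketch: the `num` smallest distinct k-mer hashes, sorted."""
    hashes = {hash_kmer(kmer) for kmer in kmers(seq, k)}
    return sorted(hashes)[:num]


def estimate_jaccard(a, b, num=500):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `num` smallest hashes of the union, then count the
    fraction of them that occur in both sketches.
    """
    merged = sorted(set(a) | set(b))[:num]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    x = sketch("ACGTACGTTGCAAGCTTGACCATGCATGCA", k=5, num=10)
    y = sketch("ACGTACGTTGCAAGCTTGACCTTGCATGCA", k=5, num=10)
    print(estimate_jaccard(x, y, num=10))
```

Because each sketch holds at most `num` hashes regardless of input size, comparing two sketches costs the same for a plasmid as for a large metagenome, which is what makes this approach attractive for very large data sets.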

To use sourmash, you must be comfortable with the UNIX command line; programmers may find the Python library and API useful as well.

In brief,

  • sourmash provides command line utilities for creating, comparing, and searching MinHash sketches, as well as plotting and clustering sketches by distance (see the command-line docs).
  • sourmash supports saving, loading, and exchanging MinHash sketches as JSON, a format that is largely human-readable and editable.
  • sourmash also has a simple Python API for interacting with sketches, including support for online updating and querying of sketches (see the API docs).
  • sourmash is reasonably fast; it relies on an underlying CPython module.
  • sourmash is developed on GitHub and is freely and openly available under the BSD 3-clause license. Please see the README for more information on development, support, and contributing.
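The JSON storage and online-updating points in the list above can be illustrated with a toy example. The `StreamingSketch` class below is hypothetical, as are its method names and its JSON layout (`ksize`, `num`, `mins`); none of this is sourmash's actual API or signature file format, so consult the API docs for the real interface. It only shows why a bottom-k sketch can be updated incrementally and round-tripped through JSON.

```python
import hashlib
import json


class StreamingSketch:
    """Toy bottom-k sketch that supports incremental updates (hypothetical)."""

    def __init__(self, ksize=21, num=500, mins=()):
        self.ksize = ksize
        self.num = num
        self.mins = sorted(set(mins))[:num]

    def add_sequence(self, seq):
        """Fold the k-mers of one more sequence into the sketch."""
        seq = seq.upper()
        hashes = set(self.mins)
        for i in range(len(seq) - self.ksize + 1):
            kmer = seq[i:i + self.ksize]
            digest = hashlib.md5(kmer.encode("ascii")).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
        # Keeping only the smallest `num` hashes after each update gives
        # the same result as sketching all sequences at once.
        self.mins = sorted(hashes)[:self.num]

    def to_json(self):
        """Serialize to an illustrative JSON layout (not sourmash's format)."""
        return json.dumps(
            {"ksize": self.ksize, "num": self.num, "mins": self.mins},
            indent=2,
        )

    @classmethod
    def from_json(cls, text):
        """Rebuild a sketch from the layout written by `to_json`."""
        data = json.loads(text)
        return cls(ksize=data["ksize"], num=data["num"], mins=data["mins"])


if __name__ == "__main__":
    sk = StreamingSketch(ksize=5, num=10)
    sk.add_sequence("ACGTACGGTCA")
    sk.add_sequence("TTGACCATGCA")  # update after the fact
    restored = StreamingSketch.from_json(sk.to_json())
    print(restored.mins == sk.mins)
```

The order in which sequences are added does not matter here, because the smallest `num` hashes of a union can always be recovered from the smallest `num` hashes of each part.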

You can look at sourmash analyses of real data in a saved Jupyter notebook, and experiment with the notebook interactively using Binder at mybinder.org.
