Welcome to sourmash!
sourmash is a command-line tool and Python library for computing MinHash sketches from DNA sequences, comparing them to each other, and plotting the results. This allows you to estimate sequence similarity between even very large data sets quickly and accurately.
Please see the mash software and the mash paper (Ondov et al., 2016) for background information on how and why MinHash sketches work.
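As a rough illustration of the idea behind MinHash sketches, the following pure-Python sketch builds a "bottom-n" sketch of a DNA string and estimates the Jaccard similarity of two sequences from their sketches. It is not sourmash's or mash's implementation: the function names, the SHA-1 based hash, and the defaults `k=21` and `n=500` are all assumptions chosen for the example, and reverse-complement handling is left out for brevity.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of a DNA string."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (the hash choice is an assumption)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq, k=21, n=500):
    """Bottom-n sketch: the n smallest distinct k-mer hashes, sorted."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:n]


def estimate_jaccard(a, b, n=500):
    """Estimate Jaccard similarity from two bottom-n sketches."""
    merged = sorted(set(a) | set(b))[:n]  # bottom-n of the combined hashes
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    # Fraction of the smallest combined hashes that occur in both sketches.
    return sum(1 for h in merged if h in shared) / len(merged)
```

Because each sketch keeps at most `n` hashes regardless of input length, comparing two sketches costs the same whether the underlying sequences are a few kilobases or whole genomes, which is what makes the approach practical for very large data sets.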
To use sourmash, you must be comfortable with the UNIX command line; programmers may find the Python library and API useful as well.
In brief,
- sourmash provides command-line utilities for creating, comparing, and searching MinHash sketches, as well as plotting and clustering sketches by distance (see the command-line docs).
- sourmash supports saving, loading, and communication of MinHash sketches via YAML, a roughly human-readable and editable format.
- sourmash also has a simple Python API for interacting with sketches, including support for online updating and querying of sketches (see the API docs).
- sourmash isn’t terribly slow, and relies on an underlying CPython module.
- sourmash is developed on GitHub and is freely and openly available under the BSD 3-clause license. Please see the README for more information on development, support, and contributing.
You can look at sourmash analyses of real data in a saved Jupyter notebook, and experiment with them interactively using a Binder at mybinder.org.