Additional information on sourmash

Other MinHash implementations for DNA

In addition to mash, also see:

  • RKMH: Read Classification by Kmers
  • mashtree: For building trees using Mash
  • Finch: "Fast sketches, count histograms, better filtering."
  • BBMap and SendSketch: part of Brian Bushnell's tool collection.
  • PATRIC uses MinHash for genome search.

If you are interested in exactly how these MinHash approaches calculate the hashes of DNA sequences, please see the simple Python code in sourmash at utils/compute-dna-mh-another-way.py.
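For orientation, here is a minimal sketch of the general idea behind a bottom-k MinHash over DNA k-mers. It is not the code in the utility script above, and it will not reproduce the hash values that sourmash or mash compute: SHA-1 is used only as a stand-in hash because it ships with the Python standard library. The function names and the default values of `k` and `num` are hypothetical, chosen for illustration.

```python
import hashlib

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield each k-mer as the lexicographically smaller of itself and its
    reverse complement, so both strands produce the same k-mers."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(COMPLEMENT)[::-1]
        yield min(kmer, revcomp)


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer.

    Assumption: SHA-1 is a stand-in here, not the hash the real tools use.
    """
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(seq, k=21, num=500):
    """Return the `num` smallest distinct k-mer hashes (a bottom-k sketch)."""
    hashes = {hash_kmer(kmer) for kmer in canonical_kmers(seq, k)}
    return sorted(hashes)[:num]


def jaccard_estimate(sketch_a, sketch_b, num):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `num` smallest hashes of the union and count how many of
    them appear in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:num]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Because each k-mer is reduced to its canonical form before hashing, a sequence and its reverse complement yield identical sketches, for example `minhash_sketch(seq, k=5, num=10)` for either strand of the same `seq`.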

Presentations and posters

Taxonomic classification of microbial metagenomes using MinHash signatures, Brooks et al., 2017. Presented at ASM.

JSON format for the signature

The JSON format is not necessarily final; this is a TODO item for future releases. In particular, we'd like to update it to store more metadata for samples.

Interoperability with mash

The default sketches computed by sourmash and mash are comparable, but we are still working on ways to convert between the two file formats.

Developing sourmash

Please see:

Known issues

For at least some versions of matplotlib, users may encounter an error "Failed to connect to server socket:" or "RuntimeError: Invalid DISPLAY variable". This is because by default matplotlib tries to connect to X11 to use the Tkinter backend.

The solution is to force the use of the 'Agg' backend in matplotlib; see this Stack Overflow answer or this sourmash issue comment.
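One way to request the 'Agg' backend is sketched below. It sets matplotlib's `MPLBACKEND` environment variable from Python before matplotlib is imported; this is one of several options and is not necessarily the approach described in the linked answers. The helper name `save_line_plot` is hypothetical and exists only to show that plotting to a file then works without a display.

```python
import os

# Request the non-interactive Agg backend. This must happen before
# matplotlib is imported anywhere in the process.
os.environ["MPLBACKEND"] = "Agg"


def save_line_plot(path):
    """Draw a trivial line plot and write it to `path` without needing X11."""
    import matplotlib.pyplot as plt  # imported only after MPLBACKEND is set

    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `matplotlib.use('Agg')` immediately after `import matplotlib`, and before importing `matplotlib.pyplot`, should have the same effect; setting `MPLBACKEND=Agg` in the shell before running the command avoids changing any code.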

Newer versions of matplotlib do not seem to have this problem.