Additional information on sourmash
Other MinHash implementations for DNA
In addition to mash, also see:
- RKMH: Read Classification by Kmers.
- mashtree for building trees using Mash distances.
- Finch: a Mash implementation in Rust. In the project's words: “Fast sketches, count histograms, better filtering.”
If you are interested in exactly how these MinHash approaches calculate the hashes of DNA sequences, please see some simple Python code in sourmash, utils/compute-dna-mh-another-way.py.
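As a rough illustration of the general idea (not a copy of the script mentioned above), the following sketch computes a bottom-k MinHash over the canonical k-mers of a DNA sequence. The function names, the default `k` and `num` values, and the use of BLAKE2b as a stand-in 64-bit hash are all assumptions made for this example; hash values produced here should not be expected to match those from mash or sourmash.

```python
import hashlib

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def hash_kmer(kmer):
    """Hash a k-mer to a 64-bit integer.

    Assumption: BLAKE2b is used here only because it ships with the
    standard library; it is a stand-in, not the hash used by real tools.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=21, num=500):
    """Return the `num` smallest distinct hashes of the canonical k-mers."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous bases such as N.
        if set(kmer) - set("ACGT"):
            continue
        # Use the lexicographically smaller of the k-mer and its reverse
        # complement so that both strands give the same sketch.
        canonical = min(kmer, reverse_complement(kmer))
        hashes.add(hash_kmer(canonical))
    return sorted(hashes)[:num]
```

Because each k-mer is reduced to its canonical form before hashing, `minhash_sketch(seq)` and `minhash_sketch(reverse_complement(seq))` return the same list, which is why a sketch does not depend on which strand was sequenced.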
Blog posts
We have a number of blog posts on sourmash and MinHash more generally:
- Applying MinHash to cluster RNAseq samples
- MinHash signatures as ways to find samples, and collaborators?
- Efficiently searching MinHash Sketch collections - indexing and searching 42,000 bacterial genomes with Sequence Bloom Trees.
- Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists! - indexing and searching 50,000 microbial genomes, round 2.
- What metadata should we put in MinHash Sketch signatures? - crowdsourcing ideas for what metadata belongs in a signature file.
- Minhashing all the things (part 1): microbial genomes - on approaches to computing MinHashes for large collections of public data.
JSON format for the signature
The JSON format is not necessarily final; this is a TODO item for future releases. In particular, we’d like to update it to store more metadata for samples.
Interoperability with mash
The default sketches computed by sourmash and mash are comparable, but we are still working on ways to convert the file formats.