mash is an awesome piece of software that inspired sourmash - hence the name we chose for sourmash! But are they the same? No, they are not!
mash is based on MinHash sketching, a technique for randomly selecting a very small set of k-mers to represent a potentially very large sequence. MinHash sketches can be used to estimate Jaccard similarity (a distance metric that lets you find closely related sequences) with very high accuracy, under one condition - that the two sequences be of similar size.
So, MinHash sketching is great when comparing bacterial genomes, which are all around 2-10 MB in size.
But… MinHash sketching doesn’t work when comparing metagenomes to genomes, because metagenomes are usually much larger than genomes.
So we built sourmash around a different kind of sketch - FracMinHash - which can estimate Jaccard similarity between sets of very different sizes. FracMinHash sketches also support overlap and containment analysis, and are convenient in a variety of other ways - see our paper on FracMinHash, Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers.
There are some drawbacks to FracMinHash sketches - read on!
There are two drawbacks to FracMinHash relative to MinHash. Neither one is a showstopper, but you should know about them!
One drawback is that FracMinHash sketches can be a lot bigger than MinHash sketches - technically speaking, MinHash sketches are bounded in size (you pick the number of hashes to keep!) while FracMinHash sketches can get arbitrarily large. In practice this means that when you’re sketching large metagenomes, you end up with large sketches. How large depends on the cardinality of the metagenome k-mers - if a metagenome has 1 billion unique k-mers, a FracMinHash sketch with scaled parameter of 1000 will contain a million hashes.
The other drawback is that FracMinHash sketches don’t work well for very small sequences. Our default parameter choice for DNA (scaled=1000) works well for finding 10 kb or larger matches between sequences - some simple Poisson matching math suggests that about 99.98% of 5kb overlaps will be found with scaled=1000.
Please see the k-mers and minhash tutorial.
I would suggest reading these four papers, in order:
Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2022. This is the fullest technical description of FracMinHash available.
Mash: fast genome and metagenome distance estimation using MinHash, Ondov et al., 2016. This is the original paper that inspired sourmash. It discusses sketching with MinHash and does a great job of showing how well Jaccard estimation works for comparing genomes! A good contrasting point to take into account is that MinHash cannot do overlap or containment estimation, which nicely motivates the previous paper and the next two.
Mash Screen: high-throughput sequence containment estimation for genome discovery, Ondov et al., 2019; and CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices. Both papers discusses containment and metagenome analysis extensively, and use an approach that can be usefully contrasted with sourmash. There is a nice blog post on mash screen that is worth reading, too!
If you want a nice chaser, please see this section at the end of the blog post above:
It would be great to see additional methods developed to process containment scores, reduce the output redundancy, and report accurate compositional estimates for metagenomes. One easy approach is a “winner take all” model, like sourmash implements.
The short answer is: for DNA, use k=31.
Slightly longer answer: when we look at the k-mer distribution across all of the bacterial genomes in GTDB, we find that 99% (or more) of 31-mers are genome, species, or genus specific.
If you go lower (say, k=21), then you get a few percent of k-mers that match above the genus level - family or above. This is useful for fuzzier matching, e.g. if you have a metagenome with a high fraction of unknown k-mers.
If you go higher (k=51), a higher percentage of k-mers are genome-specific. This can be valuable in situations where you have highly specific reference genomes (e.g. isolates, single-cell genomes, or MAGs) that should match very closely to this metagenome.
For the core sourmash operations - search, gather, and compare - we believe (with evidence!) that (a) the differences between k=21, k=31, and k=51 are negligible; and that (b) k=31 works fine for most day-to-day use of sourmash.
We also provide Genbank and GTDB databases for k=21, k=31, and k=51, so choosing from those k-mer sizes for your own sketches will allow you to directly use those databases.
For some background on k-mer specificity, we recommend this paper: MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation, Koslicki & Falush, 2016.
We recommend scaled=1000 or scaled=10000 when working with bacterial and archaeal sketches and DNA. We have quite a bit of experience with this, and even some published benchmarks showing that this works very well. You may need to use lower scaled values with smaller query and target sequences, such as viral genomes or genes, but we do not have systematic advice on this.
That having been said, you can always use a lower scaled value - the only consequence is that memory and compute requirements increase.
Also, sourmash will automatically use the larger of two scaled values when comparing two sketches with different scaled values. So if, for example, you use the precomputed databases, you will always end up using your query sketches at a minimum scaled of 1000, even if you created them with a lower scaled value.
Please also see What resolution should my signatures be?.
--threshold-bp sets the minimum estimated overlap for reporting
a match, in both the
prefetch commands. The default is 50kb, and
this works well for microbial-genome-scale work, where the genomes are often
quite large (one or more megabases).
In case you need more sensitivity, setting
--threshold-bp=0 will return any
match that shares at least one hash. This will also increase potential
false positives, however.
We have found a good intermediate threshold is 3 times the
scaled value, e.g.
--threshold-bp=3000 for a scaled value of 1000. This requires at least three
overlapping hashes before a match is reported. If you are using a lower scaled
value (a higher density sketch) because you are looking for matches between
shorter sequences, then setting threshold-bp to 3 times that scaled value will
take advantage of the increased sensitivity to short matches without introducing
more false positives.
tl;dr very well! But it’s a bit one sided: if k-mers match, reads will map, but not necessarily vice versa. So read mapping rates are almost always higher than k-mer matching rates.
Let’s start by looking at a simple example: suppose you have Illumina
shotgun sequencing of a new isolate, and you want to compare it to a
reference genome for a member of the same species. You calculate
k-mer containment of the reference genome in the isolate shotgun
sequence to be 65%, using e.g.
sourmash search --containment ref.sig.gz isolate.sig.gz. You then map the reads from the isolate
data to the reference genome. What should you expect to see?
What you should see is that 65% or more of the reference genome is covered by at least one read. This is known as the mapping-based “detection” of the reference genome in the read data set, and k-mer detection typically underestimates mapping-based detection.
(If you want to know how many of the reads will map, you need to use
a different number that is output by
sourmash gather - although, for
single genomes with only a few repeats, the percentage of reads that
map should match the k-mer detection number you calculate with
containment. Read the section below on metagenome analysis for more
information on read mapping rates!)
K-mers underestimate read mapping because k-mers rely on exact matches, while read mapping tolerates mismatches. So unless you are looking at entirely error free data with no strain variation, mismatches will sometimes prevent k-mers from matching.
For some math: consider a k-mer of size 21. It will only match to a specific 21 base sequence exactly. If you are matching two isolate genomes that align completely but have a 95% average nucleotide identity, then one out of every 20 bases will be different (on average), and, on average, every 21-mer will be different! (In practice you’d expect about 1/3 of the k-mers to match if the variation is random, since there will be many 21-base windows that don’t have any mismatches.) However, alignment-based approaches like read mapping will happily align across single-base mismatches, and so even at 95% ANI you would expect many reads to align.
In practice, the numbers will differ, but the intuition remains!
Shotgun metagenome sequencing adds several complicating factors when comparing metagenome reads to a reference database. First, each genome will generally be present at a different abundance in the metagenome. Second, metagenomes rarely contain the exact strain genome present in a reference database; often the genomes in the metagenome differ both in terms of SNPs and accessory elements from what’s in the reference database. And third, metagenomes often contain complicated mixtures of strains - even if one strain is dominant.
sourmash gather method is built for analyzing metagenomes against
reference databases, and it does so by finding the shortest list of
reference genomes that “cover” all of the k-mers in the metagenome.
This list is arranged by how many k-mers from the metagenome are covered
by that entry in the output: the first match is the biggest, and the
second match is the second biggest, and so on. We call this a “minimum
metagenome cover” and it is described in the Irber et al., 2022, paper below.
When we construct the minimum metagenome cover, it correlates well with mapping (per Irber et al., 2022), with one caveat: you need to assign reads to the reference genomes in the rank order output by gather. This is needed to properly assign reads that map to multiple genomes to the “best” match - the one that will capture the most reads.
sourmash gather is still a research method but it seems to work
pretty well - in particular, it is both highly sensitive and highly
specific in taxonomic benchmarking. Please ask questions as you have them!
Yes! (And see the FAQ above, How do k-mer analyses compare with read mapping?)
If you’re interested in picking a single best reference genome (from a large database) for read mapping, you can do the following:
sketch all your reference genomes with
sourmash sketch dna, and/or download one of our prepared databases. You can use default parameters for
sketch your read collection with
sourmash sketch dna.
sourmash prefetch reads.sig <reference genome sketch collection(s)> -o matches.csv
f_match_querycolumn and pick the highest value - this is the k-mer detection - and pick the match name from the
use that reference genome as your mapping target.
If you want to map a metagenome to multiple references, consider
sourmash gather and/or
the genome-grist workflow.
(This is also known as “read recruitment.”)
If sourmash reports that a particular strain or genome is present in a metagenome, how do you retrieve the reads using sourmash?
The short answer is: you have to use a different tool. You can do read mapping between the metagenome and the relevant reference genome (which can be automated with the genome-grist workflow; or, if you are interested in retrieving accessory elements, you can try out spacegraphcats.
sourmash hashes k-mers into 64-bit numbers, so the size of what is stored is independent of the k-mer size. The only impact of k-mer size on sourmash behavior is then more on the biology side - how many matches do you gain (or lose) with that k-mer size? And do you have a lot of new k-mers that pop up with a longer k-mer size (e.g. because of included variation)? These questions must be answered by experimentation and may be data-set specific.
sourmash is currently single-threaded, but the branchwater plugin for sourmash provides faster and lower-memory multithreaded implementations of several important sourmash features - sketching, searching, and gather (metagenome decomposition). It does so by implementing higher-level functions in Rust on top of the core Rust library of sourmash. As a result it provides some of the same functionality as sourmash, but 10-100x faster and in 10x lower memory. Note that this code is functional and tested, but does not have all of the features of sourmash. Code and features will be integrated back into sourmash as they mature.