A guide to the internal design and structure of sourmash

sourmash was created in 2015, and has been repeatedly reorganized, refactored, and optimized to support ever larger databases, faster queries, and new use cases. We’ve also regularly added new functionality and features. So sourmash can be pretty complicated internally, and our user-facing documentation only covers a fraction of its potential!

This document is a brain dump intended for expert users and sourmash developers who want to understand how, why, and when to use various sourmash features. It is unlikely ever to be comprehensive, so the information you are interested in may not yet exist in this document, but we are always happy to add to it - just ask in an issue!

Signatures and sketches

sourmash operates on sketches. Each sketch is a collection of hashes, which are in turn built from k-mers by applying a hash function (currently always murmurhash) and a filtering function. Each sketch is contained in a signature wrapper that contains some metadata.

Internally, sketches (class MinHash) contain the following information:

  • a set of hashes;

  • an optional abundance for each hash (when track_abund is True);

  • a seed;

  • a k-mer size;

  • a molecule type;

  • either a num (for MinHash) or a scaled value (for FracMinHash);

Signature objects (class SourmashSignature) contain a sketch (property .minhash) as well as additional information:

  • an optional name

  • an optional filename

  • a license (currently must be CC0);

  • an md5sum(...) method that returns a hash of the sketch.

For now, we tend to refer to signatures and sketches interchangeably, because they are almost entirely 1:1 in the code base (but see sourmash#616).

The default signature interchange/serialization format is JSON, optionally gzipped. This format is written and read by Rust code.

In general, a lot of effort in sourmash is spent managing collections of signatures before actually doing comparisons with them; see manifests, and Index objects, below.

Making sketches

Sketches are produced by hashing k-mers with murmurhash and then keeping either the lowest num hashes (for MinHashes sketches) or keeping all hashes below 2**64 / scaled (for FracMinHash sketches). This has the effect of selecting approximately one hash for every scaled k-mers - so, when sketching a set of 100,000 distinct k-mers, a scaled value of 1,000 would yield approximately 100 hashes to be retained in the sketch.

The default MinHash sketches use parameters so that they are compatible with mash sketches.

See utils/compute-dna-mh-another-way.py for details on how k-mers are hashed.

Note that if hashes are produced some other way (with a different hash function) or from some source other than DNA, sourmash can still work with them; only sourmash sketch actually cares about DNA sequences, everything else works with hashes.

Compatibility checking

The point of the signatures and sketches is to enable certain kinds of rapid comparisons - Jaccard similarity and number of overlapping k-mers, specifically. However, these comparisons can only be done between compatible sketches.

Here, “compatible” means -

  • the same MurmurHash seed (default 42);

  • the same k-mer size/ksize (see k-mer sizes, below);

  • the same molecule type (see molecule types, below);

  • the same num or scaled (although see this downsampling discussion, and the next two sections);

sourmash uses selectors (Index.select(...)) to select sketches with compatible ksizes, molecule types, and sketch types.

Scaled (FracMinHash) sketches support similarity and containment

Per our discussion in Irber et al., 2022, FracMinHash sketches can always be compared by downsampling to the max of the two scaled values. (This is not always indexed collections of sketches, e.g. SBTs; see sourmash#1799.)

In practice, sourmash does all necessary downsampling dynamically, but returns the original sketches. This means that (for example) you can do a low-resolution/high-scaled search quickly by specifying a high scaled value, and then use a higher resolution comparison with the resulting matches for a more refined and accurate analysis (see below, Speeding up gather and search.)

Num (MinHash) sketches support Jaccard similarity

“Regular” MinHash (or “num MinHash”) sketches are implemented the same way as in mash. However, they are less well supported in sourmash, because they don’t offer the same opportunities for metagenome analysis. (See also sourmash#1354.)

Num MinHash sketches can always be compared by downsampling to a common num value. This may need to be done manually using sourmash sig downsample, however.

Conversion between Scaled (FracMinHash) and Num (MinHash) signatures with downsample

As discussed in the previous sections, it is possible to adjust the scaled and num values to compare two FracMinHash signatures or two Num MinHash signatures. However, it is also possible to covert between the scaled and num signatures with the sourmash sig downsample command. For more details, review the command line docs for sig downsample.

Operations you can do safely with FracMinHash sketches

As described in Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, FracMinHash sketches support a wide range of operations that mirror actions taken on the underlying data set without revisiting the underlying data. This allows users to build sketches once (requiring the original data) and then do all sorts of manipulations on the sketches, and know that the results of the sketch manipulations represent what would happen if they did the same thing on the original data. For example,

  • set unions, intersections, and subtractions all perform the same when done on the sketches as when applied to the underling data. So, for example, you can sketch two files separately and merge the sketches (with sig merge), and get the same result as if you’d concatenated the files first and then sketched them.

  • if you filter hashes on abundance with sig filter, you get the same result as if you filtered the data set on k-mer abundance and then sketched it.

  • downsampling: you can sketch the original data set at a high resolution (e.g. scaled=100) and then downsample it later (to e.g. scaled=1000), and get the same result as if you’d sketched the data set at scaled=1000.

K-mer sizes

There is no explicit restriction on k-mer sizes built into sourmash.

For highly specific genome and metagenome comparisons, we typically use k=21, k=31, or k=51. For a longer discussion, see Assembly Free Analysis with K-mers from STAMPS 2022 and a more general overview at Using sourmash:a practical guide.

Molecule types - DNA, protein, Dayhoff, and hydrophobic-polar

sourmash supports four different sequence encodings, which we refer to as “molecule”: DNA (--dna), protein (--protein), Dayhoff encoding, (--dayhoff), and hydrophobic-polar (--hp).

All FracMinHash sketches have exactly one molecule type, and can only be compared to the same molecule type (and ksize).

DNA moltype sketches can be constructed from DNA input sequences using sourmash sketch dna.

Protein, Dayhoff, and HP moltype sketches can be constructed from protein input sequences using sourmash sketch protein, or from DNA input sequences using sourmash sketch translate; translate will translate in all six reading frames (see also orpheum from Botvinnik et al., 2021). By default protein sketches will be created; dayhoff sketches can be created by including dayhoff in the param string, e.g. sourmash sketch protein -p dayhoff, and hydrophobic-polar sketches can be built with hp in the param string, e.g. sourmash sketch protein -p hp.

Manifests

Manifests are catalogs of sketches: they include most of the information about a sketch, except for the actual hashes. The idea of manifests is that you can use them to identify which sketches you are interested in before actually working with them (loading them, for example).

sourmash makes extensive use of signature manifests to support rapid selection and lazy loading of signatures based on signature metadata (name, ksize, moltype, etc.) See Blog post: Scaling sourmash to millions of samples for some of the motivation.

Manifests are an internal format that is not meant to be particularly human readable, but the CSV format can be loaded into a spreadsheet program if you’re curious :).

If a collection of signatures is in a zipfile (.zip) or SBT zipfile (.sbt.zip), manifests must be named SOURMASH-MANIFEST.csv. They can also be stored directly on disk in CSV/gzipped CSV, or in a sqlite database; see sourmash sig manifest, sourmash sig check, and sourmash sig collect for manifest creation, management, and export utilities.

Where signatures are stored individually in Index collections, e.g. as separate files in a zipfile, manifests may be stored alongside them; for other subclasses of Index such as the inverted indices, manifests are generated dynamically by the class itself.

Currently (sourmash 4.x) manifests do not contain information about the hash seed or sketch license. This will be fixed in the future - see sourmash#1849.

Manifests are very flexible and, especially when stored in a sqlite database, can be extremely performant for organizing hundreds of thousands to millions of sketches. Please see StandaloneManifestIndex for a lazy-loading Index class that supports such massive-scale organization.

Index implementations

The Index class and its various subclasses (in sourmash.index) are containers that provide an API for organizing, selecting, and searching (potentially) large numbers of signatures.

sourmash sig summarize is a good way to determine what type of Index class is used to handle a collection.

Loading and saving of Index objects is handled separately from the class: loading can be done in Python via the sourmash.load_file_as_index(...) method, while creation and/or updating of Index objects is done via sourmash.sourmash_args.SaveSignaturesToLocation(...). These are the same APIs used by the command-line functionality.

There are quite a few different Index subclasses and they all have distinct features. We have a high-level guide to which collection type to use here.

Conceptually, Index classes are either organized around storing individual signatures often with metadata that permits loading, selecting, and/or searching them more efficiently (e.g. ZipFileLinearIndex and SBTs); or they store signatures as inverted indices (LCA_Database and SqliteIndex) that permit certain kinds of fast queries.

Unless otherwise noted, the Index classes below can be loaded concurrently in “read only” mode - that is, you should build the collection once, and then use it from multiple processes. We currently do not test for or support concurrent read/write. Note also that (generally speaking) memory footprints will be additive, so loading the same LCA_Database twice will consume twice the memory. (If you’re interested in concurrency, we suggest using the sqlite containers - see SqliteIndex.)

Zipfile collections

ZipFileLinearIndex stores signature files in a zip file with an accompanying manifest. This is the most versatile and compressed option for working with large collections of sketches - it supports rapid selection and loading of specific sketches from disk, and can store and search any mixture of sketch types (ksize, molecule type, scaled values, etc.)

By default, ZipFileLinearIndex stores one signature (equiv. one sketch) in each member file in the zip archive. Each signature is stored uncompressed. The accompanying manifest stores the full member file path in internal_location, so that sketches can be retrieved directly.

Searching a ZipFileLinearIndex is done linearly, as implied by the name. This is fine for gather but if you are doing repeated queries with search you may want to use an SBT or LCA database instead; see below.

In the future we expect to parallelize searching ZipFileLinearIndex files in Rust; see sourmash#1752.

ZipFileLinearIndex does support zip files without manifests as well as multiple signatures in a single file; this was originally intended to support simply zipping entire directory hierarchies into a zipfile. However, this slows down performance and is not recommended. If you have an existing zipfile (or really any collection of signatures) and you want to turn them into a proper ZipFileLinearIndex, you can use sig cat <collection(s)> -o combined.zip to create a ZipFileLinearIndex file named combined.zip that will have a manifest and signatures broken out into individual files.

Sequence Bloom Trees (SBTs)

Sequence Bloom Trees (SBTs; see the Kingsford Lab page for details) provide a faster (but more memory intensive) on-disk storage and search mechanism. In brief, SBTs implement a binary tree organization of an arbitrary number of signatures; each internal node is a Bloom filter containing all of the hashes for the nodes below them. This permits potentially rapid elimination of irrelevant nodes on search.

SBTs are restricted to storing and searching sketches with the same/single k-mer size and molecule type, as well as either a single num value or a single scaled value.

We suggest using SBTs when you are doing multiple Jaccard search or containment searches with genomes via sourmash search.

Lowest common ancestor (LCA) databases

The LCA_Database index class stores signatures in an inverted index, where a Python dictionary is used to link individual hashes back to signatures and/or taxonomic lineages. This supports the individualized hash analyses used in the lca submodule.

LCA databases only support a single ksize, moltype, and scaled. They can only be used with FracMinHash (scaled) sketches.

The default LCA_Database class is serialized via JSON, and loads everything into memory when requested. The load time incurs a significant latency penalty when used from the command line, as well as having a potentially large memory footprint; this makes it difficult to use the default LCA_Database for very large databases, e.g. genbank bacteria.

The newer LCA_SqliteDatabase (based on SqliteIndex, described below) also supports LCA-style queries, and is stored on disk, is fast to load, and uses very little memory. The tradeoff is that the underlying sqlite database can be quite large. LCA_SqliteDatabase should also support rapid concurrent access (see sourmash#909).

Both types of LCA database can be constructed with sourmash lca index.

SqliteIndex

The SqliteIndex storage class uses sqlite3 to store hashes and sketch information for search and retrieval; see this blog post for background information and details. These are fast, low-memory, on-disk databases, with the tradeoff that they can be quite large. This is probably currently the best solution for concurrent access to sketches via e.g. a Web server (see also sourmash#909).

SqliteIndex can only contain FracMinHash sketches and can only store sketches with the same scaled parameter. However, it can store multiple ksizes and moltypes as long as the same scaled is used.

SqliteIndex objects can be constructed using sourmash sig cat ... -o filename.sqldb.

Standalone manifests

The StandaloneManifestIndex class loads standalone manifests generated by sourmash sig collect. They support rapid selection and lazy loading on potentially extremely large collections of signatures.

The underlying mechanism uses the internal_location field of manifests to point to the container file. When particular sketches are requested, the container file is loaded into an Index object with sourmash.load_file_as_index and the md5 values of the requested sketches are used as a picklist to retrieve the desired signatures.

Thus, while standalone manifests can point at any kind of container, including JSON files or LCA databases, they are most efficient when internal_location points at a file with either a single sketch in it, or a manifest that supports direct loading of sketches. Therefore, we suggest using standalone manifest indices.

Note that searching a standalone manifest is currently done through a linear iteration, and does not use any features of indexed containers such as SBTs or LCAs. This is fine for gather with the default approach, but is probably suboptimal for a search.

Pathlists and --from-file

All (or most) sourmash commands natively support taking in lists of signature collections via pathlists, --from-file, or paths to directories. This is useful for situations where you have thousands of signature files and don’t want to provide them explicitly on the command line; you can simply put a list of the files in a text file, and pass it in directly (or use --from-file to pass it in).

Both pathlists and files passed to --from-file contain a list of paths to be loaded; relatives paths will be interpreted relative to the current working directory of sourmash. Pathlists should be universally available on sourmash commands. When --from-file is available for a command, sourmash will behave as if the file paths in the file were provided on the command line.

We suggest avoiding pathlists. Instead, we suggest using --from-file or a standalone manifest index (generated with sourmash sig collect). This is because the signatures from pathlists are loaded into memory (see MultiIndex, above) it is generally a bad idea to use them - they may be slow to load and may consume a lot of memory. They also do not support good loading error messages; see sourmash#1414.

Extensions for outputting index classes

Most commands that support saving signatures will save them in a variety of formats, based on the extension provided (see sourmash#1890 for exceptions). The supported extensions are -

  • .zip for ZipFileLinearIndex

  • .sqldb for SqliteIndex

  • .sig or .sig.gz for JSON/gzipped JSON

  • dirname/ to save in a directory hierarchy

The default signature save format is JSON, if the extension is not recognized.

Taxonomy and assigning lineages

All sourmash taxonomy handling is done within the lca and tax subcommands (CLI) and submodules (Python).

In the case of the lca subcommands, the taxonomic information is incorporated into the LCA database construction (see the lca index command), while the tax subcommands load taxonomic information on demand from taxonomy databases (CSVs or databases).

sourmash anchors all taxonomy to identifiers, and uses the signature name to do so - this is the name as set by the --name parameter to sourmash sketch, and output by sourmash sig describe as the signature: field.

Identifier handling

sourmash prefers identifiers to be the first space-separated token in the signature name. This token can contain any alphanumeric letters other than space, and should contain at most one period. The version of the identifier will be the component after the period.

So, for example, for a signature name of

CP001941.1 Aciduliprofundum boonei T469, complete genome

the identifier would be CP001941.1 and the version would be 1. There are no other constraints placed on the identifier, and versions are not handled in any special way other than as below.

The lca index and tax commands both support some modified identifier handling in sourmash 3.x and 4.x, but in the future, we plan to deprecate these as they mostly cause confusion and internal complexity.

The two modifiers are:

  • --keep-full-identifiers will use the entire signature name instead of just the first space-separated token. It is by default off (set to False).

  • --keep-identifier-versions turns on keeping the full identifier, including what is after the first period. It is by default off (set to False), stripping identifiers of their version on load. When it is on (True), identifiers are not stripped of their version on load.

Taxonomies, or lineage spreadsheets

sourmash supports arbitrary (free) taxonomies, and new taxonomic lineages can be created and used internally as long as they are provided in the appropriate spreadsheet format.

You can also mix and match taxonomies as you need; for example, it is entirely legitimate in sourmash-land to combine the GTDB taxonomy for bacterial and archaeal sequence classification, with the NCBI taxonomy for eukaryotic and viral sequence classification. (You probably don’t want to mix and match within superkingdoms, though!)

As of sourmash v4, lineage spreadsheets should contain columns for superkingdom, phylum, class, order, family, genus, and species. Some commands may also support a ‘strain’ column, although this is inconsistently handled within sourmash internally at the moment.

For spreadsheet organization, lca index expects the columns to be present in order from superkingdom on down, while the tax subcommands use CSV column headers instead. We are planning to consolidate around the tax subcommand handling in the future (see sourmash#2198).

An example spreadsheet is here, bacteria_refseq_lineage.csv. (The taxid column is not used by most sourmash functions and is mostly ignored, but it is needed for the kreport and bioboxes report formats.)

LCA_SqliteDatabase - a special case

The LCA_SqliteDatabase index class can serve multiple purposes: as an index of sketches (for regular search and gather); as a taxonomy database for use with the tax subcommands; and as an LCA database for use with the lca subcommands.

When used as a taxonomy database, an LCA_SqliteDatabase file contains the same SQL tables as a sqlite taxonomy database.

When used as an LCA database, an LCA_SqliteDatabase dynamically loads the taxonomic lineages from the sqlite database and applies them to the individual hashes, permitting the same kind of hash-to-lineage query capability as the LCA_Database.

Picklists

Picklists are a generic mechanism used to select a (potentially small) subset of signatures for search/display.

The general idea of picklists is that you create a list of signatures you’re interested in - by name, or identifier, or md5sum - and then supply that list in a csvfile on the command line via --picklist. For example, --picklist list.csv:colname:ident would load the values in the column named colname in the file list.csv as identifiers to be used to restrict the search.

The support picklist column types are name, ident (space-delimited identifier), identprefix (identifier with version removed), md5, md5prefix8, and md5short. Generally the md5 and derived values are used to reference signatures found some other way with sourmash, while the identifiers are more broadly useful.

There are also four special column types that can be used without a column name: gather, prefetch, search, and manifest. These take the CSV output of the respective sourmash commands as inputs for picklists, so that you can use prefetch to generate a picklist and then use that picklist with --picklist prefetch_out.csv.gz::prefetch.

Differing internal behavior

Picklists behave differently with different Index classes.

For indexed databases like SBT, LCA, and SqliteIndex, the search is done first, and then only those results that match the picklist are selected.

For linear search databases like ZipFileLinearIndex or standalone manifests, picklists are first used to subselect the desired signatures, and only those signatures are searched.

This means that picklists can dramatically speed up searches on some Index types, but won’t affect performance on others. But the results will be the same.

Taxonomy / lineage spreadsheets as picklists

Note that lineage CSV spreadsheets, as consumed by sourmash tax commands and as output by sourmash tax grep, can be used as ident picklists.

Online and streaming; and adding to collections of sketches.

One of the big challenges with Big Data is looking at it all at once - loading all your data into memory, for example, will fail with really large data sets. The ability to look at subsets of data without looking at all of it is called “streaming” (much like when you watch a streaming movie online - you can start watching the movie without downloading the whole video, and you can also usually jump to a particular location in the video without downloading the intervening bits.)

Another related challenge is analyzing data against a database that is constantly growing, either because you’re adding to it or because it’s being updated by others. For example, in genomics, often you want to repeat the same analysis you did last time but with more reference genomes. With many software packages, this requires rebuilding your indexed database, which can be challenging for large genomes. In computer science parlance, the ability to add new data at the end without performing an expensive reindexing operation is referred to as “online”.

sourmash tackles these challenges in a few different ways, and does its best to support streaming and online behavior.

First, all sourmash commands can take multiple databases and will return the same results with multiple databases as they would with a single database containing the same sketches, unless otherwise noted. This allows you to incrementally expand your sketch collections over time without building new databases. Performance may vary (i.e. if you’re using an SBT to do search, and you add an unindexed collection of sketches to the search, the search may take longer than if you’d add the new sketches to the SBT) but the results will be the same. In this sense, many of the sourmash algorithms are online.

Second, several sourmash algorithms use streaming when searching databases - in particular, prefetch will load and unload sketches as it goes, as long as the underlying collection data structure supports it (.sig.gz and LCA JSON databases do not, but zip files, SBTs, and SQLite databases do). This lets you do containment searches against really large collections without consuming large amounts of memory. Another example is the manysearch command in the pyo3_branchwater plugin, which loads and searches a limited number of metagenomes from a large collection, rather than loading the entire collection into memory - which would be impossible.

Last but not least, one of the interesting guarantees that FracMinHash sketches provide is that no hash is ever removed when sketching. This supports various types of input streaming, which we haven’t spent too much time exploring, but (for example) means that “watching” sequencing runs and/or downloads of sequencing data, and reporting interim results with certainty, is possible. If you’re interested in making use of this, please reach out!

Gather on multiple collections, and order of search and reporting

Since sourmash gather will pick only one “best match” if there are several (and will ignore the others), the order of searching can matter for large collections. How does this work?

In brief, sourmash doesn’t guarantee a particular load order for sketches in a single collection, but it does guarantee that collections are loaded and searched in their entirety in the order that you provide them. So, for example, if you have a large zipfile database of sketches that contains duplicates, you can’t predict which of the duplicates will be chosen as a match; but you can build your own collection of prioritized matches as a separate database, and put it first on the command line. A practical application of this might be to list the GTDB “representatives” database first on the command line, with the full GTDB database second, in order to prioritize choosing representative genomes as matches over the rest.

This also plays a role in the order of reporting for prefetch output - prefetch will report matching sketches in the order it encounters them, which will match the order in which collections are given to sourmash prefetch on the command line.

Formats natively understood by sourmash

sourmash should always autodetect the format of a collection or database, in most cases based on its content (and not its filename). Please file a bug report if this doesn’t work for you!

sourmash sig summarize is a good way to examine the properties of a signature collection.

Reading and writing gzipped CSV files

(As of sourmash v4.5)

When a CSV filename is specified (e.g. sourmash gather ... -o mygather.csv), you can always provide a name that ends with .gz to produce a gzip-compressed file instead. This can save quite a bit of space for prefetch results and manifests in particular!

All sourmash commands that take in a CSV (via manifest, or picklist, or taxonomy) will autodetect a gzipped CSV based on content (the file does not need to end with .gz). The one exception is manifests, where the CSV needs to end with .gz to be loaded as a gzipped CSV; see sourmash#2214 for an issue to fix this.