sourmash databases - advanced usage information.

tl;dr use zip files for heterogeneous collections of sourmash sketches, and RocksDB indexes for fast, low-memory searches at a specific ksize/moltype/scaled.

sourmash supports a variety of different mechanisms and formats for storing, organizing, indexing, and searching signatures. Some of these mechanisms, “collections”, just store the signatures; others (“indexed” databases) provide indices on the signatures for fast content-based search. Most of the mechanisms now use manifests that permit fast selection and loading of signatures based on metadata. Below we refer to “databases” generically as any on-disk storage mechanism for sourmash signatures.

Which database type is best to use depends on what you’re doing - which is what this document is about! In general, however, sourmash should be fast enough that database choice will only impact performance when searching thousands of signatures, or doing thousands of searches.

The recommended file extensions below are conventions used to signal the output format when using -o with sourmash sketch and the sourmash sig subcommands; so, for example, sourmash sketch dna *.fa -o xyz.zip will output signatures in the .zip format. Indexed formats need to be constructed separately: SBT and RocksDB indexes with sourmash index, and LCA databases with sourmash lca index.

sourmash will automatically detect and load the database, based on the database content and not the database extension, in most cases.

Unless noted otherwise, the database formats below are supported in all releases since sourmash v4.9.0.

How are signatures actually stored?

sourmash signatures are typically serialized into JSON for on-disk storage, with rare exceptions (SQLite and LCA databases). The internal sourmash code automatically detects and properly handles compressed (gzipped) JSON data.
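As an illustration of how compressed and uncompressed JSON can be handled transparently, here is a minimal sketch that checks for the two-byte gzip magic number before parsing. This is not sourmash's own loading code; the function name load_json_maybe_gzipped and the toy payload fields are assumptions made up for this example.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def load_json_maybe_gzipped(data: bytes):
    """Parse JSON from raw bytes, decompressing first if the input is gzipped."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return json.loads(data.decode("utf-8"))


# Toy payload standing in for a signature file; the field names are illustrative only.
payload = json.dumps([{"name": "example", "ksize": 31}]).encode("utf-8")

plain = load_json_maybe_gzipped(payload)
zipped = load_json_maybe_gzipped(gzip.compress(payload))
print(plain == zipped)
```

Detecting by content rather than by file extension means a gzipped file named .sig still loads correctly.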

Storing signatures in .zip files

This is our recommended format for storing collections of signatures.

Multiple signatures can be stored in a single .zip file. The best way to construct that zip file is from within sourmash, by specifying -o filename.zip when outputting signatures. Zip files created from within sourmash will automatically have manifests; this enables rapid subselection and direct loading of signatures via e.g. picklists.

Zip files are not indexed by content, so they can be slow for searching. But they are small, and provide a good compromise between disk size (small), flexibility (can store any mixture of signatures), and speed (good for gather, not good for search).

Zip file collections can contain any number of signatures, of any type (num or scaled, DNA/protein/dayhoff/hp/skipm1n3/skipm2n3).

You can create your own zip files by using sourmash sig cat ... -o <filename>.sig.zip; this will create a zip file with an internal manifest that will speed up many operations, including picklists.

RocksDB indexes.

This is our recommended format for indexing signatures for search.

RocksDB indexes are fast and low-memory on-disk inverted indexes that support massive-scale content-based search. They can be built with sourmash index -F rocksdb. RocksDB indexes are fully supported since sourmash v4.9.0.

Standalone manifests

(This format is ideal for many advanced use cases.)

Manifests are catalogs of signature metadata - name, molecule type, k-mer size, and other information - that can be used to select specific signatures for searching or processing. Typically when using manifests the actual signatures themselves are not loaded until they are needed, although the efficiency of this depends on the signature storage mechanism; for example, JSON-format containers (.sig and .lca.json files) must be entirely loaded before any signature in them can be used, unlike zip containers.

As of sourmash 4.4, manifests can be directly loaded from the command line as standalone collections. This lets manifests serve as a catalog of signatures stored in many different locations. Sketches can be selected by name, k-mer size, molecule type, and other features without loading the actual sketch data.

Standalone manifests are preferable to both directory storage and pathlists (below), because they support fast selection and direct lazy loading. This means that sourmash operations that support streaming or online search (such as prefetch and gather, among others) can avoid loading everything all at once.
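The idea of selecting on metadata first and loading sketch data only on demand can be sketched in a few lines. This is a toy model, not the sourmash manifest implementation: the CSV column names, the file locations, and the helper functions select_rows, load_sketch, and lazy_sketches are all hypothetical.

```python
import csv
import io

# Hypothetical manifest contents; the column names are assumptions for illustration.
MANIFEST_CSV = """\
location,name,ksize,moltype
a.sig.zip,genome A,31,DNA
b.sig.zip,genome B,21,DNA
c.sig.zip,proteome C,10,protein
"""


def select_rows(manifest_text, ksize=None, moltype=None):
    """Yield manifest rows matching the requested metadata; no sketch data is touched."""
    for row in csv.DictReader(io.StringIO(manifest_text)):
        if ksize is not None and int(row["ksize"]) != ksize:
            continue
        if moltype is not None and row["moltype"] != moltype:
            continue
        yield row


loaded = []  # records which locations were actually opened


def load_sketch(location):
    """Stand-in for an expensive load of sketch data from disk."""
    loaded.append(location)
    return f"<sketch from {location}>"


def lazy_sketches(manifest_text, **criteria):
    """Generator: load each selected sketch only when the caller asks for it."""
    for row in select_rows(manifest_text, **criteria):
        yield load_sketch(row["location"])


matches = lazy_sketches(MANIFEST_CSV, ksize=31, moltype="DNA")
first = next(matches)
print(first, loaded)
```

Because lazy_sketches is a generator, a streaming consumer only pays the loading cost for the sketches it actually iterates over.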

Standalone manifests are the most effective solution for managing custom collections of thousands to millions of signatures, as well as working with multiple large sketches.

They can be created with sourmash sig collect and sourmash sig check (sourmash v4.4 and later).

Sourmash supports two manifest file formats - CSV and SQLite. SQLite manifests are much faster and lower-memory than CSV manifests.

Storing JSON in .sig and .sig.gz files: the original format.

(This format is not recommended. Use zip files instead.)

Multiple signatures can be stored in a single JSON file. However, this file will be loaded in its entirety by sourmash, even if you select only one signature for later analysis.

This is the least efficient way to store multiple signatures, because all of the JSON must be loaded before any signature can be selected or searched. But it is the oldest format and so a lot of our documentation describes it!

Storing signatures in SQLite databases

(This format is not recommended any more; use zip files.)

As of sourmash 4.4, we support storing signatures directly in a SQLite database (-o .sqldb). This is a fast, low-memory, on-disk format that is suitable for use with search and can support multiple simultaneous queries. However, the resulting file is also rather large, so we do not distribute databases in this format.

SQLite databases are implemented as an inverted index, with hashes stored directly in a table.
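To make the inverted-index idea concrete, here is a small sketch using Python's built-in sqlite3 module. The table name hashes, its columns, and the count_overlaps helper are hypothetical and chosen for illustration; they are not the schema sourmash uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (hash, sketch) pair, indexed by hash value.
conn.execute("CREATE TABLE hashes (hashval INTEGER NOT NULL, sketch_id INTEGER NOT NULL)")
conn.execute("CREATE INDEX hashes_by_value ON hashes (hashval)")

# Three toy sketches, each a small list of hash values.
sketches = {1: [10, 20, 30], 2: [20, 30, 40], 3: [50]}
conn.executemany(
    "INSERT INTO hashes (hashval, sketch_id) VALUES (?, ?)",
    [(h, sid) for sid, hashes in sketches.items() for h in hashes],
)


def count_overlaps(conn, query_hashes):
    """Return {sketch_id: number of hashes shared with the query}."""
    placeholders = ",".join("?" for _ in query_hashes)
    rows = conn.execute(
        "SELECT sketch_id, COUNT(*) FROM hashes "
        f"WHERE hashval IN ({placeholders}) GROUP BY sketch_id",
        list(query_hashes),
    )
    return dict(rows)


overlaps = count_overlaps(conn, [20, 30, 99])
print(overlaps)
```

The lookup goes from hash to sketches rather than from sketch to hashes, so a query only touches rows for hashes it actually contains.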

SQLite databases are limited to scaled signatures, and can only contain sketches with the same scaled value across the entire database. They can store multiple molecule types.

We do not recommend using SQLite databases for storing signatures, although they are still fully supported.

Other Indexed collections - SBTs and LCAs.

(These formats are not recommended any more, although they are still supported; use RocksDB indexes instead.)

We provide two other indexed collection formats, Sequence Bloom Trees (SBTs) and LCA databases.

SBTs implement our version of Sequence Bloom Trees, a fast tree-based index that supports rapid search for matches; they are particularly effective when searching for best matches across large databases. They are relatively low memory and typically about twice the size of .zip files on disk. They can be constructed with sourmash index.

LCA databases are inverted indices that support individual hash lookup. They provide fast search and gather, and also support all of the sourmash lca subcommands for hash-based taxonomic analysis. There are two LCA database formats, JSON and SQLite; JSON is small on disk but JSON LCA databases consume a lot of memory when loaded, while SQLite LCA databases are large on disk but low-memory and fast. JSON LCA databases do not support multiprocess queries. LCA databases can be constructed with sourmash lca index.

Both SBTs and LCA databases can only store homogeneous collections of signature types - all signatures must have the same molecule type and scaled or num value. Furthermore, LCA databases can only store scaled signatures.

We no longer recommend SBT and LCA databases. As of sourmash v4.9.0, sourmash supports RocksDB indexes, which are much faster and lower memory. See the index documentation. The taxonomic functionality of LCAs is also no longer recommended; use sourmash tax instead.

Directories

(No longer recommended. Use zip files or standalone manifests instead.)

Directory hierarchies of signatures are read natively by sourmash, and can be created or extended by specifying -o dirname/ (with a trailing slash).

To read from a directory, specify the directory name on the sourmash command line. When reading from directories, the entire directory hierarchy is traversed and all .sig and .sig.gz files are loaded as signatures. If --force is specified, all files will be read, and failures will be ignored.
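The traversal described above can be approximated as follows. This is a sketch under stated assumptions, not the sourmash implementation: find_signature_files is a hypothetical helper, and its force flag only models the "read all files" half of --force, not the ignoring of load failures.

```python
import os
import tempfile

SIG_SUFFIXES = (".sig", ".sig.gz")


def find_signature_files(root, force=False):
    """Walk `root` and return candidate signature files, sorted for stable output."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # With force=True every file is a candidate, whatever its extension.
            if force or name.endswith(SIG_SUFFIXES):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


# Build a throwaway directory tree to demonstrate the two modes.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.sig", os.path.join("sub", "b.sig.gz"), "notes.txt"):
        open(os.path.join(root, rel), "w").close()
    default_names = [os.path.relpath(p, root) for p in find_signature_files(root)]
    forced_names = [os.path.relpath(p, root) for p in find_signature_files(root, force=True)]

print(default_names)
print(forced_names)
```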

When directories are specified as outputs, the signatures will be saved by their complete md5sum underneath the directory.

We don’t recommend loading signatures from directory hierarchies, since the implementation is not particularly memory efficient and most of the use cases for directories are now covered by other approaches - in particular, standalone manifests.

Pathlists

(No longer recommended. Use zip files or standalone manifests instead.)

Pathlists are text files containing paths to one or more sourmash databases; any type of sourmash-readable collection can be listed.

The paths in pathlists can be relative or absolute within the file system. If they are relative, they must resolve with respect to the current working directory of the sourmash command.
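The resolution rule for relative paths can be sketched like this. The read_pathlist helper is hypothetical, the example paths are made up, and skipping blank lines is an assumption of this sketch rather than documented sourmash behavior. The posixpath module is used so the result is the same on every platform.

```python
import os
import posixpath


def read_pathlist(text, cwd=None):
    """Return one normalized path per non-empty line of a pathlist."""
    base = cwd if cwd is not None else os.getcwd()
    paths = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # assumption of this sketch: blank lines are skipped
        # posixpath.join discards `base` when `entry` is already absolute.
        paths.append(posixpath.normpath(posixpath.join(base, entry)))
    return paths


# One relative and one absolute entry, resolved against a made-up working directory.
example = "dbs/genomes.zip\n/data/shared/index.rocksdb\n\n"
resolved = read_pathlist(example, cwd="/home/user/project")
print(resolved)
```

Note that the relative entry is resolved against the working directory, not against the location of the pathlist file itself, so the same pathlist can point at different files depending on where the command is run.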

We don’t recommend using pathlists, since the original use cases are now supported with picklists and standalone manifests, but they are still supported. Loading sketches from pathlists is also not very efficient.

Pathlists are not output by any sourmash commands.

Many commands support --query-from-file or --from-file as a way to pass in a file containing many paths to sketches or collections. Internally, sourmash simply adds these paths to the command-line arguments. This is an effective and efficient way to provide long lists of files to commands such as sig check and sig collect, which create standalone manifests that support efficient lazy loading.

Storing taxonomies

sourmash supports taxonomic information output via the sourmash lca and sourmash tax subcommands. Both sets of commands rely on the same 7 taxonomic ranks: superkingdom, phylum, class, order, family, genus, and species (with limited support for a ‘strain’ rank). And both sets of subcommands take lineage spreadsheets that link specific identifiers to taxonomic lineages.

Lineage spreadsheets can be provided in two on-disk formats, CSV and SQLite.

CSV is the original format, and consists of separate columns for identifier and each taxonomic rank.

SQLite taxonomy databases are typically built from CSV using sourmash tax prepare. They contain a single table, sourmash_taxonomy, with columns for ident and each taxonomic rank. Only the sourmash tax command supports SQLite taxonomy databases.
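Since the taxonomy table is plain SQLite, it can be queried from other scripts. The sketch below builds a stand-in table in memory rather than opening a real database. The table name sourmash_taxonomy and the ident column come from the description above; the exact rank column names, the TEXT column types, the identifier example_ident_1, and the lineage_for helper are assumptions for illustration.

```python
import sqlite3

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
COLUMNS = ["ident"] + RANKS

conn = sqlite3.connect(":memory:")
# "order" is an SQL keyword, so every column name is double-quoted.
column_defs = ", ".join(f'"{name}" TEXT' for name in COLUMNS)
conn.execute(f"CREATE TABLE sourmash_taxonomy ({column_defs})")

# Illustrative row with a made-up identifier.
row = (
    "example_ident_1", "Bacteria", "Proteobacteria", "Gammaproteobacteria",
    "Enterobacterales", "Enterobacteriaceae", "Escherichia", "Escherichia coli",
)
placeholders = ", ".join("?" for _ in row)
conn.execute(f"INSERT INTO sourmash_taxonomy VALUES ({placeholders})", row)


def lineage_for(conn, ident):
    """Return {column: value} for one identifier, or None if it is absent."""
    cur = conn.execute('SELECT * FROM sourmash_taxonomy WHERE "ident" = ?', (ident,))
    found = cur.fetchone()
    if found is None:
        return None
    return dict(zip(COLUMNS, found))


print(lineage_for(conn, "example_ident_1"))
```

When querying a real database, it is worth inspecting the actual column list first, for example with PRAGMA table_info, instead of relying on the names assumed here.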

Appendix: SQLite complexities

The SQLite implementation of signature storage, metadata manifests, and LCA databases is all bundled into a single SQLite database. Because of this, sourmash must examine the database tables to decide what kind of sourmash structure the database is - the logic is roughly this:

  • does the database store both sketch information and taxonomy information? It’s an LCA database!

  • if it has sketch information but no taxonomy information, it’s just a regular index.

  • if it only has manifest information, it’s a manifest!

  • if it only has taxonomy information, it’s a taxonomy!
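The decision rules above can be written down as a small function. This is only a restatement of the list, not the sourmash detection code: classify_sqlite_contents and its three boolean inputs are hypothetical, and combinations the list does not cover are reported as "unknown".

```python
def classify_sqlite_contents(has_sketches, has_taxonomy, has_manifest):
    """Map which kinds of information are present to a database type."""
    if has_sketches and has_taxonomy:
        return "LCA database"
    if has_sketches:
        # Sketch databases may also carry manifest information; they still count as an index.
        return "index"
    if has_manifest and not has_taxonomy:
        return "manifest"
    if has_taxonomy and not has_manifest:
        return "taxonomy"
    return "unknown"  # combinations the list above does not describe


print(classify_sqlite_contents(True, True, True))
print(classify_sqlite_contents(False, False, True))
```

The order of the checks matters: sketch information is tested first so that a database with sketches is never mistaken for a bare manifest.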

This is complicated by several other details -

  • we can treat SQLite databases with sketch information as read-only manifests, but because the sketch information is tightly coupled to the manifest table, we cannot insert new manifest entries;

  • we can treat SQLite databases with sketch information as read/write taxonomy files, since the taxonomy information is not tightly coupled to the sketches.

Last but not least, the hashes in SQLite are stored as signed 64-bit integers and must be converted to unsigned 64-bit numbers internally by sourmash; negative numbers in the SQLite table represent unsigned ints that are larger than 2**63 - 1. Please see this blog post for more information.
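The signed/unsigned conversion is ordinary two's-complement arithmetic and can be sketched as below. The function names to_signed and to_unsigned are made up for this example and are not part of the sourmash API.

```python
MASK_64 = (1 << 64) - 1      # largest unsigned 64-bit value
MAX_SIGNED = (1 << 63) - 1   # largest signed 64-bit value
MIN_SIGNED = -(1 << 63)      # smallest signed 64-bit value


def to_signed(value):
    """Convert an unsigned 64-bit hash to the signed form SQLite can store."""
    if not 0 <= value <= MASK_64:
        raise ValueError("not an unsigned 64-bit integer")
    # Values above 2**63 - 1 wrap around to negative numbers.
    return value - (1 << 64) if value > MAX_SIGNED else value


def to_unsigned(value):
    """Convert a signed 64-bit integer read from SQLite back to an unsigned hash."""
    if not MIN_SIGNED <= value <= MAX_SIGNED:
        raise ValueError("not a signed 64-bit integer")
    return value & MASK_64


big_hash = 2**63 + 5
stored = to_signed(big_hash)
print(stored < 0, to_unsigned(stored) == big_hash)
```

Hashes at or below 2**63 - 1 are stored unchanged, so only the upper half of the hash space appears as negative numbers in the table.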

The SQLite schema itself is not very complicated and can be used for lineage and manifest querying by other scripts. However, we recommend doing hash value querying/search via the Python code.