sourmash databases - advanced usage information.

sourmash uses a variety of different mechanisms and formats for storing, organizing, and searching signatures. Some of these mechanisms, “collections”, just store the signatures; others (“indexed” databases) provide indices on the signatures for fast content-based search. Most of the mechanisms now use manifests that permit fast selection and loading of signatures based on metadata. Below we refer to “databases” generically as any on-disk storage mechanism for sourmash signatures.

Which database type is best to use depends on what you’re doing - which is what this document is about! In general, however, sourmash should be fast enough that database choice will only impact performance when searching thousands of signatures, or doing thousands of searches.

The recommended file extensions below are conventions used to signal the output format when using -o with sourmash sketch and the sourmash sig subcommands; for example, running sourmash sketch dna *.fa with -o pointing at a filename that ends in .zip will output signatures in the .zip format.

In most cases, sourmash automatically detects the database type from the database content, not the file extension, and loads it accordingly.

Unless noted otherwise, the database formats below are supported in all releases since sourmash v3.5.

How are signatures actually stored?

sourmash signatures are typically serialized into JSON for on-disk storage, with rare exceptions (SQLite and LCA databases). The internal sourmash code automatically detects and properly handles compressed (gzipped) JSON data.
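As a rough illustration of content-based detection of compressed JSON, the sketch below checks for the two-byte gzip magic number before parsing. This is not sourmash's own loading code; the function name `load_json_maybe_gzipped` and the record fields are hypothetical and exist only for this example.

```python
import gzip
import json

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def load_json_maybe_gzipped(data: bytes):
    """Parse JSON from bytes, decompressing first if the data looks gzipped."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return json.loads(data.decode("utf-8"))


# Hypothetical record layout, used only to exercise the function.
records = [{"name": "example", "signatures": []}]
raw = json.dumps(records).encode("utf-8")

plain_result = load_json_maybe_gzipped(raw)
gzipped_result = load_json_maybe_gzipped(gzip.compress(raw))
```

Because the decision is made from the leading bytes, the same function handles a plain file and a gzipped one regardless of what the file is called.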

Storing JSON in .sig and .sig.gz files: the original format.

Multiple signatures can be stored in a single JSON file. However, this file will be loaded in its entirety by sourmash, even if you only select one for later analysis.

This is the least efficient way to store multiple signatures, because all of the JSON must be loaded before any signature can be selected or searched. But it is the oldest format and so a lot of our documentation describes it!

Storing taxonomies

sourmash supports taxonomic information output via the sourmash lca and sourmash tax subcommands. Both sets of commands rely on the same 7 taxonomic ranks: superkingdom, phylum, class, order, family, genus, and species (with limited support for a ‘strain’ rank). And both sets of subcommands take lineage spreadsheets that link specific identifiers to taxonomic lineages.

Lineage spreadsheets can be provided in two on-disk formats, CSV and SQLite.

CSV is the original format, and consists of separate columns for identifier and each taxonomic rank.
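To make the CSV layout concrete, here is a minimal sketch that reads such a spreadsheet with Python's standard csv module. It assumes the identifier column is named `ident` and that the rank columns use the rank names listed above; the row itself (identifier `ID0001` and its lineage) is illustrative only, and `read_lineages` is a hypothetical helper, not part of sourmash.

```python
import csv
import io

# The seven ranks named in this document, in order.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

# Illustrative data only; the column name "ident" is an assumption.
example_csv = (
    "ident,superkingdom,phylum,class,order,family,genus,species\n"
    "ID0001,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,"
    "Enterobacteriaceae,Escherichia,Escherichia coli\n"
)


def read_lineages(fp):
    """Map each identifier to a tuple of (rank, name) pairs."""
    lineages = {}
    for row in csv.DictReader(fp):
        lineages[row["ident"]] = tuple((rank, row.get(rank, "")) for rank in RANKS)
    return lineages


lineages = read_lineages(io.StringIO(example_csv))
```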

SQLite taxonomy databases are typically built from CSV using sourmash tax prepare. They contain a single table, sourmash_taxonomy, with columns for ident and each taxonomic rank. Only the sourmash tax command supports SQLite taxonomy databases.
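The following sketch shows how a script might query a table shaped like the one described above, using Python's built-in sqlite3 module. The table name sourmash_taxonomy and the ident and rank columns come from the description; the column types, the in-memory database, the sample row, and the `lookup_lineage` helper are assumptions made for illustration, and a real database built by sourmash tax prepare may differ in detail.

```python
import sqlite3

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

# Build a stand-in database in memory. Column names are double-quoted
# because "order" is a reserved word in SQL.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f'"{name}" TEXT' for name in ["ident"] + RANKS)
conn.execute(f"CREATE TABLE sourmash_taxonomy ({columns})")
conn.execute(
    "INSERT INTO sourmash_taxonomy VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("ID0001", "Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia", "Escherichia coli"),
)


def lookup_lineage(conn, ident):
    """Return a rank -> name dict for one identifier, or None if absent."""
    select_cols = ", ".join(f'"{rank}"' for rank in RANKS)
    row = conn.execute(
        f"SELECT {select_cols} FROM sourmash_taxonomy WHERE ident = ?", (ident,)
    ).fetchone()
    return None if row is None else dict(zip(RANKS, row))


lineage = lookup_lineage(conn, "ID0001")
```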

Appendix: SQLite complexities

The SQLite implementations of signature storage, metadata manifests, and LCA databases are all bundled into a single SQLite database. Because of this, sourmash must examine the database tables to decide what kind of sourmash structure the database is. The logic is roughly this:

  • does the database store both sketch information and taxonomy information? It’s an LCA database!

  • if it has sketch information but no taxonomy information, it’s just a regular index.

  • if it only has manifest information, it’s a manifest!

  • if it only has taxonomy information, it’s a taxonomy!
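The decision logic above can be sketched as a small function that lists the tables in a SQLite file and checks which kinds of information are present. This is an illustration of the described rules, not sourmash's implementation: only sourmash_taxonomy is a table name taken from this document, while `example_sketches` and `example_manifest` are placeholder names invented for the example.

```python
import sqlite3

SKETCH_TABLE = "example_sketches"     # hypothetical placeholder name
MANIFEST_TABLE = "example_manifest"   # hypothetical placeholder name
TAXONOMY_TABLE = "sourmash_taxonomy"  # table name given in this document


def classify(conn):
    """Guess the kind of structure from which tables exist."""
    tables = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    has_sketches = SKETCH_TABLE in tables
    has_manifest = MANIFEST_TABLE in tables
    has_taxonomy = TAXONOMY_TABLE in tables

    if has_sketches and has_taxonomy:
        return "LCA database"
    if has_sketches:
        return "index"
    if has_manifest:
        return "manifest"
    if has_taxonomy:
        return "taxonomy"
    return "unknown"


def make_db(*tables):
    """Create an in-memory database containing the named (empty) tables."""
    conn = sqlite3.connect(":memory:")
    for table in tables:
        conn.execute(f'CREATE TABLE "{table}" (id INTEGER)')
    return conn
```

The checks run from most specific to least specific, so a database holding sketches together with a manifest is still reported as an index rather than a manifest.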

This is complicated by several other details:

  • we can treat SQLite databases with sketch information as read-only manifests, but because the sketch information is tightly coupled to the manifest table, we cannot insert new manifest entries;

  • we can treat SQLite databases with sketch information as read/write taxonomy files, since the taxonomy information is not tightly coupled to the sketches.

Last but not least, the hashes in SQLite are stored as signed 64-bit integers and must be converted to unsigned 64-bit numbers internally by sourmash; negative numbers in the SQLite table represent unsigned ints that are larger than 2**63 - 1. Please see this blog post for more information.
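The conversion between the two representations can be sketched with plain integer arithmetic. The version below assumes a two's-complement reinterpretation, which matches the description above (negative stored values stand for unsigned values larger than 2**63 - 1); the function names `to_sqlite` and `from_sqlite` are hypothetical and are not sourmash's API.

```python
MAX_SIGNED = 2**63 - 1   # largest value a signed 64-bit integer can hold
TWO_64 = 2**64


def to_sqlite(hashval):
    """Map an unsigned 64-bit hash to the signed value stored in SQLite."""
    if not 0 <= hashval < TWO_64:
        raise ValueError("hash value does not fit in 64 bits")
    return hashval - TWO_64 if hashval > MAX_SIGNED else hashval


def from_sqlite(stored):
    """Map a signed 64-bit value read from SQLite back to the unsigned hash."""
    return stored + TWO_64 if stored < 0 else stored


largest_hash = TWO_64 - 1
stored_value = to_sqlite(largest_hash)      # -1
round_tripped = from_sqlite(stored_value)   # 2**64 - 1 again
```

Values up to 2**63 - 1 are stored unchanged, so only the upper half of the hash range is affected by the conversion.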

The SQLite schema itself is not very complicated and can be used for lineage and manifest querying by other scripts. However, we recommend doing hash value querying/search via the Python code.