A guide to the internal design and structure of sourmash¶
sourmash was created in 2015, and has been repeatedly reorganized, refactored, and optimized to support ever larger databases, faster queries, and new use cases. We’ve also regularly added new functionality and features. So sourmash can be pretty complicated internally, and our user-facing documentation only covers a fraction of its potential!
This document is a brain dump intended for expert users and sourmash developers who want to understand how, why, and when to use various sourmash features. It is unlikely ever to be comprehensive, so the information you are interested in may not yet exist in this document, but we are always happy to add to it - just ask in an issue!
Signatures and sketches¶
sourmash operates on sketches. Each sketch is a collection of hashes, which are in turn built from k-mers by applying a hash function (currently always murmurhash) and a filtering function. Each sketch is contained in a signature wrapper that contains some metadata.
Internally, sketches (class MinHash
) contain the following information:
a set of hashes;
an optional abundance for each hash (when
track_abund
is True);a seed;
a k-mer size;
a molecule type;
either a
num
(for MinHash) or ascaled
value (for FracMinHash);
Signature objects (class SourmashSignature
) contain a sketch (property .minhash
) as well as additional information:
an optional
name
an optional
filename
a license (currently must be CC0);
an
md5sum(...)
method that returns a hash of the sketch.
For now, we tend to refer to signatures and sketches interchangeably, because they are almost entirely 1:1 in the code base (but see sourmash#616).
The default signature interchange/serialization format is JSON, optionally gzipped. This format is written and read by Rust code.
In general, a lot of effort in sourmash is spent managing collections of
signatures before actually doing comparisons with them; see manifests,
and Index
objects, below.
Making sketches¶
Sketches are produced by hashing k-mers with murmurhash and then
keeping either the lowest num
hashes (for MinHashes sketches) or
keeping all hashes below 2**64 / scaled
(for FracMinHash sketches).
This has the effect of selecting approximately one hash for every
scaled
k-mers - so, when sketching a set of 100,000 distinct k-mers,
a scaled value of 1,000 would yield approximately 100 hashes to be
retained in the sketch.
The default MinHash sketches use parameters so that they are compatible with mash sketches.
See utils/compute-dna-mh-another-way.py for details on how k-mers are hashed.
Note that if hashes are produced some other way (with a different hash
function) or from some source other than DNA, sourmash can still work
with them; only sourmash sketch
actually cares about DNA sequences,
everything else works with hashes.
Compatibility checking¶
The point of the signatures and sketches is to enable certain kinds of rapid comparisons - Jaccard similarity and number of overlapping k-mers, specifically. However, these comparisons can only be done between compatible sketches.
Here, “compatible” means -
the same MurmurHash seed (default 42);
the same k-mer size/ksize (see k-mer sizes, below);
the same molecule type (see molecule types, below);
the same
num
orscaled
(although see this downsampling discussion, and the next two sections);
sourmash uses selectors (Index.select(...)
) to select sketches with
compatible ksizes, molecule types, and sketch types.
Scaled (FracMinHash) sketches support similarity and containment¶
Per our discussion in Irber et al., 2022, FracMinHash sketches can always be compared by downsampling to the max of the two scaled values. (This is not always indexed collections of sketches, e.g. SBTs; see sourmash#1799.)
In practice, sourmash does all necessary downsampling dynamically, but
returns the original sketches. This means that (for example) you can
do a low-resolution/high-scaled search quickly by specifying a
high scaled
value, and then use a higher resolution comparison with
the resulting matches for a more refined and accurate analysis (see
below, Speeding up gather and search.)
Num (MinHash) sketches support Jaccard similarity¶
“Regular” MinHash (or “num MinHash”) sketches are implemented the same way as in mash. However, they are less well supported in sourmash, because they don’t offer the same opportunities for metagenome analysis. (See also sourmash#1354.)
Num MinHash sketches can always be compared by downsampling to a
common num
value. This may need to be done manually using sourmash sig downsample
, however.
Conversion between Scaled (FracMinHash) and Num (MinHash) signatures with downsample
¶
As discussed in the previous sections, it is possible to adjust the scaled
and num
values to compare two FracMinHash signatures or two Num MinHash signatures. However, it is also possible to covert between the scaled
and num
signatures with the sourmash sig downsample
command. For more details, review the command line docs for sig downsample
.
Operations you can do safely with FracMinHash sketches¶
As described in Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, FracMinHash sketches support a wide range of operations that mirror actions taken on the underlying data set without revisiting the underlying data. This allows users to build sketches once (requiring the original data) and then do all sorts of manipulations on the sketches, and know that the results of the sketch manipulations represent what would happen if they did the same thing on the original data. For example,
set unions, intersections, and subtractions all perform the same when done on the sketches as when applied to the underling data. So, for example, you can sketch two files separately and merge the sketches (with
sig merge
), and get the same result as if you’d concatenated the files first and then sketched them.if you filter hashes on abundance with
sig filter
, you get the same result as if you filtered the data set on k-mer abundance and then sketched it.downsampling: you can sketch the original data set at a high resolution (e.g. scaled=100) and then downsample it later (to e.g. scaled=1000), and get the same result as if you’d sketched the data set at scaled=1000.
K-mer sizes¶
There is no explicit restriction on k-mer sizes built into sourmash.
For highly specific genome and metagenome comparisons, we typically use k=21, k=31, or k=51. For a longer discussion, see Assembly Free Analysis with K-mers from STAMPS 2022 and a more general overview at Using sourmash:a practical guide.
Molecule types - DNA, protein, Dayhoff, and hydrophobic-polar¶
sourmash supports four different sequence encodings, which we refer to
as “molecule”: DNA (--dna
), protein (--protein
), Dayhoff encoding,
(--dayhoff
), and hydrophobic-polar (--hp
).
All FracMinHash sketches have exactly one molecule type, and can only be compared to the same molecule type (and ksize).
DNA moltype sketches can be constructed from DNA input sequences using
sourmash sketch dna
.
Protein, Dayhoff, and HP moltype sketches can be constructed from
protein input sequences using sourmash sketch protein
, or from DNA
input sequences using sourmash sketch translate
; translate
will
translate in all six reading frames (see also
orpheum from
Botvinnik et al., 2021).
By default protein sketches will be created; dayhoff sketches can be
created by including dayhoff
in the param string, e.g. sourmash sketch protein -p dayhoff
, and hydrophobic-polar sketches can be
built with hp
in the param string, e.g. sourmash sketch protein -p hp
.
Manifests¶
Manifests are catalogs of sketches: they include most of the information about a sketch, except for the actual hashes. The idea of manifests is that you can use them to identify which sketches you are interested in before actually working with them (loading them, for example).
sourmash makes extensive use of signature manifests to support rapid selection and lazy loading of signatures based on signature metadata (name, ksize, moltype, etc.) See Blog post: Scaling sourmash to millions of samples for some of the motivation.
Manifests are an internal format that is not meant to be particularly human readable, but the CSV format can be loaded into a spreadsheet program if you’re curious :).
If a collection of signatures is in a zipfile (.zip
) or SBT zipfile (.sbt.zip
), manifests must
be named SOURMASH-MANIFEST.csv
. They can also be stored directly on
disk in CSV/gzipped CSV, or in a sqlite database; see
sourmash sig manifest
, sourmash sig check
, and sourmash sig collect
for manifest creation, management, and export utilities.
Where signatures are stored individually in Index
collections,
e.g. as separate files in a zipfile, manifests may be stored alongside
them; for other subclasses of Index
such as the inverted indices,
manifests are generated dynamically by the class itself.
Currently (sourmash 4.x) manifests do not contain information about the hash seed or sketch license. This will be fixed in the future - see sourmash#1849.
Manifests are very flexible and, especially when stored in a sqlite
database, can be extremely performant for organizing hundreds of
thousands to millions of sketches. Please see StandaloneManifestIndex
for a lazy-loading Index
class that supports such massive-scale
organization.
Index implementations¶
The Index
class and its various subclasses (in sourmash.index
) are
containers that provide an API for organizing, selecting, and
searching (potentially) large numbers of signatures.
sourmash sig summarize
is a good way to determine what type of Index
class is used to handle a collection.
Loading and saving of Index
objects is handled separately from the
class: loading can be done in Python via the
sourmash.load_file_as_index(...)
method, while creation and/or
updating of Index
objects is done via
sourmash.sourmash_args.SaveSignaturesToLocation(...)
. These are the
same APIs used by the command-line functionality.
There are quite a few different Index
subclasses and they all have
distinct features. We have a high-level guide to which collection
type to use here.
Conceptually, Index
classes are either organized around storing
individual signatures often with metadata that permits loading,
selecting, and/or searching them more efficiently
(e.g. ZipFileLinearIndex
and SBTs
); or they store signatures
as inverted indices (LCA_Database
and SqliteIndex
) that permit
certain kinds of fast queries.
Unless otherwise noted, the Index
classes below can be loaded
concurrently in “read only” mode - that is, you should build the
collection once, and then use it from multiple processes. We
currently do not test for or support concurrent read/write. Note also
that (generally speaking) memory footprints will be additive, so
loading the same LCA_Database
twice will consume twice the memory.
(If you’re interested in concurrency, we suggest using the sqlite
containers - see SqliteIndex
.)
In-memory storage and search.¶
The simplest way to handle collections of signatures is to load them into memory, but it is also the least performant and most memory intensive mechanism!
LinearIndex
and MultiIndex
both support sketches loaded from
JSON files; both will load the sketches once and then keep them in
memory. LinearIndex
does not use manifests while MultiIndex
builds
a manifest as it loads the sketches.
Note that MultiIndex
is the class used to load signatures from
pathlists, directory hierarchies, and so on; because it stores
sketches in memory, this can incur a significant memory penalty (see
sourmash#1899). Therefore where possible we suggest building a standalone
manifest (StandaloneManifestIndex
) to do lazy loading from the disk
instead; you can use sourmash sig collect
to do this.
Zipfile collections¶
ZipFileLinearIndex
stores signature files in a zip file with an
accompanying manifest. This is the most versatile and compressed
option for working with large collections of sketches - it supports
rapid selection and loading of specific sketches from disk, and can
store and search any mixture of sketch types (ksize, molecule type,
scaled values, etc.)
By default, ZipFileLinearIndex
stores one signature (equiv. one
sketch) in each member file in the zip archive. Each signature is
stored uncompressed. The accompanying manifest stores the full member
file path in internal_location
, so that sketches can be retrieved
directly.
Searching a ZipFileLinearIndex
is done linearly, as implied by the
name. This is fine for gather
but if you are doing repeated queries
with search
you may want to use an SBT or LCA database instead; see
below.
In the future we expect to parallelize searching ZipFileLinearIndex
files in Rust; see sourmash#1752.
ZipFileLinearIndex
does support zip files without manifests as well
as multiple signatures in a single file; this was originally intended
to support simply zipping entire directory hierarchies into a zipfile.
However, this slows down performance and is not recommended. If you
have an existing zipfile (or really any collection of signatures) and
you want to turn them into a proper ZipFileLinearIndex
, you can use
sig cat <collection(s)> -o combined.zip
to create a
ZipFileLinearIndex
file named combined.zip
that will have a
manifest and signatures broken out into individual files.
Sequence Bloom Trees (SBTs)¶
Sequence Bloom Trees (SBTs; see the Kingsford Lab page for details) provide a faster (but more memory intensive) on-disk storage and search mechanism. In brief, SBTs implement a binary tree organization of an arbitrary number of signatures; each internal node is a Bloom filter containing all of the hashes for the nodes below them. This permits potentially rapid elimination of irrelevant nodes on search.
SBTs are restricted to storing and searching sketches with the same/single k-mer size and molecule type, as well as either a single num value or a single scaled value.
We suggest using SBTs when you are doing multiple Jaccard search or
containment searches with genomes via sourmash search
.
Lowest common ancestor (LCA) databases¶
The LCA_Database
index class stores signatures in an inverted index,
where a Python dictionary is used to link individual hashes back to
signatures and/or taxonomic lineages. This supports the individualized
hash analyses used in the lca
submodule.
LCA databases only support a single ksize, moltype, and scaled. They can only be used with FracMinHash (scaled) sketches.
The default LCA_Database
class is serialized via JSON, and loads
everything into memory when requested. The load time incurs a
significant latency penalty when used from the command line, as well
as having a potentially large memory footprint; this makes it
difficult to use the default LCA_Database
for very large databases,
e.g. genbank bacteria.
The newer LCA_SqliteDatabase
(based on SqliteIndex
, described
below) also supports LCA-style queries, and is stored on disk, is fast to
load, and uses very little memory. The tradeoff is that the underlying
sqlite database can be quite large. LCA_SqliteDatabase
should also
support rapid concurrent access (see sourmash#909).
Both types of LCA database can be constructed with sourmash lca index
.
SqliteIndex¶
The SqliteIndex
storage class uses sqlite3 to store hashes and
sketch information for search and retrieval; see
this blog post
for background information and details. These are fast, low-memory,
on-disk databases, with the tradeoff that they can be quite large.
This is probably currently the best solution for concurrent access to
sketches via e.g. a Web server (see also sourmash#909).
SqliteIndex
can only contain FracMinHash sketches and can only store
sketches with the same scaled parameter. However, it can store
multiple ksizes and moltypes as long as the same scaled is used.
SqliteIndex
objects can be constructed using sourmash sig cat ... -o filename.sqldb
.
Standalone manifests¶
The StandaloneManifestIndex
class loads standalone manifests generated
by sourmash sig collect
. They support rapid selection and lazy loading
on potentially extremely large collections of signatures.
The underlying mechanism uses the internal_location
field of
manifests to point to the container file. When particular sketches are
requested, the container file is loaded into an Index
object with
sourmash.load_file_as_index
and the md5
values of the requested
sketches are used as a picklist to retrieve the desired signatures.
Thus, while standalone manifests can point at any kind of container,
including JSON files or LCA databases, they are most efficient when
internal_location
points at a file with either a single sketch in
it, or a manifest that supports direct loading of sketches. Therefore,
we suggest using standalone manifest indices. Note that sourmash
interprets paths to locations in standalone manifests relative to the
manifest filename; see the --relpath
behavior in sig check
and
sig collect
to output manifests that deal with relative filenames
properly.
Note that searching a standalone manifest is currently done through a
linear iteration, and does not use any features of indexed containers
such as SBTs or LCAs. This is fine for gather
with the default
approach, but is probably suboptimal for a search
.
Pathlists and --from-file
¶
All (or most) sourmash commands natively support taking in lists of
signature collections via pathlists, --from-file
, or paths to
directories. This is useful for situations where you have thousands of
signature files and don’t want to provide them explicitly on the
command line; you can simply put a list of the files in a text file,
and pass it in directly (or use --from-file
to pass it in).
Both pathlists and files passed to --from-file
contain a list of
paths to be loaded; relatives paths will be interpreted relative to
the current working directory of sourmash. Pathlists should be
universally available on sourmash commands. When --from-file
is
available for a command, sourmash will behave as if the file paths in
the file were provided on the command line.
We suggest avoiding pathlists. Instead, we suggest using --from-file
or a standalone manifest index (generated with sourmash sig collect
). This is because the signatures from pathlists are loaded
into memory (see MultiIndex
, above) it is generally a bad idea to
use them - they may be slow to load and may consume a lot of
memory. They also do not support good loading error messages;
see sourmash#1414.
Extensions for outputting index classes¶
Most commands that support saving signatures will save them in a variety of formats, based on the extension provided (see sourmash#1890 for exceptions). The supported extensions are -
.zip
forZipFileLinearIndex
.sqldb
forSqliteIndex
.sig
or.sig.gz
for JSON/gzipped JSONdirname/
to save in a directory hierarchy
The default signature save format is JSON, if the extension is not recognized.
Speeding up gather
and search
¶
There are two primary search commands in sourmash: gather
and
search
.
gather
calculates a minimum metagenome cover as discussed in Irber et al., 2022. It
is mostly intended for querying a database with a metagenome, although
it can be used with genome queries, as well. This approach relies on overlaps
between genomes and metagenomes and can only be used with FracMinHash sketches.
search
does a straight Jaccard similarity search on MinHash and
FracMinHash sketches (or, with --containment
, a containment search
on FracMinHash sketches). It is typically used to find matches to a
query genome sketch in a large database of sketches.
The prefetch
command does a containment search and is intended for
power users; it is a standalone implementation of the prefetch
algorithm discussed below for gather
. It only works with FracMinHash
sketches.
Note that all of these commands work with any and all Index
collection/container types, and will return the same results however
the collections are organized - see the “online behavior” section,
below. In practice this means that you can provide additional
collections of signatures via the command line without building a
combined index of all your signatures. It also means that the only
reason to choose different collections/containers is for
optimization - you should select the containers that help you achieve
the desired performance characteristics for your search
(i.e. the right memory/time/disk space tradeoffs).
Running search
many times on the same database¶
search
typically is used to search a large database of sketches for
all similarity or containment matches above a threshold. Depending on
the query and the database, certain kinds of database indices may make
search much faster, especially when only a few matches are expected.
If you are doing many searches against the same database, indexing the
database as an SBT (with sourmash index
) or as a SqliteIndex
/sqldb
database is likely to provide a significant speed increase, albeit
with increased memory usage (SBT) or increased disk space (sqldb).
Conversely, ZipFileLinearIndex
and the default LCA_Database
are likely
to be poor choices for many searches - the former only supports linear
searches, and the latter needs to be loaded from disk and deserialized each
time.
Running gather
once¶
gather
is typically used to search a metagenome against a large
database of sketches, as part of finding a minimum set cover. This can
be quite slow! Our current implementation (as of sourmash 4.1.0, pull request sourmash#1370) does a single pass across the database to find all matches
with a Jaccard similarity or containment above the provided threshold, and then organizes
the matches for rapid min-set-cov analysis. This single pass across the
database is called a “prefetch”, and it is also implemented in the
prefetch
subcommand.
With this single pass approach, benchmarks - sourmash#2014 - show that a
linearly searchable database is performant enough to be used with
gather
. We therefore suggest using a ZipFileLinearIndex
container
with gather, or in cases where low-memory concurrency is desired, a
SqliteIndex
container.
Using prefetch
and gather
together¶
If you want to use prefetch
independently of gather
, you can use
the prefetch output as a picklist passed into gather - see
picklists, below. This can be useful when you want to
experiment with different threshold parameters for gather
- first,
do a very sensitive/low-threshold search with prefetch
and save the
results to a CSV file with -o
,
Repeated gathers and searches. CTB
Using prefetch explicitly. CTB
Using a higher scaled value¶
With FracMinHash sketches, you can downsample the query to make both
search
and gather
much faster. A good rule of thumb is to use a
scaled value that is about 5x smaller than the minimum overlap to
detect; so, if you want to be able to detect 50kb of similarity, you
can use a scaled value of 10,000. Conversely, the default scaled value
of 1,000 (for DNA sketches) should robustly detect overlaps of 5kb.
You can supply --scaled
to gather
and prefetch
to dynamically
downsample the query FracMinHash. For search
you will need to use
sourmash sig downsample
to generate a downsampled sketch.
Running gather
many times - multigather
¶
In situations where loading the search database is slow (e.g.
LCA_Database
or zipfiles with very large manifests), the sourmash multigather
command supports many queries against many databases.
(We don’t particularly suggest using multigather
; we would prefer
to make search databases faster. But it’s there! :)
Much faster search and gather with branchwater¶
We also have a reasonably stable plugin, pyo3_branchwater, that implements multithreaded operations using Rust. It is 100-1000 times faster than sourmash, and 5-50 times lower memory. In exchange, it’s not quite as flexible as the full sourmash package. But if you’re running into speed or memory problems, you should give it a try!
Taxonomy and assigning lineages¶
All sourmash taxonomy handling is done within the lca
and tax
subcommands (CLI) and submodules (Python).
In the case of the lca
subcommands, the taxonomic information is
incorporated into the LCA database construction (see the lca index
command), while the tax
subcommands load taxonomic information
on demand from taxonomy databases (CSVs or databases).
sourmash anchors all taxonomy to identifiers, and uses the signature
name to do so - this is the name as set by the --name
parameter to
sourmash sketch
, and output by sourmash sig describe
as the
signature:
field.
Identifier handling¶
sourmash prefers identifiers to be the first space-separated token in the signature name. This token can contain any alphanumeric letters other than space, and should contain at most one period. The version of the identifier will be the component after the period.
So, for example, for a signature name of
CP001941.1 Aciduliprofundum boonei T469, complete genome
the identifier would be CP001941.1
and the version would be 1.
There are no other constraints placed on the identifier, and
versions are not handled in any special way other than as below.
The lca index
and tax
commands both support some modified
identifier handling in sourmash 3.x and 4.x, but in the future, we
plan to deprecate these as they mostly cause confusion and internal
complexity.
The two modifiers are:
--keep-full-identifiers
will use the entire signature name instead of just the first space-separated token. It is by default off (set to False).--keep-identifier-versions
turns on keeping the full identifier, including what is after the first period. It is by default off (set to False), stripping identifiers of their version on load. When it is on (True), identifiers are not stripped of their version on load.
Taxonomies, or lineage spreadsheets¶
sourmash supports arbitrary (free) taxonomies, and new taxonomic lineages can be created and used internally as long as they are provided in the appropriate spreadsheet format.
You can also mix and match taxonomies as you need; for example, it is entirely legitimate in sourmash-land to combine the GTDB taxonomy for bacterial and archaeal sequence classification, with the NCBI taxonomy for eukaryotic and viral sequence classification. (You probably don’t want to mix and match within superkingdoms, though!)
As of sourmash v4, lineage spreadsheets should contain columns for superkingdom, phylum, class, order, family, genus, and species. Some commands may also support a ‘strain’ column, although this is inconsistently handled within sourmash internally at the moment.
For spreadsheet organization, lca index
expects the columns to be
present in order from superkingdom on down, while the tax
subcommands use CSV column headers instead. We are planning to
consolidate around the tax
subcommand handling in the future (see sourmash#2198).
An example spreadsheet is
here, bacteria_refseq_lineage.csv. (The
taxid
column is not used by most sourmash functions and is mostly
ignored, but it is needed for the kreport
and bioboxes
report
formats.)
LCA_SqliteDatabase
- a special case¶
The LCA_SqliteDatabase
index class can serve multiple purposes: as
an index of sketches (for regular search and gather); as a taxonomy
database for use with the tax
subcommands; and as an LCA database
for use with the lca
subcommands.
When used as a taxonomy database, an LCA_SqliteDatabase
file
contains the same SQL tables as a sqlite taxonomy database.
When used as an LCA database, an LCA_SqliteDatabase
dynamically loads
the taxonomic lineages from the sqlite database and applies them to the
individual hashes, permitting the same kind of hash-to-lineage query
capability as the LCA_Database
.
Picklists¶
Picklists are a generic mechanism used to select a (potentially small) subset of signatures for search/display.
The general idea of picklists is that you create a list of signatures
you’re interested in - by name, or identifier, or md5sum - and then
supply that list in a csvfile on the command line via --picklist
.
For example, --picklist list.csv:colname:ident
would load the
values in the column named colname
in the file list.csv
as identifiers
to be used to restrict the search.
The support picklist column types are name
, ident
(space-delimited identifier), identprefix
(identifier with version
removed), md5
, md5prefix8
, and md5short
. Generally the md5
and derived values are used to reference signatures found some other
way with sourmash, while the identifiers are more broadly useful.
There are also four special column types that can be used without a column
name: gather
, prefetch
, search
, and manifest
. These take the
CSV output of the respective sourmash commands as inputs for picklists,
so that you can use prefetch to generate a picklist and then use that
picklist with --picklist prefetch_out.csv.gz::prefetch
.
Differing internal behavior¶
Picklists behave differently with different Index
classes.
For indexed databases like SBT, LCA, and SqliteIndex
, the search is
done first, and then only those results that match the picklist are
selected.
For linear search databases like ZipFileLinearIndex
or standalone
manifests, picklists are first used to subselect the desired
signatures, and only those signatures are searched.
This means that picklists can dramatically speed up searches on some
Index
types, but won’t affect performance on others. But
the results will be the same.
Taxonomy / lineage spreadsheets as picklists¶
Note that lineage CSV spreadsheets, as consumed by sourmash tax
commands
and as output by sourmash tax grep
, can be used as ident
picklists.
Online and streaming; and adding to collections of sketches.¶
One of the big challenges with Big Data is looking at it all at once - loading all your data into memory, for example, will fail with really large data sets. The ability to look at subsets of data without looking at all of it is called “streaming” (much like when you watch a streaming movie online - you can start watching the movie without downloading the whole video, and you can also usually jump to a particular location in the video without downloading the intervening bits.)
Another related challenge is analyzing data against a database that is constantly growing, either because you’re adding to it or because it’s being updated by others. For example, in genomics, often you want to repeat the same analysis you did last time but with more reference genomes. With many software packages, this requires rebuilding your indexed database, which can be challenging for large genomes. In computer science parlance, the ability to add new data at the end without performing an expensive reindexing operation is referred to as “online”.
sourmash tackles these challenges in a few different ways, and does its best to support streaming and online behavior.
First, all sourmash commands can take multiple databases and will return the same results with multiple databases as they would with a single database containing the same sketches, unless otherwise noted. This allows you to incrementally expand your sketch collections over time without building new databases. Performance may vary (i.e. if you’re using an SBT to do search, and you add an unindexed collection of sketches to the search, the search may take longer than if you’d add the new sketches to the SBT) but the results will be the same. In this sense, many of the sourmash algorithms are online.
Second, several sourmash algorithms use streaming when searching
databases - in particular, prefetch
will load and unload sketches as
it goes, as long as the underlying collection data structure supports
it (.sig.gz
and LCA JSON databases do not, but zip files, SBTs,
and SQLite databases do). This lets you do containment searches
against really large collections without consuming large amounts of
memory. Another example is the manysearch
command in the
pyo3_branchwater
plugin, which loads and searches a limited number of metagenomes from
a large collection, rather than loading the entire collection into
memory - which would be impossible.
Last but not least, one of the interesting guarantees that FracMinHash sketches provide is that no hash is ever removed when sketching. This supports various types of input streaming, which we haven’t spent too much time exploring, but (for example) means that “watching” sequencing runs and/or downloads of sequencing data, and reporting interim results with certainty, is possible. If you’re interested in making use of this, please reach out!
Gather on multiple collections, and order of search and reporting¶
Since sourmash gather
will pick only one “best match” if there
are several (and will ignore the others), the order of searching
can matter for large collections. How does this work?
In brief, sourmash doesn’t guarantee a particular load order for sketches in a single collection, but it does guarantee that collections are loaded and searched in their entirety in the order that you provide them. So, for example, if you have a large zipfile database of sketches that contains duplicates, you can’t predict which of the duplicates will be chosen as a match; but you can build your own collection of prioritized matches as a separate database, and put it first on the command line. A practical application of this might be to list the GTDB “representatives” database first on the command line, with the full GTDB database second, in order to prioritize choosing representative genomes as matches over the rest.
This also plays a role in the order of reporting for prefetch
output - prefetch
will report matching sketches in the order it
encounters them, which will match the order in which collections are
given to sourmash prefetch
on the command line.
Formats natively understood by sourmash¶
sourmash should always autodetect the format of a collection or database, in most cases based on its content (and not its filename). Please file a bug report if this doesn’t work for you!
sourmash sig summarize
is a good way to examine the properties of a
signature collection.
Reading and writing gzipped CSV files¶
(As of sourmash v4.5)
When a CSV filename is specified (e.g. sourmash gather ... -o mygather.csv
), you can always provide a name that ends with .gz
to
produce a gzip-compressed file instead. This can save quite a bit of
space for prefetch results and manifests in particular!
All sourmash commands that take in a CSV (via manifest, or picklist,
or taxonomy) will autodetect a gzipped CSV based on content (the file
does not need to end with .gz
). The one exception is manifests,
where the CSV needs to end with .gz
to be loaded as a gzipped CSV;
see
sourmash#2214
for an issue to fix this.