# A guide to the internal design and structure of sourmash ```{contents} Contents :depth: 3 ``` sourmash was created in 2015, and has been repeatedly reorganized, refactored, and optimized to support ever larger databases, faster queries, and new use cases. We've also regularly added new functionality and features. So sourmash can be pretty complicated internally, and our user-facing documentation only covers a fraction of its potential! This document is a brain dump intended for expert users and sourmash developers who want to understand how, why, and when to use various sourmash features. It is unlikely ever to be comprehensive, so the information you are interested in may not yet exist in this document, but we are always happy to add to it - [just ask in an issue!](https://github.com/sourmash-bio/sourmash/issues) ## Signatures and sketches sourmash operates on sketches. Each sketch is a collection of hashes, which are in turn built from k-mers by applying a hash function (currently always murmurhash) and a filtering function. Each sketch is contained in a signature wrapper that contains some metadata. Internally, sketches (class `MinHash`) contain the following information: * a set of hashes; * an optional abundance for each hash (when `track_abund` is True); * a seed; * a k-mer size; * a molecule type; * either a `num` (for MinHash) or a `scaled` value (for FracMinHash); Signature objects (class `SourmashSignature`) contain a sketch (property `.minhash`) as well as additional information: * an optional `name` * an optional `filename` * a license (currently must be CC0); * an `md5sum(...)` method that returns a hash of the sketch. For now, we tend to refer to signatures and sketches interchangeably, because they are almost entirely 1:1 in the code base (but see [sourmash#616](https://github.com/sourmash-bio/sourmash/issues/616)). The default signature interchange/serialization format is JSON, optionally gzipped. This format is written and read by Rust code. In general, a lot of effort in sourmash is spent managing collections of signatures _before_ actually doing comparisons with them; see manifests, and `Index` objects, below. ### Making sketches Sketches are produced by hashing k-mers with murmurhash and then keeping either the lowest `num` hashes (for MinHashes sketches) or keeping all hashes below `2**64 / scaled` (for FracMinHash sketches). This has the effect of selecting approximately one hash for every `scaled` k-mers - so, when sketching a set of 100,000 distinct k-mers, a scaled value of 1,000 would yield approximately 100 hashes to be retained in the sketch. The default MinHash sketches use parameters so that they are compatible with mash sketches. See [utils/compute-dna-mh-another-way.py](https://github.com/sourmash-bio/sourmash/blob/latest/utils/compute-dna-mh-another-way.py) for details on how k-mers are hashed. Note that if hashes are produced some other way (with a different hash function) or from some source other than DNA, sourmash can still work with them; only `sourmash sketch` actually cares about DNA sequences, everything else works with hashes. ### Compatibility checking The point of the signatures and sketches is to enable certain kinds of rapid comparisons - Jaccard similarity and number of overlapping k-mers, specifically. However, these comparisons can only be done between compatible sketches. Here, "compatible" means - * the same MurmurHash seed (default 42); * the same k-mer size/ksize (see k-mer sizes, below); * the same molecule type (see molecule types, below); * the same `num` or `scaled` (although see [this downsampling discussion](api-example.md#downsampling-signatures), and the next two sections); sourmash uses selectors (`Index.select(...)`) to select sketches with compatible ksizes, molecule types, and sketch types. ### Scaled (FracMinHash) sketches support similarity and containment Per our discussion in [Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), FracMinHash sketches can always be compared by downsampling to the max of the two scaled values. (This is not always indexed collections of sketches, e.g. SBTs; see [sourmash#1799](https://github.com/sourmash-bio/sourmash/issues/1799).) In practice, sourmash does all necessary downsampling dynamically, but returns the original sketches. This means that (for example) you can do a low-resolution/high-scaled search quickly by specifying a high `scaled` value, and then use a higher resolution comparison with the resulting matches for a more refined and accurate analysis (see below, [Speeding up `gather` and `search`](#speeding-up-gather-and-search).) ### Num (MinHash) sketches support Jaccard similarity "Regular" MinHash (or "num MinHash") sketches are implemented the same way as in mash. However, they are less well supported in sourmash, because they don't offer the same opportunities for metagenome analysis. (See also [sourmash#1354](https://github.com/sourmash-bio/sourmash/issues/1354).) Num MinHash sketches can always be compared by downsampling to a common `num` value. This may need to be done manually using `sourmash sig downsample`, however. ### Conversion between Scaled (FracMinHash) and Num (MinHash) signatures with `downsample` As discussed in the previous sections, it is possible to adjust the `scaled` and `num` values to compare two FracMinHash signatures or two Num MinHash signatures. However, it is also possible to covert between the `scaled` and `num` signatures with the `sourmash sig downsample` command. For more details, review the [command line docs for `sig downsample`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-signature-downsample-decrease-the-size-of-a-signature). ### Operations you can do safely with FracMinHash sketches As described in [Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), FracMinHash sketches support a wide range of operations that mirror actions taken on the underlying data set _without_ revisiting the underlying data. This allows users to build sketches once (requiring the original data) and then do all sorts of manipulations on the sketches, and know that the results of the sketch manipulations represent what would happen if they did the same thing on the original data. For example, * set unions, intersections, and subtractions all perform the same when done on the sketches as when applied to the underling data. So, for example, you can sketch two files separately and merge the sketches (with `sig merge`), and get the same result as if you'd concatenated the files first and then sketched them. * if you filter hashes on abundance with `sig filter`, you get the same result as if you filtered the data set on k-mer abundance and then sketched it. * downsampling: you can sketch the original data set at a high resolution (e.g. scaled=100) and then downsample it later (to e.g. scaled=1000), and get the same result as if you'd sketched the data set at scaled=1000. ## K-mer sizes There is no explicit restriction on k-mer sizes built into sourmash. For highly specific genome and metagenome comparisons, we typically use k=21, k=31, or k=51. For a longer discussion, see [Assembly Free Analysis with K-mers](https://github.com/mblstamps/stamps2022/blob/main/kmers_and_sourmash/2022-stamps-assembly-free%20class.pdf) from STAMPS 2022 and a more general overview at [Using sourmash:a practical guide](using-sourmash-a-guide.md). ## Molecule types - DNA, protein, Dayhoff, and hydrophobic-polar sourmash supports four different sequence encodings, which we refer to as "molecule": DNA (`--dna`), protein (`--protein`), [Dayhoff encoding](https://en.wikipedia.org/wiki/Margaret_Oakley_Dayhoff#Table_of_Dayhoff_encoding_of_amino_acids), (`--dayhoff`), and [hydrophobic-polar](sourmash-sketch.md#protein-encodings) (`--hp`). All FracMinHash sketches have exactly one molecule type, and can only be compared to the same molecule type (and ksize). DNA moltype sketches can be constructed from DNA input sequences using `sourmash sketch dna`. Protein, Dayhoff, and HP moltype sketches can be constructed from protein input sequences using `sourmash sketch protein`, or from DNA input sequences using `sourmash sketch translate`; `translate` will translate in all six reading frames (see also [orpheum](https://github.com/czbiohub/orpheum) from [Botvinnik et al., 2021](https://www.biorxiv.org/content/10.1101/2021.07.09.450799v1)). By default protein sketches will be created; dayhoff sketches can be created by including `dayhoff` in the param string, e.g. `sourmash sketch protein -p dayhoff`, and hydrophobic-polar sketches can be built with `hp` in the param string, e.g. `sourmash sketch protein -p hp`. ## Manifests Manifests are catalogs of sketches: they include most of the information about a sketch, except for the actual hashes. The idea of manifests is that you can use them to identify *which* sketches you are interested in before actually working with them (loading them, for example). sourmash makes extensive use of signature manifests to support rapid selection and lazy loading of signatures based on signature metadata (name, ksize, moltype, etc.) See [Blog post: Scaling sourmash to millions of samples](http://ivory.idyll.org/blog/2021-sourmash-scaling-to-millions.html) for some of the motivation. Manifests are an internal format that is not meant to be particularly human readable, but the CSV format can be loaded into a spreadsheet program if you're curious :). If a collection of signatures is in a zipfile (`.zip`) or SBT zipfile (`.sbt.zip`), manifests must be named `SOURMASH-MANIFEST.csv`. They can also be stored directly on disk in CSV/gzipped CSV, or in a sqlite database; see `sourmash sig manifest`, `sourmash sig check`, and `sourmash sig collect` for manifest creation, management, and export utilities. Where signatures are stored individually in `Index` collections, e.g. as separate files in a zipfile, manifests may be stored alongside them; for other subclasses of `Index` such as the inverted indices, manifests are generated dynamically by the class itself. Currently (sourmash 4.x) manifests do not contain information about the hash seed or sketch license. This will be fixed in the future - see [sourmash#1849](https://github.com/sourmash-bio/sourmash/issues/1849). Manifests are very flexible and, especially when stored in a sqlite database, can be extremely performant for organizing hundreds of thousands to millions of sketches. Please see `StandaloneManifestIndex` for a lazy-loading `Index` class that supports such massive-scale organization. ## Index implementations The `Index` class and its various subclasses (in `sourmash.index`) are containers that provide an API for organizing, selecting, and searching (potentially) large numbers of signatures. `sourmash sig summarize` is a good way to determine what type of `Index` class is used to handle a collection. Loading and saving of `Index` objects is handled separately from the class: loading can be done in Python via the `sourmash.load_file_as_index(...)` method, while creation and/or updating of `Index` objects is done via `sourmash.sourmash_args.SaveSignaturesToLocation(...)`. These are the same APIs used by the command-line functionality. There are quite a few different `Index` subclasses and they all have distinct features. We have a high-level guide to which collection type to use [here](command-line.md#loading-many-signatures). Conceptually, `Index` classes are either organized around storing individual signatures often with metadata that permits loading, selecting, and/or searching them more efficiently (e.g. `ZipFileLinearIndex` and `SBTs`); or they store signatures as inverted indices (`LCA_Database` and `SqliteIndex`) that permit certain kinds of fast queries. Unless otherwise noted, the `Index` classes below can be loaded concurrently in "read only" mode - that is, you should build the collection _once_, and then use it from multiple processes. We currently do not test for or support concurrent read/write. Note also that (generally speaking) memory footprints will be additive, so loading the same `LCA_Database` twice will consume twice the memory. (If you're interested in concurrency, we suggest using the sqlite containers - see `SqliteIndex`.) ### In-memory storage and search. The simplest way to handle collections of signatures is to load them into memory, but it is also the least performant and most memory intensive mechanism! `LinearIndex` and `MultiIndex` both support sketches loaded from JSON files; both will load the sketches once and then keep them in memory. `LinearIndex` does not use manifests while `MultiIndex` builds a manifest as it loads the sketches. Note that `MultiIndex` is the class used to load signatures from pathlists, directory hierarchies, and so on; because it stores sketches in memory, this can incur a significant memory penalty (see [sourmash#1899](https://github.com/sourmash-bio/sourmash/issues/1899)). Therefore where possible we suggest building a standalone manifest (`StandaloneManifestIndex`) to do lazy loading from the disk instead; you can use `sourmash sig collect` to do this. ### Zipfile collections `ZipFileLinearIndex` stores signature files in a zip file with an accompanying manifest. This is the most versatile and compressed option for working with large collections of sketches - it supports rapid selection and loading of specific sketches from disk, and can store and search any mixture of sketch types (ksize, molecule type, scaled values, etc.) By default, `ZipFileLinearIndex` stores one signature (equiv. one sketch) in each member file in the zip archive. Each signature is stored uncompressed. The accompanying manifest stores the full member file path in `internal_location`, so that sketches can be retrieved directly. Searching a `ZipFileLinearIndex` is done linearly, as implied by the name. This is fine for `gather` but if you are doing repeated queries with `search` you may want to use an SBT or LCA database instead; see below. In the future we expect to parallelize searching `ZipFileLinearIndex` files in Rust; see [sourmash#1752](https://github.com/sourmash-bio/sourmash/issues/1752). `ZipFileLinearIndex` does support zip files without manifests as well as multiple signatures in a single file; this was originally intended to support simply zipping entire directory hierarchies into a zipfile. However, this slows down performance and is not recommended. If you have an existing zipfile (or really any collection of signatures) and you want to turn them into a proper `ZipFileLinearIndex`, you can use `sig cat -o combined.zip` to create a `ZipFileLinearIndex` file named `combined.zip` that will have a manifest and signatures broken out into individual files. ### Sequence Bloom Trees (SBTs) Sequence Bloom Trees (SBTs; see [the Kingsford Lab page for details](http://www.cs.cmu.edu/~ckingsf/software/bloomtree/)) provide a faster (but more memory intensive) on-disk storage and search mechanism. In brief, SBTs implement a binary tree organization of an arbitrary number of signatures; each internal node is a Bloom filter containing all of the hashes for the nodes below them. This permits potentially rapid elimination of irrelevant nodes on search. SBTs are restricted to storing and searching sketches with the same/single k-mer size and molecule type, as well as either a single num value or a single scaled value. We suggest using SBTs when you are doing multiple Jaccard search or containment searches with genomes via `sourmash search`. ### Lowest common ancestor (LCA) databases The `LCA_Database` index class stores signatures in an inverted index, where a Python dictionary is used to link individual hashes back to signatures and/or taxonomic lineages. This supports the individualized hash analyses used in the `lca` submodule. LCA databases only support a single ksize, moltype, and scaled. They can only be used with FracMinHash (scaled) sketches. The default `LCA_Database` class is serialized via JSON, and loads everything into memory when requested. The load time incurs a significant latency penalty when used from the command line, as well as having a potentially large memory footprint; this makes it difficult to use the default `LCA_Database` for very large databases, e.g. genbank bacteria. The newer `LCA_SqliteDatabase` (based on `SqliteIndex`, described below) also supports LCA-style queries, and is stored on disk, is fast to load, and uses very little memory. The tradeoff is that the underlying sqlite database can be quite large. `LCA_SqliteDatabase` should also support rapid concurrent access (see [sourmash#909](https://github.com/sourmash-bio/sourmash/issues/909)). Both types of LCA database can be constructed with `sourmash lca index`. ### SqliteIndex The `SqliteIndex` storage class uses sqlite3 to store hashes and sketch information for search and retrieval; see [this blog post](http://ivory.idyll.org/blog/2022-storing-ulong-in-sqlite-sourmash.html) for background information and details. These are fast, low-memory, on-disk databases, with the tradeoff that they can be quite large. This is probably currently the best solution for concurrent access to sketches via e.g. a Web server (see also [sourmash#909](https://github.com/sourmash-bio/sourmash/issues/909)). `SqliteIndex` can only contain FracMinHash sketches and can only store sketches with the same scaled parameter. However, it can store multiple ksizes and moltypes as long as the same scaled is used. `SqliteIndex` objects can be constructed using `sourmash sig cat ... -o filename.sqldb`. ### Standalone manifests The `StandaloneManifestIndex` class loads standalone manifests generated by `sourmash sig collect`. They support rapid selection and lazy loading on potentially extremely large collections of signatures. The underlying mechanism uses the `internal_location` field of manifests to point to the container file. When particular sketches are requested, the container file is loaded into an `Index` object with `sourmash.load_file_as_index` and the `md5` values of the requested sketches are used as a picklist to retrieve the desired signatures. Thus, while standalone manifests can point at any kind of container, including JSON files or LCA databases, they are most efficient when `internal_location` points at a file with either a single sketch in it, or a manifest that supports direct loading of sketches. Therefore, we suggest using standalone manifest indices. Note that sourmash interprets paths to locations in standalone manifests relative to the manifest filename; see the `--relpath` behavior in `sig check` and `sig collect` to output manifests that deal with relative filenames properly. Note that searching a standalone manifest is currently done through a linear iteration, and does not use any features of indexed containers such as SBTs or LCAs. This is fine for `gather` with the default approach, but is probably suboptimal for a `search`. ### Pathlists and `--from-file` All (or most) sourmash commands natively support taking in lists of signature collections via pathlists, `--from-file`, or paths to directories. This is useful for situations where you have thousands of signature files and don't want to provide them explicitly on the command line; you can simply put a list of the files in a text file, and pass it in directly (or use `--from-file` to pass it in). Both pathlists and files passed to `--from-file` contain a list of paths to be loaded; relatives paths will be interpreted relative to the current working directory of sourmash. Pathlists should be universally available on sourmash commands. When `--from-file` is available for a command, sourmash will behave as if the file paths in the file were provided on the command line. We suggest avoiding pathlists. Instead, we suggest using `--from-file` or a standalone manifest index (generated with `sourmash sig collect`). This is because the signatures from pathlists are loaded into memory (see `MultiIndex`, above) it is generally a bad idea to use them - they may be slow to load and may consume a lot of memory. They also do not support good loading error messages; see [sourmash#1414](https://github.com/sourmash-bio/sourmash/issues/1414). ### Extensions for outputting index classes Most commands that support saving signatures will save them in a variety of formats, based on the extension provided (see [sourmash#1890](https://github.com/sourmash-bio/sourmash/issues/1890) for exceptions). The supported extensions are - * `.zip` for `ZipFileLinearIndex` * `.sqldb` for `SqliteIndex` * `.sig` or `.sig.gz` for JSON/gzipped JSON * `dirname/` to save in a directory hierarchy The default signature save format is JSON, if the extension is not recognized. ## Speeding up `gather` and `search` There are two primary search commands in sourmash: `gather` and `search`. `gather` calculates a minimum metagenome cover as discussed in [Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). It is mostly intended for querying a database with a metagenome, although it can be used with genome queries, as well. This approach relies on overlaps between genomes and metagenomes and can only be used with FracMinHash sketches. `search` does a straight Jaccard similarity search on MinHash and FracMinHash sketches (or, with `--containment`, a containment search on FracMinHash sketches). It is typically used to find matches to a query genome sketch in a large database of sketches. The `prefetch` command does a containment search and is intended for power users; it is a standalone implementation of the prefetch algorithm discussed below for `gather`. It only works with FracMinHash sketches. Note that all of these commands work with any and all `Index` collection/container types, and will return the same results however the collections are organized - see the "online behavior" section, below. In practice this means that you can provide additional collections of signatures via the command line without building a combined index of all your signatures. It also means that the only reason to choose different collections/containers is for optimization - you should select the containers that help you achieve the desired performance characteristics for your search (i.e. the right memory/time/disk space tradeoffs). ### Running `search` many times on the same database `search` typically is used to search a large database of sketches for all similarity or containment matches above a threshold. Depending on the query and the database, certain kinds of database indices may make search much faster, especially when only a few matches are expected. If you are doing many searches against the same database, indexing the database as an SBT (with `sourmash index`) or as a `SqliteIndex`/sqldb database is likely to provide a significant speed increase, albeit with increased memory usage (SBT) or increased disk space (sqldb). Conversely, `ZipFileLinearIndex` and the default `LCA_Database` are likely to be poor choices for many searches - the former only supports linear searches, and the latter needs to be loaded from disk and deserialized each time. ### Running `gather` once `gather` is typically used to search a metagenome against a large database of sketches, as part of finding a minimum set cover. This can be quite slow! Our current implementation (as of sourmash 4.1.0, [pull request sourmash#1370](https://github.com/sourmash-bio/sourmash/pull/1370)) does a single pass across the database to find all matches with a Jaccard similarity or containment above the provided threshold, and then organizes the matches for rapid min-set-cov analysis. This single pass across the database is called a "prefetch", and it is also implemented in the `prefetch` subcommand. With this single pass approach, benchmarks - [sourmash#2014](https://github.com/sourmash-bio/sourmash/issues/2014) - show that a linearly searchable database is performant enough to be used with `gather`. We therefore suggest using a `ZipFileLinearIndex` container with gather, or in cases where low-memory concurrency is desired, a `SqliteIndex` container. ### Using `prefetch` and `gather` together If you want to use `prefetch` independently of `gather`, you can use the prefetch output as a picklist passed into gather - see [picklists](#picklists), below. This can be useful when you want to experiment with different threshold parameters for `gather` - first, do a very sensitive/low-threshold search with `prefetch` and save the results to a CSV file with `-o`, Repeated gathers and searches. CTB Using prefetch explicitly. CTB ### Using a higher scaled value With FracMinHash sketches, you can downsample the query to make both `search` and `gather` _much_ faster. A good rule of thumb is to use a scaled value that is about 5x smaller than the minimum overlap to detect; so, if you want to be able to detect 50kb of similarity, you can use a scaled value of 10,000. Conversely, the default scaled value of 1,000 (for DNA sketches) should robustly detect overlaps of 5kb. You can supply `--scaled` to `gather` and `prefetch` to dynamically downsample the query FracMinHash. For `search` you will need to use `sourmash sig downsample` to generate a downsampled sketch. ### Running `gather` many times - `multigather` In situations where loading the search database is slow (e.g. `LCA_Database` or zipfiles with very large manifests), the `sourmash multigather` command supports many queries against many databases. (We don't particularly suggest using `multigather`; we would prefer to make search databases faster. But it's there! :) ### Much faster search and gather with branchwater We also have a reasonably stable plugin, [pyo3_branchwater](https://github.com/sourmash-bio/pyo3_branchwater), that implements multithreaded operations using Rust. It is 100-1000 times faster than sourmash, and 5-50 times lower memory. In exchange, it's not quite as flexible as the full sourmash package. But if you're running into speed or memory problems, you should give it a try! ## Taxonomy and assigning lineages All sourmash taxonomy handling is done within the `lca` and `tax` subcommands (CLI) and submodules (Python). In the case of the `lca` subcommands, the taxonomic information is incorporated into the LCA database construction (see the `lca index` command), while the `tax` subcommands load taxonomic information on demand from taxonomy databases (CSVs or databases). sourmash anchors all taxonomy to identifiers, and uses the signature name to do so - this is the name as set by the `--name` parameter to `sourmash sketch`, and output by `sourmash sig describe` as the `signature:` field. ### Identifier handling sourmash prefers identifiers to be the first space-separated token in the signature name. This token can contain any alphanumeric letters other than space, and should contain at most one period. The version of the identifier will be the component after the period. So, for example, for a signature name of ``` CP001941.1 Aciduliprofundum boonei T469, complete genome ``` the identifier would be `CP001941.1` and the version would be 1. There are no other constraints placed on the identifier, and versions are not handled in any special way other than as below. The `lca index` and `tax` commands both support some modified identifier handling in sourmash 3.x and 4.x, but in the future, we plan to deprecate these as they mostly cause confusion and internal complexity. The two modifiers are: * `--keep-full-identifiers` will use the entire signature name instead of just the first space-separated token. It is by default off (set to False). * `--keep-identifier-versions` turns on keeping the full identifier, including what is after the first period. It is by default off (set to False), stripping identifiers of their version on load. When it is on (True), identifiers are not stripped of their version on load. ### Taxonomies, or lineage spreadsheets sourmash supports arbitrary (free) taxonomies, and new taxonomic lineages can be created and used internally as long as they are provided in the appropriate spreadsheet format. You can also mix and match taxonomies as you need; for example, it is entirely legitimate in sourmash-land to combine the GTDB taxonomy for bacterial and archaeal sequence classification, with the NCBI taxonomy for eukaryotic and viral sequence classification. (You probably don't want to mix and match within superkingdoms, though!) As of sourmash v4, lineage spreadsheets should contain columns for superkingdom, phylum, class, order, family, genus, and species. Some commands may also support a 'strain' column, although this is inconsistently handled within sourmash internally at the moment. For spreadsheet organization, `lca index` expects the columns to be present in order from superkingdom on down, while the `tax` subcommands use CSV column headers instead. We are planning to consolidate around the `tax` subcommand handling in the future (see [sourmash#2198](https://github.com/sourmash-bio/sourmash/issues/2198)). An example spreadsheet is [here, bacteria_refseq_lineage.csv](https://github.com/sourmash-bio/sourmash/blob/latest/tests/test-data/tax/bacteria_refseq_lineage.csv). (The `taxid` column is not used by most sourmash functions and is mostly ignored, but it is needed for the `kreport` and `bioboxes` report formats.) ### `LCA_SqliteDatabase` - a special case The `LCA_SqliteDatabase` index class can serve multiple purposes: as an index of sketches (for regular search and gather); as a taxonomy database for use with the `tax` subcommands; and as an LCA database for use with the `lca` subcommands. When used as a taxonomy database, an `LCA_SqliteDatabase` file contains the same SQL tables as a sqlite taxonomy database. When used as an LCA database, an `LCA_SqliteDatabase` dynamically loads the taxonomic lineages from the sqlite database and applies them to the individual hashes, permitting the same kind of hash-to-lineage query capability as the `LCA_Database`. ## Picklists Picklists are a generic mechanism used to select a (potentially small) subset of signatures for search/display. The general idea of picklists is that you create a list of signatures you're interested in - by name, or identifier, or md5sum - and then supply that list in a csvfile on the command line via `--picklist`. For example, `--picklist list.csv:colname:ident` would load the values in the column named `colname` in the file `list.csv` as identifiers to be used to restrict the search. The support picklist column types are `name`, `ident` (space-delimited identifier), `identprefix` (identifier with version removed), `md5`, `md5prefix8`, and `md5short`. Generally the `md5` and derived values are used to reference signatures found some other way with sourmash, while the identifiers are more broadly useful. There are also four special column types that can be used without a column name: `gather`, `prefetch`, `search`, and `manifest`. These take the CSV output of the respective sourmash commands as inputs for picklists, so that you can use prefetch to generate a picklist and then use that picklist with `--picklist prefetch_out.csv.gz::prefetch`. ### Differing internal behavior Picklists behave differently with different `Index` classes. For indexed databases like SBT, LCA, and `SqliteIndex`, the search is done _first_, and then only those results that match the picklist are selected. For linear search databases like `ZipFileLinearIndex` or standalone manifests, picklists are _first_ used to subselect the desired signatures, and only those signatures are searched. This means that picklists can dramatically speed up searches on some `Index` types, but won't affect performance on others. But the results will be the same. ### Taxonomy / lineage spreadsheets as picklists Note that lineage CSV spreadsheets, as consumed by `sourmash tax` commands and as output by `sourmash tax grep`, can be used as `ident` picklists. ## Online and streaming; and adding to collections of sketches. One of the big challenges with Big Data is looking at it all at once - loading all your data into memory, for example, will fail with really large data sets. The ability to look at subsets of data without looking at _all_ of it is called "streaming" (much like when you watch a streaming movie online - you can start watching the movie without downloading the whole video, and you can also usually jump to a particular location in the video without downloading the intervening bits.) Another related challenge is analyzing data against a database that is constantly growing, either because you're adding to it or because it's being updated by others. For example, in genomics, often you want to repeat the same analysis you did last time but with more reference genomes. With many software packages, this requires rebuilding your indexed database, which can be challenging for large genomes. In computer science parlance, the ability to add new data at the end _without_ performing an expensive reindexing operation is referred to as "online". sourmash tackles these challenges in a few different ways, and does its best to support streaming and online behavior. First, all sourmash commands can take multiple databases and will return the same results with multiple databases as they would with a single database containing the same sketches, unless otherwise noted. This allows you to incrementally expand your sketch collections over time without building new databases. _Performance_ may vary (i.e. if you're using an SBT to do search, and you add an unindexed collection of sketches to the search, the search may take longer than if you'd add the new sketches to the SBT) but the _results_ will be the same. In this sense, many of the sourmash algorithms are online. Second, several sourmash algorithms use _streaming_ when searching databases - in particular, `prefetch` will load and unload sketches as it goes, as long as the underlying collection data structure supports it (`.sig.gz` and LCA JSON databases do _not_, but zip files, SBTs, and SQLite databases _do_). This lets you do containment searches against really large collections without consuming large amounts of memory. Another example is the `manysearch` command in the [pyo3_branchwater](https://github.com/sourmash-bio/pyo3_branchwater) plugin, which loads and searches a limited number of metagenomes from a large collection, rather than loading the entire collection into memory - which would be impossible. Last but not least, one of the interesting guarantees that FracMinHash sketches provide is that no hash is ever _removed_ when sketching. This supports various types of input streaming, which we haven't spent too much time exploring, but (for example) means that "watching" sequencing runs and/or downloads of sequencing data, and reporting interim results with certainty, is possible. If you're interested in making use of this, please reach out! ### Gather on multiple collections, and order of search and reporting Since `sourmash gather` will pick only one "best match" if there are several (and will ignore the others), the order of searching can matter for large collections. How does this work? In brief, sourmash doesn't guarantee a particular load order for sketches in a single collection, but it _does_ guarantee that collections are loaded and searched in their entirety in the order that you provide them. So, for example, if you have a large zipfile database of sketches that contains duplicates, you can't predict which of the duplicates will be chosen as a match; but you _can_ build your own collection of prioritized matches as a separate database, and put it first on the command line. A practical application of this might be to list the GTDB "representatives" database first on the command line, with the full GTDB database second, in order to prioritize choosing representative genomes as matches over the rest. This also plays a role in the order of reporting for `prefetch` output - `prefetch` will report matching sketches in the order it encounters them, which will match the order in which collections are given to `sourmash prefetch` on the command line. ## Formats natively understood by sourmash sourmash should always autodetect the format of a collection or database, in most cases based on its content (and not its filename). Please file a bug report if this doesn't work for you! `sourmash sig summarize` is a good way to examine the properties of a signature collection. ### Reading and writing gzipped CSV files (As of sourmash v4.5) When a CSV filename is specified (e.g. `sourmash gather ... -o mygather.csv`), you can always provide a name that ends with `.gz` to produce a gzip-compressed file instead. This can save quite a bit of space for prefetch results and manifests in particular! All sourmash commands that take in a CSV (via manifest, or picklist, or taxonomy) will autodetect a gzipped CSV based on content (the file does not need to end with `.gz`). The one exception is manifests, where the CSV needs to end with `.gz` to be loaded as a gzipped CSV; see [sourmash#2214](https://github.com/sourmash-bio/sourmash/issues/2214) for an issue to fix this.