Legacy Databases

Sourmash databases have evolved over time. We have changed how databases are stored (as uncompressed .zip files) and how each signature is named. All SBT databases below are in .sbt.zip format. Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and later. We detail these changes below and include links to legacy databases. See github.com/sourmash-bio/databases for a Snakemake workflow that builds current and legacy databases.

Sourmash signature names

Earlier versions of sourmash databases were built using individual signatures that were calculated as follows:

sourmash compute -k 4,5 \
                 -n 2000 \
                 --track-abundance \
                 --name-from-first \
                 -o {output} \
                 {input}

sourmash compute -k 21,31,51 \
                 --scaled 2000 \
                 --track-abundance \
                 --name-from-first \
                 -o {output} \
                 {input}

We moved away from this strategy because --name-from-first names each signature after the first sequence in the FASTA file. While the species name of the organism was present in this name, the accession number was that of the first sequence fragment in the file, not that of the genome assembly. We therefore revised our strategy so that signatures are named by genome assembly accession and species name. Doing so requires parsing NCBI's assembly_summary.txt file.
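As a minimal sketch of that parsing step (not the code used to build the databases), the function below reads a tab-separated assembly_summary.txt stream, finds the header line that lists the column names, and joins the assembly_accession and organism_name columns. The function name signature_names and the space-separated name format are assumptions made for illustration; the example input is a trimmed-down excerpt with only three columns.

```python
import csv
import io


def signature_names(handle):
    """Yield (accession, name) pairs from an assembly_summary.txt-style stream."""
    columns = None
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        if row[0].startswith("#"):
            # Comment lines start with '#'; one of them lists the column names.
            candidate = [row[0].lstrip("# ")] + row[1:]
            if "assembly_accession" in candidate:
                columns = candidate
            continue
        if columns is None:
            raise ValueError("no header line with column names found")
        record = dict(zip(columns, row))
        accession = record["assembly_accession"]
        # Assumed name format: "<assembly accession> <organism name>".
        yield accession, f"{accession} {record['organism_name']}"


# Illustrative, trimmed-down excerpt (real files have many more columns).
example = io.StringIO(
    "#   See the README for a description of the columns in this file.\n"
    "# assembly_accession\ttaxid\torganism_name\n"
    "GCF_000005845.2\t511145\tEscherichia coli str. K-12 substr. MG1655\n"
)
names = dict(signature_names(example))
print(names["GCF_000005845.2"])
```

Looking columns up by name rather than by position keeps the sketch working if columns are added or reordered in the file.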

Sourmash database compression

Legacy databases

RefSeq microbial genomes - SBT

These databases are formatted for use with sourmash search and sourmash gather. They are calculated with a scaled value of 2000.

Approximately 91,000 microbial genomes (including viral and fungal) from NCBI RefSeq.

Genbank microbial genomes - SBT

These databases are formatted for use with sourmash search and sourmash gather.

Approximately 98,000 microbial genomes (including viral and fungal) from NCBI Genbank.

Genbank microbial genomes - LCA

These databases are formatted for use with sourmash lca; they are v2 LCA databases and will work with sourmash v2.0a11 and later. They are calculated with a scaled value of 10000 (1e4).

Approximately 87,000 microbial genomes (including viral and fungal) from NCBI Genbank.

The above LCA databases were calculated as follows:

sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
    genbank-k21.lca.json.gz -k 21 --scaled=10000 \
    -f --traverse-directory .sbt.genbank-k21 --split-identifiers

See github.com/dib-lab/2018-ncbi-lineages for information on preparing the genbank-genomes-taxonomy when signatures are generated using --name-from-first.
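To illustrate why the taxonomy file has to match names produced by --name-from-first, here is a rough sketch of the idea behind --split-identifiers. It assumes the option keeps the first whitespace-delimited token of each signature name and drops the accession version suffix; this is our reading of the option, not the sourmash implementation. The names split_identifier and keep_versions are hypothetical, and the example name is illustrative.

```python
def split_identifier(name, keep_versions=False):
    """Return the identifier portion of a signature name (illustrative only)."""
    # Keep the first space-delimited token of the name...
    ident = name.split(" ")[0]
    if not keep_versions:
        # ...and drop everything after the first '.', i.e. the version suffix.
        ident = ident.split(".")[0]
    return ident


# A name as --name-from-first would record it: the first FASTA header,
# whose accession refers to a sequence record, not to the genome assembly.
first_record_name = (
    "NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome"
)
print(split_identifier(first_record_name))
print(split_identifier(first_record_name, keep_versions=True))
```

Under these assumptions, the identifiers in the taxonomy CSV would need to be sequence-record accessions such as the one above rather than assembly accessions.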

GTDB databases - SBT

All files below are available here.

Release 89

Release 95