Legacy Databases
Sourmash databases have evolved over time.
We have changed how the database is stored (uncompressed .zip) and how we name each signature.
All SBT databases below are in .sbt.zip format.
Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and up.
We detail these changes below, and include links to legacy databases.
See github.com/sourmash-bio/databases for a Snakemake workflow that builds current and legacy databases.
Sourmash signature names
Earlier versions of sourmash databases were built using individual signatures that were calculated as follows:
sourmash compute -k 4,5 \
    -n 2000 \
    --track-abundance \
    --name-from-first \
    -o {output} \
    {input}

sourmash compute -k 21,31,51 \
    --scaled 2000 \
    --track-abundance \
    --name-from-first \
    -o {output} \
    {input}
We moved away from this strategy because --name-from-first named each signature from the name of the first sequence in the FASTA file.
While the species name of the organism was present in this name, the accession number corresponded to the accession of the first sequence fragment in the file, not the genome assembly.
As such, we revised our strategy so that signatures are named by genome assembly accession and species name.
This requires parsing the assembly_summary.txt file.
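The renaming step described above can be sketched in Python. This is a minimal illustration, not the code used to build the databases: it assumes the tab-separated assembly_summary.txt layout distributed by NCBI, in which a header line beginning with # names the assembly_accession and organism_name columns. The function name signature_names, the "accession organism" name format, and the trimmed sample rows are assumptions made for illustration.

```python
import io


def signature_names(lines):
    """Yield (accession, name) pairs from the lines of an assembly_summary.txt file.

    Assumes a '#'-prefixed, tab-separated header line that contains the
    'assembly_accession' and 'organism_name' columns.
    """
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Comment lines are skipped; the one listing column names is the header.
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if header is None:
            raise ValueError("no header line found before the first data row")
        row = dict(zip(header, line.split("\t")))
        accession = row["assembly_accession"]
        # Hypothetical name format: assembly accession followed by organism name.
        yield accession, f"{accession} {row['organism_name']}"


# Illustrative input only: a trimmed set of columns and two example rows.
header_cols = ["assembly_accession", "bioproject", "biosample", "wgs_master",
               "refseq_category", "taxid", "species_taxid", "organism_name"]
rows = [
    ["GCF_000005845.2", "PRJNA57779", "SAMN02604091", "", "reference genome",
     "511145", "562", "Escherichia coli str. K-12 substr. MG1655"],
    ["GCF_000009045.1", "PRJNA57675", "SAMEA3138188", "", "reference genome",
     "224308", "1423", "Bacillus subtilis subsp. subtilis str. 168"],
]
sample = (
    "# free-text comment line\n"
    + "# " + "\t".join(header_cols) + "\n"
    + "".join("\t".join(r) + "\n" for r in rows)
)

for accession, name in signature_names(io.StringIO(sample)):
    print(name)
```

Looking columns up by header name rather than by fixed position keeps the sketch working if columns are added or reordered. With a real file, pass an open file handle in place of the io.StringIO object.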
Sourmash database compression
Legacy databases
RefSeq microbial genomes - SBT
These databases are formatted for use with sourmash search and sourmash gather. They are calculated with a scaled value of 2000.
Approximately 91,000 microbial genomes (including viral and fungal) from NCBI RefSeq.
RefSeq k=21, 2018.03.29 - 3.3 GB - manifest
RefSeq k=31, 2018.03.29 - 3.3 GB - manifest
RefSeq k=51, 2018.03.29 - 3.4 GB - manifest
Genbank microbial genomes - SBT
These databases are formatted for use with sourmash search and sourmash gather.
Approximately 98,000 microbial genomes (including viral and fungal) from NCBI Genbank.
Genbank k=21, 2018.03.29 - 3.9 GB - manifest
Genbank k=31, 2018.03.29 - 3.9 GB - manifest
Genbank k=51, 2018.03.29 - 3.9 GB - manifest
Genbank microbial genomes - LCA
These databases are formatted for use with sourmash lca; they are v2 LCA databases and will work with sourmash v2.0a11 and later.
They are calculated with a scaled value of 10000 (1e4).
Approximately 87,000 microbial genomes (including viral and fungal) from NCBI Genbank.
Genbank k=21, 2017.11.07, 109 MB
Genbank k=31, 2017.11.07, 120 MB
Genbank k=51, 2017.11.07, 125 MB
The above LCA databases were calculated as follows:
sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
    genbank-k21.lca.json.gz -k 21 --scaled=10000 \
    -f --traverse-directory .sbt.genbank-k21 --split-identifiers
See github.com/dib-lab/2018-ncbi-lineages for information on preparing the genbank-genomes-taxonomy file when signatures are generated using --name-from-first.
GTDB databases - SBT
All files below are available here.
Release 89
Release 95