# Legacy Databases

Sourmash databases have evolved over time. We have changed how the database is stored (uncompressed `.zip`) and how we name each signature. All SBT databases below are in `.sbt.zip` format. Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and up. We detail these changes below, and include links to legacy databases.

See [github.com/sourmash-bio/databases](https://github.com/sourmash-bio/databases) for a Snakemake workflow that builds current and legacy databases.

## Sourmash signature names

Earlier versions of sourmash databases were built using individual signatures that were calculated as follows:

```
sourmash compute -k 4,5 \
  -n 2000 \
  --track-abundance \
  --name-from-first \
  -o {output} \
  {input}

sourmash compute -k 21,31,51 \
  --scaled 2000 \
  --track-abundance \
  --name-from-first \
  -o {output} \
  {input}
```

We moved away from this strategy because `--name-from-first` named each signature after the first sequence in the FASTA file. While the species name of the organism was present in this name, the accession number was that of the first sequence fragment in the file, not of the genome assembly. As such, we revised our strategy so that signatures are named by genome assembly accession and species name. This requires parsing the `assembly_summary.txt` file.

## Sourmash database compression

## Legacy databases

### RefSeq microbial genomes - SBT

These databases are formatted for use with `sourmash search` and `sourmash gather`. They are calculated with a scaled value of 2000.

Approximately 91,000 microbial genomes (including viral and fungal) from NCBI RefSeq.

* [RefSeq k=21, 2018.03.29][0] - 3.3 GB - [manifest](https://osf.io/wamfk/download)
* [RefSeq k=31, 2018.03.29][1] - 3.3 GB - [manifest](https://osf.io/x3aut/download)
* [RefSeq k=51, 2018.03.29][2] - 3.4 GB - [manifest](https://osf.io/zpkau/download)

### Genbank microbial genomes - SBT

These databases are formatted for use with `sourmash search` and `sourmash gather`.

Approximately 98,000 microbial genomes (including viral and fungal) from NCBI Genbank.

* [Genbank k=21, 2018.03.29][3] - 3.9 GB - [manifest](https://osf.io/vm5kb/download)
* [Genbank k=31, 2018.03.29][4] - 3.9 GB - [manifest](https://osf.io/p87ec/download)
* [Genbank k=51, 2018.03.29][5] - 3.9 GB - [manifest](https://osf.io/cbxg9/download)

[0]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k21.sbt.zip
[1]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k31.sbt.zip
[2]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/refseq-k51.sbt.zip
[3]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k21.sbt.zip
[4]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k31.sbt.zip
[5]: https://sourmash-databases.s3-us-west-2.amazonaws.com/zip/genbank-k51.sbt.zip

### Genbank microbial genomes - LCA

These databases are formatted for use with `sourmash lca`; they are v2 LCA databases and will work with sourmash v2.0a11 and later. They are calculated with a scaled value of 10000 (1e4).

Approximately 87,000 microbial genomes (including viral and fungal) from NCBI Genbank.

* [Genbank k=21, 2017.11.07](https://osf.io/d7rv8/download), 109 MB
* [Genbank k=31, 2017.11.07](https://osf.io/4f8n3/download), 120 MB
* [Genbank k=51, 2017.11.07](https://osf.io/nemkw/download), 125 MB

The above LCA databases were calculated as follows:

```
sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
    genbank-k21.lca.json.gz -k 21 --scaled=10000 \
    -f --traverse-directory .sbt.genbank-k21 --split-identifiers
```

See [github.com/dib-lab/2018-ncbi-lineages](https://github.com/dib-lab/2018-ncbi-lineages) for information on preparing the genbank-genomes-taxonomy when signatures are generated using `--name-from-first`.

### GTDB databases - SBT

All files below are available [here](https://osf.io/wxf9z/).

Release 89

* [GTDB k=31, release 89](https://osf.io/5mb9k/download)

Release 95

* [GTDB k=21, scaled=1000](https://osf.io/4yhe2/download)
* [GTDB k=31, scaled=1000](https://osf.io/4n3m5/download)
* [GTDB k=51, scaled=1000](https://osf.io/c8wj7/download)
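
### Sketch: building signature names from `assembly_summary.txt`

The section "Sourmash signature names" says that newer signatures are named by genome assembly accession and species name, which requires parsing `assembly_summary.txt`. The following is a minimal sketch of that idea, not the code used to build the databases. It assumes a tab-separated file whose last `#` comment line holds the column names, including `assembly_accession` and `organism_name`, and it assumes the name is simply the accession followed by the organism name. The function name `names_from_assembly_summary` and the example row are hypothetical.

```python
import csv
import io


def names_from_assembly_summary(handle):
    """Map each assembly accession to '<accession> <organism_name>'.

    Assumption: the last '#'-prefixed line before the data rows
    contains the tab-separated column names.
    """
    header = None
    names = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if row[0].startswith("#"):
            # Later comment lines overwrite earlier ones, so the final
            # comment line is the one treated as the header.
            header = [col.lstrip("# ").strip() for col in row]
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        record = dict(zip(header, row))
        accession = record["assembly_accession"]
        names[accession] = f"{accession} {record['organism_name']}"
    return names


# Made-up, trimmed example: a real file has many more columns.
example = (
    "#   See the NCBI README for column descriptions\n"
    "# assembly_accession\ttaxid\torganism_name\n"
    "GCF_000000000.1\t0\tExamplebacter hypotheticus\n"
)
names = names_from_assembly_summary(io.StringIO(example))
print(names["GCF_000000000.1"])
```

Looking columns up by header name rather than by position keeps the sketch independent of the exact column order, which is not specified here.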