Prepared search databases

RefSeq microbial genomes - SBT

These databases are formatted for use with sourmash search and sourmash gather. They are calculated with a scaled value of 2000.

Approximately 60,000 microbial genomes (including viral and fungal) from NCBI RefSeq.

Genbank microbial genomes - SBT

These databases are formatted for use with sourmash search and sourmash gather.

Approximately 100,000 microbial genomes (including viral and fungal) from NCBI Genbank.

Details

The individual signatures for the above SBTs were calculated as follows:

sourmash compute -k 4,5 \
                 -n 2000 \
                 --track-abundance \
                 --name-from-first \
                 -o {output} \
                 {input}

sourmash compute -k 21,31,51 \
                 --scaled 2000 \
                 --track-abundance \
                 --name-from-first \
                 -o {output} \
                 {input}

See github.com/dib-lab/sourmash_databases for a Snakemake workflow to build the databases.
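
As a rough sketch of how one of these SBT databases might be queried, the commands below build a query signature with a matching k-mer size and scaled value and then run sourmash search and sourmash gather against it. The file names (my-genome.fa, my-genome.sig, refseq-k31.sbt.json) are placeholders assumed for illustration, not names taken from the downloads; substitute the paths on your system. The sketch also assumes the same sourmash compute command used in this document is available in your installed version.

```shell
#!/bin/sh
# Placeholder names (assumptions): replace with your own files.
QUERY_FASTA="my-genome.fa"        # your input sequences
QUERY_SIG="my-genome.sig"         # where the query signature is written
DATABASE="refseq-k31.sbt.json"    # the SBT index file you unpacked

if command -v sourmash >/dev/null 2>&1 && [ -f "$QUERY_FASTA" ] && [ -f "$DATABASE" ]; then
    # Build a query signature at k=31 with the databases' scaled value of 2000.
    sourmash compute -k 31 --scaled 2000 -o "$QUERY_SIG" "$QUERY_FASTA"
    # Report the closest matches in the database.
    sourmash search "$QUERY_SIG" "$DATABASE" -k 31
    # Decompose the query into the genomes that best explain it.
    sourmash gather "$QUERY_SIG" "$DATABASE" -k 31
    STATUS="ran"
else
    # Nothing is executed unless sourmash and both input files are present.
    STATUS="skipped"
    echo "skipped: need sourmash on PATH plus $QUERY_FASTA and $DATABASE"
fi
```

The k-mer size passed with -k has to be one of the sizes the database was built with (21, 31, or 51 for the scaled signatures above).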

Genbank LCA Database

These databases are formatted for use with sourmash lca; they are v2 LCA databases and will work with sourmash v2.0a11 and later. They are calculated with a scaled value of 10000 (1e4).

Approximately 87,000 microbial genomes (including viral and fungal) from NCBI Genbank.

Details

The above LCA databases were calculated as follows:

sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
    genbank-k21.lca.json.gz -k 21 --scaled=10000 \
    -f --traverse-directory .sbt.genbank-k21 --split-identifiers

See github.com/dib-lab/2018-ncbi-lineages for information on preparing the genbank-genomes-taxonomy file.
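
A minimal sketch of classifying a query against the resulting LCA database is shown below. The query file names (my-genome.fa, my-genome.sig) are placeholders assumed for illustration; genbank-k21.lca.json.gz is the output name from the indexing command above. The query signature is computed at k=21 with a scaled value of 10000 to match the database, again assuming the sourmash compute command used in this document is available in your installed version.

```shell
#!/bin/sh
# Placeholder names (assumptions): replace with your own files.
QUERY_FASTA="my-genome.fa"              # your input sequences
QUERY_SIG="my-genome.sig"               # where the query signature is written
LCA_DB="genbank-k21.lca.json.gz"        # LCA database built or downloaded as above

if command -v sourmash >/dev/null 2>&1 && [ -f "$QUERY_FASTA" ] && [ -f "$LCA_DB" ]; then
    # Build a query signature matching the database's k-mer size and scaled value.
    sourmash compute -k 21 --scaled 10000 -o "$QUERY_SIG" "$QUERY_FASTA"
    # Assign a taxonomic lineage to the query from the LCA database.
    sourmash lca classify --db "$LCA_DB" --query "$QUERY_SIG"
    LCA_STATUS="ran"
else
    # Nothing is executed unless sourmash and both input files are present.
    LCA_STATUS="skipped"
    echo "skipped: need sourmash on PATH plus $QUERY_FASTA and $LCA_DB"
fi
```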