Prepared search databases

RefSeq microbial genomes - SBT

These databases are formatted for use with sourmash search and sourmash gather.

Approximately 60,000 microbial genomes (including viral and fungal) from NCBI RefSeq.

Genbank microbial genomes - SBT

These databases are formatted for use with sourmash search and sourmash gather.

Approximately 100,000 microbial genomes (including viral and fungal) from NCBI Genbank.
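
As a rough usage sketch, either SBT can be queried along the following lines. The file names query.sig and genbank-k31.sbt.json are placeholders rather than names taken from the downloads, and the query signature is assumed to have been computed with a matching k-mer size and a compatible --scaled value. If sourmash or the files are not present, the script only prints the commands it would run.

```shell
# Placeholder names (assumptions): substitute your own query signature and
# the SBT index file unpacked from one of the databases above.
query_sig="query.sig"
sbt="genbank-k31.sbt.json"
ksize=31

# search: list database signatures similar to the query.
search_cmd="sourmash search -k $ksize $query_sig $sbt"
# gather: decompose the query (for example a metagenome) into matching genomes.
gather_cmd="sourmash gather -k $ksize $query_sig $sbt"

if command -v sourmash >/dev/null 2>&1 && [ -f "$query_sig" ] && [ -f "$sbt" ]; then
    $search_cmd
    $gather_cmd
else
    # Dry run: show what would be executed.
    echo "would run: $search_cmd"
    echo "would run: $gather_cmd"
fi
```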

Details

The individual signatures for the above SBTs were calculated as follows:

sourmash compute -k 4,5 \
    -n 2000 \
    --track-abundance \
    --name-from-first \
    -o {output} \
    {input}

sourmash compute -k 21,31,51 \
    --scaled 2000 \
    --track-abundance \
    --name-from-first \
    -o {output} \
    {input}

See https://github.com/dib-lab/sourmash_databases for a Snakemake workflow to build the databases.
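
In the commands above, {input} and {output} are placeholders that the workflow fills in for each genome. A minimal sketch of the same substitution done by hand is shown here for the scaled signatures. The genomes/ and sigs/ directory names and the .fna.gz extension are assumptions for illustration, not the layout used by the workflow, and the loop prints each command instead of running it.

```shell
# Assumed layout (hypothetical): genomes/*.fna.gz as input, sigs/*.sig as output.
genome_dir="genomes"
sig_dir="sigs"

# Print the scaled-signature command for a single genome file.
compute_cmd() {
    input="$1"
    output="$sig_dir/$(basename "$input").sig"
    echo "sourmash compute -k 21,31,51 --scaled 2000 --track-abundance --name-from-first -o $output $input"
}

for genome in "$genome_dir"/*.fna.gz; do
    [ -e "$genome" ] || continue   # skip the literal pattern when nothing matches
    compute_cmd "$genome"
done
```

Piping the printed lines to sh would run them; reviewing the output first is a cheap way to check the file naming.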

Genbank LCA Database

These databases are formatted for use with sourmash lca.

Approximately 87,000 microbial genomes (including viral and fungal) from NCBI Genbank.
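
A hedged usage sketch for the LCA database follows. The query file name query.sig is a placeholder, the database name is the k=21 output file named in the index command in this document, and the query signature is assumed to have been computed with -k 21 and a --scaled value. If sourmash or the files are not present, the script only prints the commands it would run.

```shell
query_sig="query.sig"               # placeholder (assumption)
lca_db="genbank-k21.lca.json.gz"    # k=21 LCA database file

# classify: assign a single taxonomic lineage to the query signature.
classify_cmd="sourmash lca classify --db $lca_db --query $query_sig"
# summarize: report the taxonomic content of the query, e.g. a metagenome.
summarize_cmd="sourmash lca summarize --db $lca_db --query $query_sig"

if command -v sourmash >/dev/null 2>&1 && [ -f "$query_sig" ] && [ -f "$lca_db" ]; then
    $classify_cmd
    $summarize_cmd
else
    # Dry run: show what would be executed.
    echo "would run: $classify_cmd"
    echo "would run: $summarize_cmd"
fi
```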

Details

The above LCA databases were calculated as follows:

sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
    genbank-k21.lca.json.gz -k 21 --scaled=10000 \
    -f --traverse-directory .sbt.genbank-k21 --split-identifiers

See https://github.com/dib-lab/2018-ncbi-lineages for information on preparing the genbank-genomes-taxonomy file.
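
The LCA index command uses --scaled=10000, while the SBT signatures were computed with --scaled 2000. Assuming the usual reading of --scaled (roughly one in N distinct k-mer hashes is kept), the arithmetic below gives a feel for the difference in sketch size. The genome size is a made-up round number, not a value from these databases.

```shell
# Hypothetical example: about 5 million distinct k-mers, roughly what a
# 5 Mbp bacterial genome might contain.
distinct_kmers=5000000

# --scaled N keeps roughly 1 in N distinct k-mer hashes.
hashes_at_2000=$(( distinct_kmers / 2000 ))
hashes_at_10000=$(( distinct_kmers / 10000 ))

echo "scaled=2000:  about $hashes_at_2000 hashes"
echo "scaled=10000: about $hashes_at_10000 hashes"
```

Under these assumptions the LCA database stores about one fifth as many hashes per genome as the SBT signatures, which trades some sensitivity for a smaller file.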