A quickstart guide to using RocksDB indexing¶
This quickstart shows how to index the GTDB species representatives database for fast search using sourmash index
with the RocksDB index type.
NOTE: You’ll need sourmash v4.9.0 or later to run this tutorial.
Download a database¶
Let’s start by downloading the GTDB rs226 “reps” database:
curl -JLO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs226/gtdb-rs226-reps.k31.sig.zip
This database is 3.7 GB and contains sketches of the 143k genomes from GTDB rs226.
If you inspect the database with sourmash, you will see:
% sourmash sig summarize gtdb-rs226-reps.k31.sig.zip
...
is database? yes
has manifest? yes
num signatures: 143384
** examining manifest...
total hashes: 460383140
summary of sketches:
143384 sketches with DNA, k=31, scaled=1000 460383140 total hashes
Build an index¶
Run the sourmash index
command to build a RocksDB index:
sourmash index -F rocksdb gtdb-rs226-reps.k31.rocksdb gtdb-rs226-reps.k31.sig.zip
This will take about 2 hours and 10 GB of RAM (and may be significantly faster with an SSD). The time and memory should scale approximately with the total number of hashes in the database (460m, per the summarize
output above). The final size of the gtdb-rs226-reps.k31.rocksdb
directory is 12 GB - again, this should roughly scale with the total number of hashes in the database.
RocksDB indexes can only contain a single type of sketch, so if the source database contains more than one ksize, moltype, or scaled, you’ll need to specify which one to index. The command above could be modified to include -k 31 --scaled 1000 --dna
- but, in the case of this GTDB rs226 database, there is only one type of sketch, so sourmash automatically figures out what to index.
You can now run sourmash sig summarize
on the index:
% sourmash sig summarize gtdb-rs226-reps.k31.rocksdb
...
** loading from 'gtdb-rs226-reps.k31.rocksdb'
path filetype: DiskRevIndex
location: gtdb-rs226-reps.k31.rocksdb
is database? yes
has manifest? yes
num signatures: 143384
** examining manifest...
total hashes: 460383140
summary of sketches:
143384 sketches with DNA, k=31, scaled=1000 460383140 total hashes
RocksDB indexes support all of the sourmash signature collection commands - they are a full equivalent to any other sourmash collection.
Search the RocksDB index¶
Let’s try searching the index with sourmash search
.
First, grab a signature with sig grep
. I’ve chosen this one at random.
sourmash sig grep GCA_036718335 gtdb-rs226-reps.k31.sig.zip -o query.sig
now search the RocksDB index with the query:
sourmash search query.sig gtdb-rs226-reps.k31.rocksdb
You should see:
977 matches above threshold 0.080; showing first 3:
similarity match
---------- -----
100.0% GCA_036718335.1 Coriobacteriaceae bacterium
21.5% GCF_040098165.1 Collinsella sp. CLA-JM-H32
21.3% GCA_031292915.1 Collinsella sp.
Searching the RocksDB index takes about 15 seconds and 400 MB; by contrast, searching the .sig.zip
database takes 10 minutes and 4.1 GB.
Additional notes on RocksDB indexes¶
RocksDB indexes in sourmash are an inverted index, which store a mapping from each k-mer hash to the set of sketches that contain that k-mer. This means that query performance will scale with the size of the query, rather than with the number of signatures in the database.
RocksDB indexes are incredibly scalable; the branchwater petabase scale search system uses them internally.
The branchwater plugin provides a somewhat more flexible indexing command, as well. RocksDB databases produced by sourmash and the branchwater plugin are fully interoperable.