Using sourmash LCA to do taxonomic classification

The sourmash lca sub-commands do k-mer classification using an “lowest common ancestor” approach. See “Some discussion” below for links and details.

This tutorial should run without modification on Linux or Mac OS X, under Miniconda.

You’ll need about 5 GB of free disk space to download the database, and about 5 GB of RAM to search it. The tutorial should take about 20 minutes total to run.

Note, we have successfully tested it on binder.pangeo.io if you want to give it a try!

Install miniconda

If you don’t have the conda command installed, you’ll need to install miniconda for Python 3.x.

On Linux, this should work:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
echo export PATH="$HOME/miniconda3/bin:$PATH" >> ~/.bash_profile
source ~/.bash_profile

otherwise, follow the miniconda install.

Enable bioconda

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Install sourmash

To install sourmash, create a new environment named smash and install sourmash:

conda create -y -n smash sourmash

and then activate:

conda activate smash

You should now be able to use the sourmash command:

sourmash info

Download some files

Next, download a genbank LCA database for k=31:

curl -L -o genbank-k31.lca.json.gz https://osf.io/4f8n3/download

Download a random genome from genbank:

curl -L -o some-genome.fa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/178/875/GCF_000178875.2_ASM17887v2/GCF_000178875.2_ASM17887v2_genomic.fna.gz

Create a signature for this genome:

sourmash sketch dna -p scaled=1000,k=31 --name-from-first some-genome.fa.gz

Now, classify the signature with sourmash lca classify,

sourmash lca classify --db genbank-k31.lca.json.gz \
    --query some-genome.fa.gz.sig

and this will give you a taxonomic identification of your genome bin, classified using all of the genbank microbial genomes:

loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
outputting classifications to stdout
ID,status,superkingdom,phylum,class,order,family,genus,species,strain
... classifying NC_016901.1 Shewanella baltica OS678, complete genome (file 1 of"NC_016901.1 Shewanella baltica OS678, complete genome",found,Bacteria,Proteobacteria,Gammaproteobacteria,Alteromonadales,Shewanellaceae,Shewanella,Shewanella baltica,
classified 1 signatures total

You can also summarize the taxonomic distribution of the content with lca summarize:

sourmash lca summarize --db genbank-k31.lca.json.gz \
    --query some-genome.fa.gz.sig

which will show you:

loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
loaded 1 signatures from 1 files total.
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae
97.9%   520   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales
99.6%   529   Bacteria;Proteobacteria;Gammaproteobacteria
99.6%   529   Bacteria;Proteobacteria
99.6%   529   Bacteria
45.4%   241   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica

To apply this to your own genome(s), replace some-genome.fa.gz above with your own filename(s).

You can also specify multiple databases and multiple query signatures on the command line; separate them with --db or --query.

Building your own LCA database

(This is an abbreviated version of this blog post, updated to use the sourmash lca commands.)

Download some pre-calculated signatures:

curl -L https://osf.io/bw8d7/download -o delmont-subsample-sigs.tar.gz
tar xzf delmont-subsample-sigs.tar.gz

Next, grab the associated taxonomy spreadsheet

curl -O -L https://github.com/ctb/2017-sourmash-lca/raw/master/tara-delmont-SuppTable3.csv

Build a sourmash LCA database named delmont.lca.json:

sourmash lca index -f tara-delmont-SuppTable3.csv delmont.lca.json delmont-subsample-sigs/*.sig

Using the LCA database to classify signatures

We can now use delmont.lca.json to classify signatures with k-mers according to the database we just created. (Note, the database is completely self-contained at this point.)

Let’s classify a single signature:

sourmash lca classify --db delmont.lca.json \
    --query delmont-subsample-sigs/TARA_RED_MAG_00003.fa.gz.sig

and you should see:

loaded 1 databases for LCA use.
ksize=31 scaled=10000
outputting classifications to stdout
ID,status,superkingdom,phylum,class,order,family,genus,species
TARA_RED_MAG_00003,found,Bacteria,Proteobacteria,Gammaproteobacteria,,,,
classified 1 signatures total

You can classify a bunch of signatures and also specify an output location for the CSV:

sourmash lca classify --db delmont.lca.json \
    --query delmont-subsample-sigs/*.sig \
    -o out.csv

The lca classify command supports multiple databases as well as multiple queries; e.g. sourmash lca classify --db delmont.lca.json other.lca.json will classify based on the combination of taxonomies in the two databases.

Some discussion

Sourmash LCA is using k-mers to do taxonomic classification, using the “lowest common ancestor” approach (pioneered by Kraken, and described here), to identify each k-mer. From this it can either find a consensus taxonomy between all the k-mers (sourmash classify) or it can summarize the mixture of k-mers present in one or more signatures (sourmash summarize).

The sourmash lca index command can be used to prepare custom taxonomy databases; sourmash will happily ingest any taxonomy, whether or not it matches NCBI. See the spreadsheet from Delmont et al., 2017 for an example format.

Return to index