Using sourmash: a practical guide

So! You’ve installed sourmash, run a few of the tutorials and commands, and now you actually want to use it. This guide is here to answer some of your questions, and explain why we can’t answer others.

(If you have additional questions, please file an issue!)

What k-mer size(s) should I use?

You can build signatures at a variety of k-mer sizes all at once, and (unless you are working with very large metagenomes) the resulting signature files will still be quite small.

We suggest including k=31 and k=51. k=51 gives you the most stringent matches, and has very few false positives. k=31 may be more sensitive at the genus level.

Why 31 and 51, specifically? To a large extent these numbers were picked out of a hat, based on our reading of papers like the Metapalette paper (Koslicki and Falush, 2016). You could go with k=49 or k=53 and probably get very similar results to k=51. The general rule is that longer k-mer sizes are less prone to false positives. But you can pick your own parameters.

One additional wrinkle is that we provide a number of precalculated databases at k=21, k=31, and k=51. It is often convenient to calculate signatures at these sizes so that you can use these databases.
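
For example, a command along these lines (with genome.fa standing in for your own file) builds a single signature file containing sketches at all three of those k-mer sizes:

sourmash sketch dna -p k=21,k=31,k=51,scaled=1000 genome.fa -o genome.sig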

You’ll notice that all of the above numbers are odd. That is to avoid occasional minor complications from palindromic k-mers, where a k-mer and its reverse complement are identical; this cannot happen when k is odd. Odd k is not enforced by sourmash, however, and in practice it probably doesn’t matter much.

(When we have blog posts or publications providing more formal guidance, we’ll link to them here!)

What resolution should my signatures be and how should I create them?

sourmash supports two ways of choosing the resolution or size of your signatures: using num to specify the maximum number of hashes, or scaled to specify the compression ratio. Which should you use?

We suggest calculating all your signatures using -p scaled=1000. This will give you a compression ratio of 1000-to-1 while making it possible to detect regions of similarity in the 10kb range.

For comparison with more traditional MinHash approaches like mash: if you have a 5 Mbp genome and use -p scaled=1000, you will extract approximately 5000 hashes. So, for a 5 Mbp genome, a scaled of 1000 is roughly equivalent to using -p num=5000 in sourmash, or a sketch size of 5000 in mash.
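
To make that concrete: assuming a hypothetical 5 Mbp genome in ecoli.fna, the following two commands should produce sketches of roughly the same size (about 5000 hashes each), one set by compression ratio and one by a fixed number of hashes:

sourmash sketch dna -p scaled=1000,k=31 ecoli.fna -o ecoli-scaled.sig
sourmash sketch dna -p num=5000,k=31 ecoli.fna -o ecoli-num.sig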

The difference between using num and scaled matters most in metagenome analysis: fixing the number of hashes with num limits your ability to detect rare organisms, or alternatively results in very large signatures (e.g. if you use num larger than 10000). With scaled, the number of hashes retained grows with the size and diversity of the metagenome, so your resolution stays constant.

You can read more about this in the blog post from the mash folk, “Mash Screen: What’s in my sequencing run?” What we do with sourmash and scaled is similar to the ‘modulo hash’ mentioned in that blog post.

(Again, when we have formal guidance on this based on benchmarks, we’ll link to it here.)

What kind of input data does sourmash work on?

sourmash has been used most extensively with Illumina read data sets and assembled genomes, transcriptomes, and metagenomes. The high error rate of PacBio and Nanopore sequencing is problematic for k-mer based approaches and we have not yet explored how to tune parameters for this kind of sequencing.

On a more practical note, sourmash sketch will autodetect FASTA and FASTQ formats, whether they are uncompressed, gzipped, or bzip2-ed. Nothing special needs to be done.
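
For example, with hypothetical input files reads.fq.gz and genome.fa.bz2, both of these commands work as-is:

sourmash sketch dna -p scaled=1000,k=31 reads.fq.gz
sourmash sketch dna -p scaled=1000,k=31 genome.fa.bz2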

How should I prepare my data?

Raw Illumina read data sets should be k-mer abundance trimmed to get rid of the bulk of erroneous k-mers. We suggest a command like the following, using trim-low-abund.py from the khmer project:

trim-low-abund.py -C 3 -Z 18 -V -M 2e9 <all of your input read files>

This is safe to use on genomes, metagenomes, and transcriptomes. If you are working with large genomes or diverse metagenomes, you may need to increase the -M parameter to use more memory.
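
For example, to give trim-low-abund.py roughly 20 GB of memory for a large or diverse data set (the exact value depends on your data and your machine):

trim-low-abund.py -C 3 -Z 18 -V -M 20e9 <all of your input read files>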

See the khmer docs for trim-low-abund.py and the semi-streaming preprint for more information.

For high coverage genomic data, you can do very stringent trimming with an absolute cutoff, e.g.

trim-low-abund.py -C 10 -M 2e9 <all of your input read files>

will eliminate all k-mers that appear fewer than 10 times in your data set. This kind of trimming will dramatically reduce your sensitivity when working with metagenomes and transcriptomes, however, where there are always real low-abundance k-mers present.

Could you just give us the !#%#!$ command line?

Sorry, yes! See below.

Calculating signatures for read files:

trim-low-abund.py -C 3 -Z 18 -V -M 2e9 input-reads-1.fq input-reads-2.fq ...
sourmash sketch dna -p scaled=1000,k=21,k=31,k=51 input-reads*.fq.abundtrim \
    --merge SOMENAME -o SOMENAME-reads.sig

The first command trims off low-abundance k-mers from high-coverage reads; the second takes all the trimmed read files, subsamples k-mers from them at 1000:1, and outputs a single merged signature named ‘SOMENAME’ into the file SOMENAME-reads.sig.

Calculating a combined signature for multiple read files:

sourmash sketch dna -p scaled=1000,k=21,k=31,k=51 sample_*.fq.gz \
    --name "combined sketch for sample" -o sample.zip

This will build combined sketches of all *.fq.gz files in the directory at three k-mer sizes: k=21, k=31, and k=51. The three sketches will be named “combined sketch for sample” and saved to sample.zip.

Creating signatures for individual genome files:

sourmash sketch dna -p scaled=1000,k=21,k=31,k=51 *.fna.gz --name-from-first

This command creates signatures for all *.fna.gz files, and names each signature based on the first FASTA header in each file (that’s what the option --name-from-first does). The signatures will be placed in *.fna.gz.sig.

Creating signatures from a collection of genomes in a single file:

sourmash sketch dna -p scaled=1000,k=21,k=31,k=51 file.fa --singleton

This creates signatures for all individual FASTA sequences in file.fa, names them based on their FASTA headers, and places them all in a single .sig file, file.fa.sig. (This behavior is triggered by the option --singleton, which tells sourmash to treat each sequence in the file as an independent signature.)

How do I store and search collections of signatures?

sourmash supports a variety of signature loading and storage options for flexibility. If you have only a few hundred signatures, here are some options:

  • you can put all your signature files in a directory and search them all using the path to the directory.

  • you can use sourmash sig cat to concatenate multiple signatures into a single file (see the example after this list).

  • you can compress any signature file using gzip and sourmash will still load it.
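
As a quick sketch of the second and third options above, with made-up file names, you might do:

sourmash sig cat genome1.sig genome2.sig genome3.sig -o collection.sig
gzip collection.sig
sourmash search -k 31 query.sig collection.sig.gz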

If you have more than a few hundred genome signatures that you regularly search, it might be worth creating an indexed database of them that will support faster searches.

sourmash supports two types of indexed databases: Sequence Bloom Trees, or SBTs; and reverse indices, or LCAs. (You can read more detail about their implementation and design considerations in Chapter 2 of Dr. Luiz Irber’s thesis, “Efficient indexing of collections of signatures”.)

Sequence Bloom Tree (SBT) indexed databases

Sequence Bloom Trees (SBTs) (see Solomon and Kingsford, 2016) are on-disk databases that support low-memory querying of tens to hundreds of thousands of signatures. They can be created using sourmash index.

SBTs are the lowest-memory way to run search or gather on a collection of signatures. The tradeoff is that they may be quite large on disk, because SBTs also store the intermediate nodes of the tree. The default way to store SBTs is in a zip file with the extension .sbt.zip, which can be built and searched directly from the command line.
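
As a rough sketch, assuming a directory of genome .sig files built as above and a query signature query.sig, you could create and then search an SBT like so:

sourmash index -k 31 genomes.sbt.zip *.sig
sourmash search -k 31 query.sig genomes.sbt.zip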

Reverse indexed (LCA) databases

Reverse indexed or LCA databases are in-memory databases that, once loaded from disk, support fast search and gather across tens of thousands of signatures. They can be created using sourmash lca index.

LCA databases are currently stored in JSON files (that can be gzipped). As these files get larger, the time required to load them from disk can be substantial.

LCA databases are also currently (sourmash 2.0-4.0) the only databases that support the inclusion of taxonomic information in the database, and there is an associated collection of commands under sourmash lca. However, they can also be used as regular indexed databases for search and gather as above.
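
A minimal sketch of building and using one, assuming a taxonomy spreadsheet taxonomy.csv and genome signatures built as above (both names are placeholders; see the sourmash lca index documentation for the expected CSV format):

sourmash lca index taxonomy.csv genomes.lca.json *.sig
sourmash gather query.sig genomes.lca.json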

(These are called “LCA databases” because they originally were created to support “lowest common ancestor” taxonomic analyses, e.g. like Kraken; their functionality has evolved a lot since, but their name hasn’t changed to match!)