Welcome to sourmash!
sourmash is a command-line tool and Python/Rust library for metagenome analysis and genome comparison using k-mers. It supports the compositional analysis of metagenomes, rapid search of large sequence databases, and flexible taxonomic profiling with both NCBI and GTDB taxonomies (see our prepared databases for more information). sourmash works well with sequences 30 kb or larger, including bacterial and viral genomes.
You might try sourmash if you want to:
identify which reference genomes to use for metagenomic read mapping;
search all Genbank microbial genomes with a sequence query;
cluster hundreds or thousands of genomes by similarity;
taxonomically classify genomes or metagenomes against NCBI and/or GTDB;
search thousands of metagenomes with a query genome or sequence.
New! The sourmash project also supports querying all 1 million publicly available metagenomes in the Sequence Read Archive. Give it a try!
Our vision: sourmash strives to support biologists in analyzing modern sequencing data at high resolution and with full context, including all public reference genomes and metagenomes.
This project’s mission is to provide practical tools and approaches for analyzing extremely large sequencing data sets, with an emphasis on high resolution results. Our designs follow these guiding principles:
genomic and metagenomic analyses should be able to make use of all available reference genomes.
metagenomic analyses should support assembly independent approaches, to avoid biases stemming from low coverage or high strain variability.
private and public databases should be equally well supported.
a variety of data structures and algorithms are necessary to support a wide set of use cases, including efficient command-line analysis, real-time queries, and massive-scale batch analyses.
our tools should be well behaved members of the bioinformatics analysis tool ecosystem, and use common installation approaches, standard formats, and semantic versioning.
our tools should be robustly tested, well documented, and supported.
we discuss scientific and computational tradeoffs and make specific recommendations where possible, admitting uncertainty as needed.
How does sourmash work?
Underneath, sourmash uses FracMinHash sketches for fast and lightweight sequence comparison; FracMinHash builds on MinHash sketching to support both Jaccard similarity and containment analyses with k-mers. This significantly expands the range of operations that can be done quickly and in low memory. sourmash also implements a number of new and powerful techniques for analysis, including minimum metagenome covers and alignment-free ANI estimation.
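To make the idea concrete, here is a minimal sketch of FracMinHash-style sketching in plain Python. It is an illustration only, not sourmash's implementation: the function names are hypothetical, and the hash function (BLAKE2b truncated to 64 bits), `k=21`, and `scaled=10` are assumptions chosen for the example. The sketch keeps every k-mer hash that falls below a fixed fraction (1/`scaled`) of the hash space, so two sketches built with the same parameters can be compared for both Jaccard similarity and containment.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space used in this example
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield each k-mer as the smaller of itself and its reverse complement."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(COMPLEMENT)[::-1]
        yield min(kmer, revcomp)


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scaled=10):
    """Keep the hashes that fall in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash64, canonical_kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimated Jaccard similarity: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Estimated fraction of A that is contained in B: |A & B| / |A|."""
    return len(a & b) / len(a) if a else 0.0


# Toy data: a random "genome" and a query taken from its first quarter.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(20000))
query = genome[:5000]

genome_sketch = frac_minhash(genome)
query_sketch = frac_minhash(query)

print("Jaccard:    ", jaccard(query_sketch, genome_sketch))
print("Containment:", containment(query_sketch, genome_sketch))
```

Because the retained hashes depend only on the k-mer and the threshold, every hash in the query sketch also appears in the genome sketch, so the query is reported as fully contained even though the Jaccard similarity of the two sequences is much lower. This is the asymmetry that makes containment useful when a small query is compared against a much larger genome or metagenome.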
sourmash is inspired by mash, and supports most mash analyses. sourmash also implements an expanded set of functionality for metagenome and taxonomic analysis.
While sourmash is currently single-threaded, the branchwater plugin for sourmash provides faster and lower-memory multithreaded implementations of several important sourmash features: sketching, searching, and gather (metagenome decomposition). It does so by implementing higher-level functions in Rust on top of the core Rust library of sourmash. As a result, it provides some of the same functionality as sourmash, but runs 10-100x faster and uses 10x less memory. Note that this code is functional and tested, but does not have all of the features of sourmash. Code and features will be integrated back into sourmash as they mature.
sourmash development was initiated with a grant from the Moore Foundation under the Data Driven Discovery program, and has been supported by further funding from the NIH and NSF. Please see funding acknowledgements for details!
Tutorials and examples
These command-line tutorials should work on Mac OS X and Linux. They require about 5 GB of disk space and 5 GB of RAM.