sourmash Python API¶
The primary programmatic way of interacting with sourmash is via its Python API.
Please also see examples of using the API.
MinHash: basic MinHash sketch functionality¶
- class sourmash.MinHash(n, ksize, *, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]¶
The core sketch object for sourmash.
MinHash objects store and provide functionality for subsampled hash values from DNA, RNA, and amino acid sequences. MinHash supports both the standard MinHash behavior (bounded size, or num) and a non-standard “modulo hash” behavior (scaled). Please see the API examples for more information.
Basic usage:
>>> from sourmash import MinHash
>>> mh1 = MinHash(n=20, ksize=3)
>>> mh1.add_sequence('ATGAGAGACGATAGACAGATGAC')
>>> mh2 = MinHash(n=20, ksize=3)
>>> mh2.add_sequence('ATGAGActCGATAGaCAGATGAC')
>>> round(mh1.similarity(mh2), 2)
0.85
- __init__(n, ksize, *, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]¶
Create a sourmash.MinHash object.
- To create a standard (num) MinHash, use: MinHash(<num>, <ksize>, ...)
- To create a scaled MinHash, use: MinHash(0, <ksize>, scaled=<int>, ...)
- Optional arguments:
is_protein (default False) - aa k-mers
dayhoff (default False) - dayhoff encoding
hp (default False) - hydrophilic/hydrophobic aa
track_abundance (default False) - track hash multiplicity
mins (default None) - list of hashvals, or (hashval, abund) pairs
seed (default 42) - murmurhash seed
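The difference between the two construction modes can be pictured with a small standard-library sketch. It is not sourmash's implementation: `toy_hash` is a hypothetical stand-in for the seeded MurmurHash that sourmash uses, reverse complements are ignored, and the threshold `2**64 // scaled` is an assumption about how a scaled value maps onto the hash space.

```python
import hashlib

def toy_hash(kmer: str) -> int:
    """Hypothetical 64-bit stand-in hash (sourmash uses MurmurHash with a seed)."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def kmers(seq: str, ksize: int) -> list:
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + ksize] for i in range(len(seq) - ksize + 1)]

def num_sketch(seq: str, ksize: int, num: int) -> list:
    """Bounded-size ("num") sketch: keep the num smallest distinct hashes."""
    return sorted({toy_hash(k) for k in kmers(seq, ksize)})[:num]

def scaled_sketch(seq: str, ksize: int, scaled: int) -> list:
    """"Scaled" sketch: keep every distinct hash below 2**64 // scaled (assumed threshold)."""
    threshold = 2**64 // scaled
    return sorted({h for h in map(toy_hash, kmers(seq, ksize)) if h < threshold})

seq = "ATGAGAGACGATAGACAGATGAC"
print(len(num_sketch(seq, 3, 5)))     # size is capped by num
print(len(scaled_sketch(seq, 3, 4)))  # size grows with the number of distinct k-mers
```

A num sketch has a fixed maximum size regardless of input length, while a scaled sketch keeps a roughly constant fraction of all distinct hashes.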
- add_many(hashes)[source]¶
Add many hashes to the sketch at once.
hashes can be either an iterable (list, set, etc.), or another MinHash object.
- avg_containment(other, *, downsample=False)[source]¶
Calculate average containment. Note: this is average of the containments, not count_common/ avg_denom
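The distinction in the note above can be made concrete with plain Python sets standing in for sketches. This is an illustrative sketch only; the function names below are hypothetical, and "avg_denom" is assumed to mean the mean of the two sketch sizes.

```python
def containment(a: set, b: set) -> float:
    """Fraction of the hashes in a that are also present in b."""
    return len(a & b) / len(a) if a else 0.0

def avg_containment(a: set, b: set) -> float:
    """Mean of the two directional containments."""
    return (containment(a, b) + containment(b, a)) / 2

def common_over_avg_size(a: set, b: set) -> float:
    """Shared hashes divided by the mean sketch size (the quantity NOT used)."""
    return len(a & b) / ((len(a) + len(b)) / 2)

a = {1, 2, 3, 4}
b = {3, 4, 5, 6, 7, 8, 9, 10}
print(avg_containment(a, b))       # 0.375
print(common_over_avg_size(a, b))  # 0.3333333333333333
```

The two quantities agree only when both sketches hold the same number of hashes.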
- avg_containment_ani(other, *, downsample=False, prob_threshold=0.001)[source]¶
Calculate average containment ANI. Note: this is the average of the containment ANIs, not ANI computed from count_common / avg_denom.
- containment_ani(other, *, downsample=False, containment=None, confidence=0.95, estimate_ci=False, prob_threshold=0.001)[source]¶
Use self contained by other to estimate ANI between two MinHash objects.
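As a rough illustration of how a containment value can be turned into an ANI point estimate, the sketch below applies the commonly used relation ANI ≈ containment^(1/ksize). This is an assumption for illustration: it does not reproduce the confidence interval or prob_threshold handling exposed by the method, and the function name is hypothetical.

```python
def containment_to_ani_point_estimate(containment: float, ksize: int) -> float:
    """Point estimate only: assumes each k-mer survives with probability ANI**ksize."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / ksize)

for c in (1.0, 0.5, 0.1):
    print(c, round(containment_to_ani_point_estimate(c, ksize=31), 4))
```

Because of the 1/ksize exponent, even modest containment values correspond to fairly high ANI estimates at typical k-mer sizes.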
- copy()¶
Create a new copy of this MinHash.
- count_common(other, downsample=False)[source]¶
Return the number of hashes in common between self and other.
Optionally downsample scaled objects to highest scaled value.
- downsample(*, num=None, scaled=None)[source]¶
Copy this object and downsample new object to either num or scaled.
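Downsampling to the highest scaled value before counting shared hashes can be sketched as follows. Plain sets of integers stand in for sketches, the helper names are hypothetical, and the cutoff `2**64 // scaled` is an assumed approximation of how scaled maps to a maximum hash.

```python
MAX_HASH = 2**64  # assumed size of the hash space

def downsample_scaled(hashes: set, new_scaled: int) -> set:
    """Return a new set keeping only hashes below the cutoff for new_scaled."""
    cutoff = MAX_HASH // new_scaled
    return {h for h in hashes if h < cutoff}

def count_common(a: set, a_scaled: int, b: set, b_scaled: int) -> int:
    """Downsample both sketches to the higher scaled value, then intersect."""
    target = max(a_scaled, b_scaled)
    return len(downsample_scaled(a, target) & downsample_scaled(b, target))

a = {10, 2**61, 2**62 + 5}  # consistent with scaled=2 (all below 2**63)
b = {10, 2**61, 7}          # consistent with scaled=4 (all below 2**62)
print(count_common(a, 2, b, 4))  # 2
```

Downsampling only ever discards hashes, which is why it can go from a lower scaled value to a higher one but not back.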
- get_hashes()[source]¶
Return the list of hashes.
Deprecated since version 3.5: This will be removed in 5.0. Use .hashes property instead.
- get_mins(with_abundance=False)[source]¶
Return list of hashes or, if with_abundance is set, a list of (hash, abund).
Deprecated since version 3.5: This will be removed in 5.0. Use .hashes property instead.
- inflate(from_mh)[source]¶
Return a new MinHash object with abundances taken from ‘from_mh’.
Note that this implicitly does an intersection: hashes that have no abundance in ‘from_mh’ are set to abundance 0 and removed from ‘self’.
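The implicit intersection can be illustrated with a set of flat hashes and a dictionary of abundances. This is a toy model of the behavior described above, not sourmash code; `inflate_hashes` is a hypothetical name.

```python
def inflate_hashes(flat_hashes: set, from_abundances: dict) -> dict:
    """Attach abundances to flat hashes; hashes without an abundance are dropped."""
    return {
        h: from_abundances[h]
        for h in flat_hashes
        if from_abundances.get(h, 0) > 0
    }

flat = {1, 2, 3}
abundances = {1: 5, 3: 2, 4: 9}
print(inflate_hashes(flat, abundances))  # hash 2 is dropped, hash 4 is never added
```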
- intersection_and_union_size(other)[source]¶
Calculate intersection and union sizes between self and other.
- jaccard_ani(other, *, downsample=False, jaccard=None, prob_threshold=0.001, err_threshold=0.0001)[source]¶
Use jaccard to estimate ANI between two MinHash objects.
- kmers_and_hashes(sequence, *, force=False, is_protein=False)[source]¶
Convert sequence into (k-mer, hashval) tuples without adding it to the sketch.
If input sequence is DNA and this is a protein, dayhoff, or hp MinHash, translate the DNA appropriately before hashing.
If input sequence is protein, set is_protein=True.
If ‘force’ is True, invalid k-mers will be represented with ‘None’.
- max_containment_ani(other, *, downsample=False, max_containment=None, confidence=0.95, estimate_ci=False, prob_threshold=0.001)[source]¶
Use max_containment to estimate ANI between two MinHash objects.
- property max_hash¶
Deprecated since version 3.5: This will be removed in 5.0. Use scaled instead.
- remove_many(hashes)[source]¶
Remove many hashes from a sketch at once.
hashes can be either an iterable (list, set, etc.), or another MinHash object.
- seq_to_hashes(sequence, *, force=False, bad_kmers_as_zeroes=False, is_protein=False)[source]¶
Convert sequence to hashes without adding to the sketch.
If input sequence is DNA and this is a protein, dayhoff, or hp MinHash, translate the DNA appropriately before hashing.
If input sequence is protein, set is_protein=True.
If force = True and bad_kmers_as_zeroes = True, the hashes of invalid k-mers will be represented as 0.
- set_abundances(values, clear=True)[source]¶
Set abundances for hashes from values, where values[hash] = abund.
If abund value is set to zero, the hash will be removed from the sketch. abund cannot be set to a negative value.
- similarity(other, ignore_abundance=False, downsample=False)[source]¶
Calculate similarity of two sketches.
If the sketches are not abundance weighted, or ignore_abundance=True, compute Jaccard similarity.
If the sketches are abundance weighted, calculate the angular similarity, a distance metric based on the cosine similarity.
Note, because the term frequencies (tf-idf weights) cannot be negative, the angle will never be < 0deg or > 90deg.
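The two similarity measures can be sketched with dictionaries mapping hash to abundance. The angular similarity below uses the formula 1 - 2·acos(cosine)/π, which is an assumption about the exact normalization; it maps the 0 to 90 degree range mentioned above onto 1 to 0.

```python
import math

def jaccard(a: dict, b: dict) -> float:
    """Jaccard similarity on the hash sets, ignoring abundance."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

def angular_similarity(a: dict, b: dict) -> float:
    """Angular similarity from the cosine of the abundance vectors (assumed form)."""
    dot = sum(abund * b.get(h, 0) for h, abund in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    cosine = min(1.0, dot / (norm_a * norm_b))  # clamp against rounding error
    return 1.0 - 2.0 * math.acos(cosine) / math.pi

x = {1: 2, 2: 3, 3: 1}
y = {2: 3, 3: 4, 4: 1}
print(round(jaccard(x, y), 3))
print(round(angular_similarity(x, y), 3))
```

Since abundances are non-negative, the cosine is never below 0, so the result always falls between 0 and 1.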
- size_is_accurate(relative_error=0.2, confidence=0.95)[source]¶
Computes the probability that the estimate sketch_size * scaled lies within relative_error of the true set_size. This relies on the fact that sketch_size is binomially distributed with parameters set_size and 1/scaled. The two-sided Chernoff bounds are used. Returns True if this probability is greater than or equal to the desired confidence.
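A minimal sketch of such a check, assuming the textbook two-sided Chernoff bound P(|X − μ| ≥ δμ) ≤ 2·exp(−δ²μ/3) with μ = set_size / scaled. It bounds the probability of deviating by more than relative_error and treats the size as accurate when the complementary probability reaches the confidence level. The function names and the exact constant are assumptions, not sourmash's code.

```python
import math

def prob_within_relative_error(set_size: float, scaled: int,
                               relative_error: float = 0.2) -> float:
    """Lower bound on P(estimate within relative_error), via a Chernoff bound."""
    mu = set_size / scaled  # expected sketch size
    deviation_bound = 2.0 * math.exp(-(relative_error ** 2) * mu / 3.0)
    return max(0.0, 1.0 - deviation_bound)

def size_is_accurate(sketch_size: int, scaled: int,
                     relative_error: float = 0.2,
                     confidence: float = 0.95) -> bool:
    """True when the bound says the size estimate is trustworthy enough."""
    estimated_set_size = sketch_size * scaled
    return prob_within_relative_error(estimated_set_size, scaled,
                                      relative_error) >= confidence

print(size_is_accurate(sketch_size=1000, scaled=1000))  # True
print(size_is_accurate(sketch_size=10, scaled=1000))    # False
```

Under this bound the outcome depends only on how many hashes the sketch holds, which is why very small sketches fail the check.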
SourmashSignature: save and load MinHash sketches in JSON¶
Save and load MinHash sketches in a JSON format, along with some metadata.
- class sourmash.signature.FrozenSourmashSignature(minhash, name='', filename='')[source]¶
Frozen (immutable) signature class.
- update()[source]¶
Make a mutable copy of this signature for modification, then freeze.
This is a context manager that implements:
new_sig = this_sig.copy()
new_sig.to_mutable()
# modify new_sig
new_sig.into_frozen()
This could be made more efficient by _not_ copying the signature, but that is non-intuitive and leads to hard-to-find bugs.
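The copy, unfreeze, modify, refreeze pattern can be sketched with a toy class. `ToySignature` is a hypothetical stand-in written only to show the context-manager shape; it does not mirror the real class's attributes or internals.

```python
import copy
from contextlib import contextmanager

class ToySignature:
    """Hypothetical frozen object; attribute writes fail unless unfrozen."""

    def __init__(self, name: str):
        object.__setattr__(self, "_frozen", False)
        self.name = name
        object.__setattr__(self, "_frozen", True)

    def __setattr__(self, key, value):
        if self._frozen:
            raise ValueError("signature is frozen")
        object.__setattr__(self, key, value)

    @contextmanager
    def update(self):
        """Yield a mutable copy, then freeze it again on exit."""
        new_sig = copy.copy(self)
        object.__setattr__(new_sig, "_frozen", False)
        try:
            yield new_sig
        finally:
            object.__setattr__(new_sig, "_frozen", True)

sig = ToySignature("original")
with sig.update() as new_sig:
    new_sig.name = "renamed"

print(sig.name, new_sig.name)  # original renamed
```

Working on a copy keeps the original untouched, which is the trade-off the note above accepts in exchange for avoiding hard-to-find bugs.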
- class sourmash.signature.SourmashSignature(minhash, name='', filename='')[source]¶
Main class for signature information.
- avg_containment(other, downsample=False)[source]¶
Calculate average containment. Note: this is average of the containments, not count_common/ avg_denom
- avg_containment_ani(other, *, downsample=False)[source]¶
Calculate average containment ANI. Note: this is the average of the containment ANIs, not ANI computed from count_common / avg_denom.
- contained_by(other, downsample=False)[source]¶
Compute containment by the other signature. Note: ignores abundance.
- containment_ani(other, *, downsample=False, containment=None, confidence=0.95, estimate_ci=False)[source]¶
Use containment to estimate ANI between two FracMinHash signatures.
- jaccard_ani(other, *, downsample=False, jaccard=None, prob_threshold=0.001, err_threshold=0.0001)[source]¶
Use jaccard to estimate ANI between two FracMinHash signatures.
- max_containment(other, downsample=False)[source]¶
Compute max containment with the other signature. Note: ignores abundance.
- max_containment_ani(other, *, downsample=False, max_containment=None, confidence=0.95, estimate_ci=False)[source]¶
Use max containment to estimate ANI between two FracMinHash signatures.
- sourmash.signature.load_signatures_from_json(data, ksize=None, select_moltype=None, ignore_md5sum=False, do_raise=False)[source]¶
Load a JSON string with signatures into classes.
Returns iterator over SourmashSignature objects.
Note, the order is not necessarily the same as what is in the source file.
SBT: save and load Sequence Bloom Trees in JSON¶
An implementation of sequence bloom trees, Solomon & Kingsford, 2015.
- class sourmash.sbt.GraphFactory(ksize, starting_size, n_tables)[source]¶
Build new nodegraphs (Bloom filters) of a specific (fixed) size.
Parameters¶
- ksize: int
k-mer size.
- starting_size: int
size (in bytes) for each nodegraph table.
- n_tables: int
number of nodegraph tables to be used.
- class sourmash.sbt.Leaf(metadata, data=None, name=None, storage=None, path=None)[source]¶
- property data¶
- class sourmash.sbt.Node(factory, name=None, path=None, storage=None)[source]¶
Internal node of SBT.
- property data¶
- class sourmash.sbt.SBT(factory, *, d=2, storage=None, cache_size=None)[source]¶
A Sequence Bloom Tree implementation allowing generic internal nodes and leaves.
The default node and leaf format is a Bloom Filter (like the original implementation), but we also provide a MinHash leaf class (in the sourmash.sbtmh.SigLeaf class)
Parameters¶
- factory: Factory
Callable for generating new datastores for internal nodes.
- d: int
Number of children for each internal node. Defaults to 2 (a binary tree)
- storage: Storage, default: None
A Storage is any place where we can save and load data for the nodes. If set to None, will use a FSStorage.
- cache_size: int, default None
Number of internal nodes to cache in memory. If set to None, will not remove any nodes from memory (cache grows without bounds).
Notes¶
We use two dicts to store the tree structure: One for the internal nodes, and another for the leaves (datasets).
- child(parent, pos)[source]¶
Return a child node at position pos under the parent node.
Parameters¶
- parent: int
Parent node position in the tree.
- pos: int
Position of the child under the parent. Ranges from 0 to arity - 1, where arity is the arity of the SBT (usually it is 2, a binary tree).
Returns¶
- NodePos
A NodePos namedtuple with the position and content of the child node.
- children(pos)[source]¶
Return all children nodes for node at position pos.
Parameters¶
- pos: int
Position of the node in the tree.
Returns¶
- list of NodePos
A list of NodePos namedtuples with the position and content of all children nodes.
- find(search_fn, query, **kwargs)[source]¶
Do a Jaccard similarity or containment search, yield results.
Here ‘search_fn’ should be an instance of ‘JaccardSearch’.
Queries with higher scaled values than the database can still be used for containment search, but not for similarity search. See SBT.select(…) for details.
- is_database = True¶
- classmethod load(location, *, leaf_loader=None, storage=None, print_version_warning=True, cache_size=None)[source]¶
Load an SBT description from a file.
Parameters¶
- location: str
path to the SBT description.
- leaf_loader: function, optional
function to load leaf nodes. Defaults to Leaf.load.
- storage: Storage, optional
Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path)
Returns¶
- SBT
the SBT tree built from the description.
- property location¶
Return a resolvable location for this index, if possible.
- parent(pos)[source]¶
Return the parent of the node at position pos.
If it is the root node (position 0), returns None.
Parameters¶
- pos: int
Position of the node in the tree.
Returns¶
- NodePos
A NodePos namedtuple with the position and content of the parent node.
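The position arithmetic behind child, children and parent can be sketched for a d-ary tree stored in level order with the root at position 0. That layout is an assumption for illustration, and unlike the methods above these helpers return bare integer positions rather than NodePos namedtuples.

```python
from typing import List, Optional

def child(parent: int, pos: int, d: int = 2) -> int:
    """Position of the pos-th child (0 <= pos < d) of the node at parent."""
    if not 0 <= pos < d:
        raise ValueError("pos must be in the range 0 to d - 1")
    return d * parent + pos + 1

def children(pos: int, d: int = 2) -> List[int]:
    """Positions of all d children of the node at pos."""
    return [child(pos, i, d) for i in range(d)]

def parent(pos: int, d: int = 2) -> Optional[int]:
    """Position of the parent, or None for the root (position 0)."""
    if pos == 0:
        return None
    return (pos - 1) // d

print(children(0))       # [1, 2]
print(children(1, d=3))  # [4, 5, 6]
print(parent(5, d=3))    # 1
```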
- save(path, storage=None, sparseness=0.0, structure_only=False)[source]¶
Saves an SBT description locally and node data to a storage.
Parameters¶
- path: str
path to where the SBT description should be saved.
- storage: Storage, optional
Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path)
- sparseness: float
How much of the internal nodes should be saved. Defaults to 0.0 (save all internal nodes data), can go up to 1.0 (don’t save any internal nodes data)
- structure_only: boolean
Write only the index schema and metadata, but not the data. Defaults to False (save data too)
Returns¶
- str
full path to the new SBT description
- select(ksize=None, moltype=None, num=0, scaled=0, containment=False, abund=None, picklist=None, **kwargs)[source]¶
Make sure this database matches the requested requirements.
Will always raise ValueError if a requirement cannot be met.
The only tricky bit here is around downsampling: if the scaled value being requested is higher than the signatures in the SBT, we can use the SBT for containment but not for similarity. This is because:
if we are doing containment searches, the intermediate nodes can still be used for calculating containment of signatures with higher scaled values. This is because only hashes that match in the higher range are used for containment scores.
however, for similarity, _all_ hashes are used, and we cannot implicitly downsample or necessarily estimate similarity if the scaled values differ.
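The asymmetry between containment and similarity can be seen with plain sets of hashes. In this illustrative sketch the database side has a lower scaled value (more hashes) than the query; the cutoff `2**64 // scaled` and the helper name are assumptions.

```python
MAX_HASH = 2**64  # assumed size of the hash space

def keep_below(hashes: set, scaled: int) -> set:
    """Hashes that would survive downsampling to the given scaled value."""
    return {h for h in hashes if h < MAX_HASH // scaled}

db = {5, 2**60, 2**62 + 1, 2**63 - 1}  # consistent with scaled=2
query = {5, 2**60, 9}                  # consistent with scaled=4

# Containment of the query only looks at hashes present in the query,
# so extra high-range hashes on the database side cannot change it.
direct_containment = len(query & db) / len(query)
downsampled_containment = len(query & keep_below(db, 4)) / len(query)

# Jaccard uses the union, so un-downsampled database hashes distort it.
naive_jaccard = len(query & db) / len(query | db)
db_down = keep_below(db, 4)
proper_jaccard = len(query & db_down) / len(query | db_down)

print(direct_containment == downsampled_containment)  # True
print(naive_jaccard, proper_jaccard)                  # the two values differ
```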
sourmash.fig: make plots and figures¶
Make plots using the distance matrix+labels output by sourmash compare.