`sourmash` Python API ¶

The primary programmatic way of interacting with sourmash is via its Python API.

Please also see examples of using the API.

`MinHash`: basic MinHash sketch functionality ¶

class sourmash.MinHash(n, ksize, *, is_protein=False, dayhoff=False, hp=False, skipm1n3=False, skipm2n3=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]¶

The core sketch object for sourmash.

MinHash objects store subsampled hash values from DNA, RNA, and amino acid sequences, and provide operations on them. MinHash supports both the standard MinHash behavior (a sketch of bounded size, set with num) and a non-standard “modulo hash” behavior (set with scaled).

For more information, see the API examples at:

https://sourmash.readthedocs.io/en/latest/api-example.html#sourmash-minhash-objects-and-manipulations

Basic usage:

>>> from sourmash import MinHash
>>> mh1 = MinHash(n=20, ksize=3)
>>> mh1.add_sequence('ATGAGAGACGATAGACAGATGAC')

>>> mh2 = MinHash(n=20, ksize=3)
>>> mh2.add_sequence('ATGAGActCGATAGaCAGATGAC')

>>> round(mh1.similarity(mh2), 2)
0.85

__init__(n, ksize, *, is_protein=False, dayhoff=False, hp=False, skipm1n3=False, skipm2n3=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]¶

Create a sourmash.MinHash object.

To create a standard (num) MinHash, use:

MinHash(<num>, <ksize>, ...)

To create a scaled MinHash, use

MinHash(0, <ksize>, scaled=<int>, ...)

Optional arguments:

is_protein (default False) - aa k-mers
dayhoff (default False) - dayhoff encoding
hp (default False) - hydrophilic/hydrophobic aa
skipm1n3 (default False) - skipmer (m1n3)
skipm2n3 (default False) - skipmer (m2n3)
track_abundance (default False) - track hash multiplicity
mins (default None) - list of hashvals, or (hashval, abund) pairs
seed (default 42) - murmurhash seed
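
The difference between a num sketch and a scaled sketch can be illustrated without sourmash itself. The following is a minimal sketch under stated assumptions: `hash_kmer`, `num_sketch`, and `scaled_sketch` are hypothetical helpers, hash values are assumed to be 64-bit, and BLAKE2b stands in for the seeded MurmurHash that sourmash uses, so the hash values will not match real sourmash output.

```python
import hashlib

MAX_HASH_SPACE = 2**64  # assumption: hash values are 64-bit unsigned integers


def hash_kmer(kmer: str) -> int:
    """Stand-in 64-bit hash; sourmash itself uses a seeded MurmurHash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmers(sequence: str, ksize: int):
    """Yield every overlapping k-mer of length ksize."""
    for i in range(len(sequence) - ksize + 1):
        yield sequence[i:i + ksize]


def num_sketch(sequence: str, ksize: int, num: int) -> list:
    """Bounded-size sketch: keep the `num` smallest distinct hash values."""
    hashes = {hash_kmer(k) for k in kmers(sequence, ksize)}
    return sorted(hashes)[:num]


def scaled_sketch(sequence: str, ksize: int, scaled: int) -> list:
    """Scaled sketch: keep every distinct hash below 2**64 / scaled."""
    threshold = MAX_HASH_SPACE // scaled
    hashes = {hash_kmer(k) for k in kmers(sequence, ksize)}
    return sorted(h for h in hashes if h < threshold)


seq = "ATGAGAGACGATAGACAGATGAC"
bounded = num_sketch(seq, ksize=3, num=5)         # always at most 5 hashes
everything = scaled_sketch(seq, ksize=3, scaled=1)  # scaled=1 keeps all hashes
```

A num sketch has a fixed maximum size regardless of input length, while a scaled sketch grows with the number of distinct k-mers in the input.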

add_hash(h)[source]¶: Add a single hash value.

add_hash_with_abundance(h, a)[source]¶: Add a single hash value with an abundance.

add_kmer(kmer)[source]¶: Add a kmer into the sketch.

add_many(hashes)[source]¶

Add many hashes to the sketch at once.

hashes can be either an iterable (list, set, etc.), or another MinHash object.

add_protein(sequence)[source]¶: Add a protein sequence.

add_sequence(sequence, force=False)[source]¶: Add a sequence into the sketch.

angular_similarity(other, downsample=False)[source]¶: Calculate the angular similarity.

avg_containment(other, *, downsample=False)[source]¶: Calculate average containment. Note: this is the average of the two containments, not count_common / avg_denom.
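
The note on avg_containment is easier to see with a small worked example on plain Python sets. This is an illustrative sketch only: the functions below are hypothetical stand-ins operating on sets of hash values, not the sourmash methods.

```python
def containment(a: set, b: set) -> float:
    """Fraction of the hashes in a that are also present in b."""
    return len(a & b) / len(a) if a else 0.0


def max_containment(a: set, b: set) -> float:
    """Shared hashes divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def avg_containment(a: set, b: set) -> float:
    """Mean of the two directional containments."""
    return (containment(a, b) + containment(b, a)) / 2


a = {1, 2, 3, 4}
b = {3, 4, 5, 6, 7, 8, 9, 10}

average_of_containments = avg_containment(a, b)             # (2/4 + 2/8) / 2
common_over_avg_size = len(a & b) / ((len(a) + len(b)) / 2)  # 2 / 6
```

Here the average of the containments is 0.375, whereas dividing the shared count by the average set size gives roughly 0.333, so the two quantities differ whenever the sets have different sizes.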

avg_containment_ani(other, *, downsample=False, prob_threshold=0.001)[source]¶: Calculate average containment ANI. Note: this is the average of the two containment ANIs, not the ANI computed from count_common / avg_denom.

clear()[source]¶: Clears all hashes and abundances.

contained_by(other, downsample=False)[source]¶: Calculate how much of self is contained by other.

contained_by_weighted(other)[source]¶: Calculate how much of self is contained by other; weight by self. Note: automatically downsamples as needed.

containment_ani(other, *, downsample=False, containment=None, confidence=0.95, estimate_ci=False, prob_threshold=0.001)[source]¶: Use self contained by other to estimate ANI between two MinHash objects.

copy()¶: Create a new copy of this MinHash.

copy_and_clear()[source]¶: Create an empty copy of this MinHash.

count_common(other, downsample=False)[source]¶

Return the number of hashes in common between self and other.

Optionally downsample scaled objects to highest scaled value.

downsample(*, num=None, scaled=None)[source]¶: Copy this object and downsample new object to either num or scaled.
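
Downsampling scaled sketches to the highest scaled value before counting shared hashes can be pictured as follows. This is a sketch under assumptions, not the sourmash implementation: hashes are assumed to be 64-bit, a scaled sketch is assumed to keep the hashes below 2**64 / scaled, and `downsample_scaled` and `count_common` are hypothetical set-based helpers.

```python
MAX_HASH_SPACE = 2**64  # assumption: 64-bit hash values


def downsample_scaled(hashes: set, new_scaled: int) -> set:
    """Keep only the hashes that a sketch at `new_scaled` would retain."""
    threshold = MAX_HASH_SPACE // new_scaled
    return {h for h in hashes if h < threshold}


def count_common(a: set, a_scaled: int, b: set, b_scaled: int) -> int:
    """Downsample both sketches to the higher scaled value, then intersect."""
    target = max(a_scaled, b_scaled)
    return len(downsample_scaled(a, target) & downsample_scaled(b, target))


# A sketch at scaled=2 may hold hashes up to 2**63; one at scaled=4 only up to 2**62.
a = {2**60, 2**61, 2**62 + 5}   # built at scaled=2
b = {2**59, 2**60, 2**61}       # built at scaled=4

shared = count_common(a, 2, b, 4)
```

The hash `2**62 + 5` is dropped from `a` during downsampling, so only hashes that both sketches could have retained take part in the comparison.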

flatten()[source]¶: If track_abundance=True, return a new flattened MinHash.

get_hashes()[source]¶: Return the list of hashes.

Deprecated since version 3.5: This will be removed in 5.0. Use .hashes property instead.

get_mins(with_abundance=False)[source]¶: Return list of hashes or if with_abundance a list of (hash, abund).

Deprecated since version 3.5: This will be removed in 5.0. Use .hashes property instead.

inflate(from_mh)[source]¶

Return a new MinHash object with abundances taken from ‘from_mh’.

Note that this implicitly performs an intersection: hashes that have no abundance in ‘from_mh’ are set to abundance 0 and removed from ‘self’.
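
The implicit intersection performed by inflate can be sketched with a set and a dict. The `inflate` function below is a hypothetical stand-in that models a flat sketch as a set of hashes and an abundance-tracking sketch as a `{hash: abundance}` dict; it is not the sourmash method.

```python
def inflate(flat_hashes: set, abundances: dict) -> dict:
    """Give each hash in flat_hashes its abundance from `abundances`.

    Hashes with no (or zero) abundance in `abundances` are dropped,
    which is the implicit intersection described above.
    """
    return {h: abundances[h] for h in flat_hashes if abundances.get(h, 0) > 0}


flat = {101, 102, 103}
from_mh = {102: 5, 103: 1, 104: 9}

inflated = inflate(flat, from_mh)
```

Hash 101 is absent from the result because it has no abundance in `from_mh`, and hash 104 is absent because it was never in the flat sketch.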

intersection_and_union_size(other)[source]¶: Calculate intersection and union sizes between self and other.

into_frozen()[source]¶: Freeze this MinHash, preventing any changes.

jaccard(other, downsample=False)[source]¶: Calculate Jaccard similarity of two MinHash objects.

jaccard_ani(other, *, downsample=False, jaccard=None, prob_threshold=0.001, err_threshold=0.0001)[source]¶: Use jaccard to estimate ANI between two MinHash objects.

kmers_and_hashes(sequence, *, force=False, is_protein=False)[source]¶

Convert sequence into (k-mer, hashval) tuples without adding it to the sketch.

If input sequence is DNA and this is a protein, dayhoff, or hp MinHash, translate the DNA appropriately before hashing.

If input sequence is protein, set is_protein=True.

If ‘force’ is True, invalid k-mers will be represented with ‘None’.

max_containment(other, downsample=False)[source]¶: Calculate maximum containment.

max_containment_ani(other, *, downsample=False, max_containment=None, confidence=0.95, estimate_ci=False, prob_threshold=0.001)[source]¶: Use max_containment to estimate ANI between two MinHash objects.

property max_hash¶: Deprecated since version 3.5: This will be removed in 5.0. Use scaled instead.

remove_many(hashes)[source]¶

Remove many hashes from a sketch at once.

hashes can be either an iterable (list, set, etc.), or another MinHash object.

seq_to_hashes(sequence, *, force=False, bad_kmers_as_zeroes=False, is_protein=False)[source]¶

Convert sequence to hashes without adding to the sketch.

If input sequence is DNA and this is a protein, dayhoff, or hp MinHash, translate the DNA appropriately before hashing.

If input sequence is protein, set is_protein=True.

If force=True and bad_kmers_as_zeroes=True, the hashes of invalid k-mers will be represented as 0.

set_abundances(values, clear=True)[source]¶

Set abundances for hashes from values, where values[hash] = abund

If abund value is set to zero, the hash will be removed from the sketch. abund cannot be set to a negative value.

similarity(other, ignore_abundance=False, downsample=False)[source]¶

Calculate similarity of two sketches.

If the sketches are not abundance weighted, or ignore_abundance=True, compute Jaccard similarity.

If the sketches are abundance weighted, calculate the angular similarity, a similarity measure derived from the cosine similarity.

Note: because the term frequencies (tf-idf weights) cannot be negative, the angle will never be less than 0 degrees or greater than 90 degrees.

See https://en.wikipedia.org/wiki/Cosine_similarity
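
As a rough illustration of how an angular similarity can be derived from the cosine similarity, the sketch below works on `{hash: abundance}` dicts. The mapping `1 - 2 * angle / pi` is an assumption about how the angle is rescaled to the range [0, 1]; both functions are hypothetical helpers and not the sourmash implementation.

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine of the angle between two abundance vectors keyed by hash."""
    dot = sum(abund * b.get(h, 0) for h, abund in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def angular_similarity(a: dict, b: dict) -> float:
    """Rescale the angle (0 to 90 degrees) so that 1 means identical."""
    cos = min(1.0, max(0.0, cosine_similarity(a, b)))  # clamp rounding error
    return 1 - 2 * math.acos(cos) / math.pi


same = angular_similarity({1: 2, 2: 3}, {1: 2, 2: 3})
disjoint = angular_similarity({1: 2}, {2: 3})
half = angular_similarity({1: 1}, {1: 1, 2: 1})  # vectors 45 degrees apart
```

Because abundances are non-negative, the cosine lies between 0 and 1, so the angle stays between 0 and 90 degrees and the rescaled value stays between 0 and 1.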

size_is_accurate(relative_error=0.2, confidence=0.95)[source]¶: Computes the probability that the estimate sketch_size * scaled deviates from the true set_size by more than relative_error. This relies on the fact that sketch_size is binomially distributed with parameters set_size and 1/scaled. The two-sided Chernoff bounds are used. Returns True if the probability that the estimate stays within relative_error is greater than or equal to the desired confidence.
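
The kind of calculation described for size_is_accurate can be sketched with the textbook two-sided Chernoff bound, P(|X - mu| >= delta * mu) <= 2 * exp(-delta**2 * mu / 3), where mu = set_size / scaled is the expected sketch size. This is an assumption-laden illustration: the helper names are hypothetical, and the exact expression used inside sourmash is not confirmed here.

```python
import math


def prob_within_relative_error(set_size: float, scaled: int,
                               relative_error: float = 0.2) -> float:
    """Lower bound on P(estimate is within relative_error of set_size)."""
    expected_sketch_size = set_size / scaled
    failure_bound = 2 * math.exp(-(relative_error ** 2) * expected_sketch_size / 3)
    return max(0.0, 1 - failure_bound)


def size_is_accurate(sketch_size: int, scaled: int,
                     relative_error: float = 0.2,
                     confidence: float = 0.95) -> bool:
    """True if the bound says the size estimate is trustworthy enough."""
    estimated_set_size = sketch_size * scaled  # plug-in estimate of set_size
    bound = prob_within_relative_error(estimated_set_size, scaled, relative_error)
    return bound >= confidence


large_sketch_ok = size_is_accurate(sketch_size=300, scaled=1000)
small_sketch_ok = size_is_accurate(sketch_size=100, scaled=1000)
```

Under this bound, the result depends only on the number of hashes in the sketch: with the default thresholds, a sketch of 300 hashes passes while one of 100 hashes does not.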

to_frozen()[source]¶: Return a frozen copy of this MinHash that cannot be changed.

to_mutable()[source]¶: Return a copy of this MinHash that can be changed.

property unique_dataset_hashes¶: Approximate total number of hashes (num_hashes * scaled).

`SourmashSignature`: save and load MinHash sketches in JSON ¶

Save and load MinHash sketches in a JSON format, along with some metadata.

class sourmash.signature.FrozenSourmashSignature(minhash, name='', filename='')[source]¶

Frozen (immutable) signature class.

into_frozen()[source]¶: Freeze this signature, preventing attribute changes.

to_frozen()[source]¶: Return a frozen copy of this signature.

to_mutable()[source]¶: Turn this object into a mutable object.

update()[source]¶

Make a mutable copy of this signature for modification, then freeze.

This is a context manager that implements:

    new_sig = this_sig.copy()
    new_sig.to_mutable()
    # modify new_sig
    new_sig.into_frozen()

This could be made more efficient by _not_ copying the signature, but that is non-intuitive and leads to hard-to-find bugs.
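
The copy, modify, then freeze pattern behind update() can be sketched with a toy class. `ToySignature` below is a hypothetical stand-in written only for illustration; it is not the sourmash signature class and its freezing mechanism is an assumption.

```python
import copy
from contextlib import contextmanager


class ToySignature:
    """Hypothetical stand-in for a freezable signature object."""

    def __init__(self, name):
        object.__setattr__(self, "_frozen", False)
        self.name = name

    def __setattr__(self, key, value):
        # Refuse attribute changes once the object is frozen.
        if self._frozen:
            raise ValueError("signature is frozen")
        object.__setattr__(self, key, value)

    def into_frozen(self):
        object.__setattr__(self, "_frozen", True)

    def to_mutable(self):
        object.__setattr__(self, "_frozen", False)

    @contextmanager
    def update(self):
        new_sig = copy.copy(self)   # never touch the original
        new_sig.to_mutable()
        try:
            yield new_sig           # caller modifies the copy here
        finally:
            new_sig.into_frozen()   # the copy is frozen again on exit


sig = ToySignature("original")
sig.into_frozen()

with sig.update() as new_sig:
    new_sig.name = "renamed"
```

After the `with` block, `sig` is unchanged and still frozen, while `new_sig` carries the new name and is frozen as well.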

class sourmash.signature.SigInput(*values)[source]¶

class sourmash.signature.SourmashSignature(minhash, name='', filename='')[source]¶

Main class for signature information.

angular_similarity(other, downsample=False)[source]¶: Compute angular similarity with the other signature.

avg_containment(other, downsample=False)[source]¶: Calculate average containment. Note: this is the average of the two containments, not count_common / avg_denom.

avg_containment_ani(other, *, downsample=False)[source]¶: Calculate average containment ANI. Note: this is the average of the two containment ANIs, not the ANI computed from count_common / avg_denom.

contained_by(other, downsample=False)[source]¶: Compute containment by the other signature. Note: ignores abundance.

contained_by_weighted(other)[source]¶: Compute containment by the other signature. Weight by abundance in self.

containment_ani(other, *, downsample=False, containment=None, confidence=0.95, estimate_ci=False)[source]¶: Use containment to estimate ANI between two FracMinHash signatures.

display(location=None)[source]¶: Print a summary of this signature.

into_frozen()[source]¶: Freeze this signature, preventing attribute changes.

jaccard(other, downsample=False)[source]¶: Compute Jaccard similarity with the other MinHash signature.

jaccard_ani(other, *, downsample=False, jaccard=None, prob_threshold=0.001, err_threshold=0.0001)[source]¶: Use jaccard to estimate ANI between two FracMinHash signatures.

max_containment(other, downsample=False)[source]¶: Compute max containment w/other signature. Note: ignores abundance.

max_containment_ani(other, *, downsample=False, max_containment=None, confidence=0.95, estimate_ci=False)[source]¶: Use max containment to estimate ANI between two FracMinHash signatures.

md5sum()[source]¶: Calculate md5 hash of the bottom sketch, specifically.

similarity(other, ignore_abundance=False, downsample=False)[source]¶: Compute similarity with the other signature.

to_frozen()[source]¶: Return a frozen copy of this signature.

to_mutable()[source]¶: Return a mutable copy of this signature.

sourmash.signature.load_signatures_from_json(data, ksize=None, select_moltype=None, ignore_md5sum=False, do_raise=False)[source]¶

Load a JSON string with signatures into classes.

Returns iterator over SourmashSignature objects.

Note, the order is not necessarily the same as what is in the source file.

sourmash.signature.save_signatures_to_json(siglist, fp=None, compression=0)[source]¶: Save multiple signatures into a JSON string (or into file handle ‘fp’)

`SBT`: save and load Sequence Bloom Trees in JSON ¶

An implementation of sequence bloom trees, Solomon & Kingsford, 2015.

class sourmash.sbt.GraphFactory(ksize, starting_size, n_tables)[source]¶

Build new nodegraphs (Bloom filters) of a specific (fixed) size.

Parameters¶

ksize: int: k-mer size.
starting_size: int: size (in bytes) for each nodegraph table.
n_tables: int: number of nodegraph tables to be used.

init_args()[source]¶

class sourmash.sbt.Leaf(metadata, data=None, name=None, storage=None, path=None)[source]¶

property data¶

classmethod load(info, storage=None)[source]¶

make_manifest_row(location)[source]¶

save(path)[source]¶

unload()[source]¶

update(parent)[source]¶

class sourmash.sbt.Node(factory, name=None, path=None, storage=None)[source]¶

Internal node of SBT.

property data¶

static load(info, storage=None)[source]¶

save(path)[source]¶

unload()[source]¶

update(parent)[source]¶

class sourmash.sbt.NodePos(pos, node)¶

node¶: Alias for field number 1

pos¶: Alias for field number 0

class sourmash.sbt.SBT(factory, *, d=2, storage=None, cache_size=None)[source]¶

A Sequence Bloom Tree implementation allowing generic internal nodes and leaves.

The default node and leaf format is a Bloom Filter (like the original implementation), but we also provide a MinHash leaf class (the sourmash.sbtmh.SigLeaf class).

Parameters¶

factory: Factory: Callable for generating new datastores for internal nodes.
d: int: Number of children for each internal node. Defaults to 2 (a binary tree)
storage: Storage, default: None: A Storage is any place where we can save and load data for the nodes. If set to None, an FSStorage will be used.
cache_size: int, default None: Number of internal nodes to cache in memory. If set to None, will not remove any nodes from memory (cache grows without bounds).

Notes¶

We use two dicts to store the tree structure: One for the internal nodes, and another for the leaves (datasets).

add_node(node)[source]¶

child(parent, pos)[source]¶

Return a child node at position pos under the parent node.

Parameters¶

parent: int: Parent node position in the tree.
pos: int: Position of the child under the parent. Ranges over [0, arity - 1], where arity is the arity of the SBT (usually 2, a binary tree).

Returns¶

NodePos: A NodePos namedtuple with the position and content of the child node.

children(pos)[source]¶

Return all children nodes for node at position pos.

Parameters¶

pos: int: Position of the node in the tree.

Returns¶

list of NodePos: A list of NodePos namedtuples with the position and content of all children nodes.

combine(other)[source]¶

find(search_fn, query, **kwargs)[source]¶

Do a Jaccard similarity or containment search, yield results.

Here ‘search_fn’ should be an instance of ‘JaccardSearch’.

Queries with higher scaled values than the database can still be used for containment search, but not for similarity search. See SBT.select(…) for details.

insert(signature)[source]¶: Add a new SourmashSignature into the SBT.

is_database = True¶

leaves(with_pos=False, unload_data=True)[source]¶

classmethod load(location, *, leaf_loader=None, storage=None, print_version_warning=True, cache_size=None)[source]¶

Load an SBT description from a file.

Parameters¶

location: str: path to the SBT description.
leaf_loader: function, optional: function to load leaf nodes. Defaults to Leaf.load.
storage: Storage, optional: Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level as path).

Returns¶

SBT: the SBT tree built from the description.

property location¶: Return a resolvable location for this index, if possible.

new_node_pos(node)[source]¶

parent(pos)[source]¶

Return the parent of the node at position pos.

If it is the root node (position 0), returns None.

Parameters¶

pos: int: Position of the node in the tree.

Returns¶

NodePos: A NodePos namedtuple with the position and content of the parent node.
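
The relationship between child(), children(), and parent() positions can be sketched with plain integer arithmetic. This assumes the conventional array layout of a complete d-ary tree with the root at position 0; the documentation above states only that the root is position 0, so the formulas and the helper names below are assumptions for illustration.

```python
def child_position(parent: int, pos: int, d: int = 2) -> int:
    """Position of the pos-th child (0-based) of the node at `parent`."""
    if not 0 <= pos < d:
        raise ValueError(f"pos must be in [0, {d - 1}]")
    return d * parent + pos + 1


def children_positions(pos: int, d: int = 2) -> list:
    """Positions of all d children of the node at `pos`."""
    return [child_position(pos, i, d) for i in range(d)]


def parent_position(pos: int, d: int = 2):
    """Position of the parent, or None for the root (position 0)."""
    if pos == 0:
        return None
    return (pos - 1) // d


root_children = children_positions(0)            # binary tree
ternary_children = children_positions(1, d=3)    # three children per node
```

With this layout no explicit pointers are needed: a node's parent and children follow from its position and the arity d alone.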

print()[source]¶

print_dot()[source]¶

save(path, storage=None, sparseness=0.0, structure_only=False)[source]¶

Saves an SBT description locally and node data to a storage.

Parameters¶

path: str: path to where the SBT description should be saved.
storage: Storage, optional: Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level as path).
sparseness: float: Fraction of internal node data to skip when saving. Defaults to 0.0 (save all internal node data); can go up to 1.0 (don’t save any internal node data).
structure_only: boolean: Write only the index schema and metadata, but not the data. Defaults to False (save data too).

Returns¶

str: full path to the new SBT description

select(ksize=None, moltype=None, num=0, scaled=0, containment=False, abund=None, picklist=None, **kwargs)[source]¶

Make sure this database matches the requested requirements.

Will always raise ValueError if a requirement cannot be met.

The only tricky bit here is around downsampling: if the scaled value being requested is higher than the signatures in the SBT, we can use the SBT for containment but not for similarity. This is because:

if we are doing containment searches, the intermediate nodes can still be used for calculating containment of signatures with higher scaled values. This is because only hashes that match in the higher range are used for containment scores.
however, for similarity, _all_ hashes are used, and we cannot implicitly downsample or necessarily estimate similarity if the scaled values differ.

signatures()[source]¶: Return an iterator over all signatures in the Index object.

`sourmash.fig`: make plots and figures ¶

Make plots using the distance matrix+labels output by sourmash compare.

sourmash.fig.load_matrix_and_labels(basefile)[source]¶

Load the comparison matrix and associated labels.

Returns a square numpy matrix & list of labels.

sourmash.fig.plot_composite_matrix(D, labeltext, show_labels=True, vmax=1.0, vmin=0.0, force=False)[source]¶

Build a composite plot showing dendrogram + distance matrix/heatmap.

Returns a matplotlib figure.

If show_labels is True, display labels. Otherwise, no labels are shown on the plot.
