sourmash Python API

The primary programmatic way of interacting with sourmash is via its Python API.

Please also see examples of using the API.

MinHash: basic MinHash sketch functionality

class sourmash.MinHash(n, ksize, *, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]

The core sketch object for sourmash.

MinHash objects store subsampled hash values from DNA, RNA, and amino acid sequences and provide operations on them. MinHash supports both the standard MinHash behavior (a bounded-size sketch, set with num) and a non-standard “modulo hash” behavior (set with scaled). Please see the API examples for more information.

Basic usage:

>>> from sourmash import MinHash
>>> mh1 = MinHash(n=20, ksize=3)
>>> mh1.add_sequence('ATGAGAGACGATAGACAGATGAC')
>>> mh2 = MinHash(n=20, ksize=3)
>>> mh2.add_sequence('ATGAGActCGATAGaCAGATGAC')
>>> round(mh1.similarity(mh2), 2)
0.85

__init__(n, ksize, *, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]

Create a sourmash.MinHash object.

To create a standard (num) MinHash, use:

MinHash(<num>, <ksize>, ...)

To create a scaled MinHash, use:

MinHash(0, <ksize>, scaled=<int>, ...)

Optional arguments:
  • is_protein (default False) - aa k-mers

  • dayhoff (default False) - dayhoff encoding

  • hp (default False) - hydrophilic/hydrophobic aa

  • track_abundance (default False) - track hash multiplicity

  • mins (default None) - list of hashvals, or (hashval, abund) pairs

  • seed (default 42) - murmurhash seed
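The difference between a num sketch and a scaled sketch can be illustrated with a minimal pure-Python sketch. This is not sourmash's implementation: `toy_hash`, `num_sketch`, and `scaled_sketch` are hypothetical helpers, a 64-bit hash space is assumed, and BLAKE2 stands in for the MurmurHash function that sourmash uses.

```python
import hashlib

MAX_HASH_SPACE = 2**64  # assumption: hashes are 64-bit unsigned integers


def toy_hash(kmer: str) -> int:
    """Stand-in 64-bit hash (hypothetical; sourmash itself uses MurmurHash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def num_sketch(hashes, num):
    """Bounded-size behavior: keep the `num` smallest distinct hash values."""
    return sorted(set(hashes))[:num]


def scaled_sketch(hashes, scaled):
    """Scaled behavior: keep every distinct hash below 2**64 // scaled."""
    cutoff = MAX_HASH_SPACE // scaled
    return sorted(h for h in set(hashes) if h < cutoff)


seq = "ATGAGAGACGATAGACAGATGAC"
ksize = 5
hashes = [toy_hash(seq[i:i + ksize]) for i in range(len(seq) - ksize + 1)]

print(len(num_sketch(hashes, 4)))     # a num sketch never grows past 4 values
print(len(scaled_sketch(hashes, 2)))  # a scaled sketch grows with the input
```

A num sketch has a fixed maximum size regardless of input length, while a scaled sketch keeps a roughly constant fraction (1/scaled) of all distinct hashes.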

add_hash(h)[source]

Add a single hash value.

add_hash_with_abundance(h, a)[source]

Add a single hash value with an abundance.

add_kmer(kmer)[source]

Add a kmer into the sketch.

add_many(hashes)[source]

Add many hashes to the sketch at once.

hashes can be either an iterable (list, set, etc.), or another MinHash object.

add_protein(sequence)[source]

Add a protein sequence.

add_sequence(sequence, force=False)[source]

Add a sequence into the sketch.

angular_similarity(other)[source]

Calculate the angular similarity.

avg_containment(other, *, downsample=False)[source]

Calculate average containment. Note: this is the average of the two containments, not count_common / avg_denom.

avg_containment_ani(other, *, downsample=False, prob_threshold=0.001)[source]

Calculate average containment ANI. Note: this is the average of the two containment ANI values, not the ANI computed from count_common / avg_denom.

clear()[source]

Clears all hashes and abundances.

contained_by(other, downsample=False)[source]

Calculate how much of self is contained by other.

containment_ani(other, *, downsample=False, containment=None, confidence=0.95, estimate_ci=False, prob_threshold=0.001)[source]

Use self contained by other to estimate ANI between two MinHash objects.

copy()

Create a new copy of this MinHash.

copy_and_clear()[source]

Create an empty copy of this MinHash.

count_common(other, downsample=False)[source]

Return the number of hashes in common between self and other.

Optionally downsample scaled objects to highest scaled value.

downsample(*, num=None, scaled=None)[source]

Copy this object and downsample new object to either num or scaled.

flatten()[source]

If track_abundance=True, return a new flattened MinHash.

get_hashes()[source]

Return the list of hashes.

Deprecated since version 3.5: This will be removed in 5.0. Use .hashes property instead.

get_mins(with_abundance=False)[source]

Return a list of hashes or, if with_abundance is True, a list of (hash, abund) pairs.

Deprecated since version 3.5: This will be removed in 5.0. Use .hashes property instead.

inflate(from_mh)[source]

Return a new MinHash object with abundances taken from ‘from_mh’.

Note that this implicitly does an intersection: hashes that have no abundance in ‘from_mh’ are set to abundance 0 and removed from ‘self’.
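The implicit intersection performed by inflate can be pictured with plain Python containers. This is only an illustration of the described behavior, assuming a flat sketch is a set of hashes and an abundance sketch is a dict of hash to count; the `inflate` function below is a hypothetical stand-in, not the sourmash method.

```python
def inflate(flat_hashes: set, abundances: dict) -> dict:
    """Return {hash: abundance} for hashes that have an abundance in `abundances`.

    Hashes missing from `abundances` would get abundance 0, so they are dropped.
    """
    return {h: abundances[h] for h in flat_hashes if abundances.get(h, 0) > 0}


flat = {10, 20, 30}
abund_source = {20: 5, 30: 2, 40: 7}

# Hash 10 has no abundance in the source and is dropped; hash 40 is not in `flat`.
print(sorted(inflate(flat, abund_source).items()))  # [(20, 5), (30, 2)]
```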

intersection_and_union_size(other)[source]

Calculate intersection and union sizes between self and other.

into_frozen()[source]

Freeze this MinHash, preventing any changes.

jaccard(other, downsample=False)[source]

Calculate Jaccard similarity of two MinHash objects.

jaccard_ani(other, *, downsample=False, jaccard=None, prob_threshold=0.001, err_threshold=0.0001)[source]

Use jaccard to estimate ANI between two MinHash objects.

kmers_and_hashes(sequence, *, force=False, is_protein=False)[source]

Convert sequence into (k-mer, hashval) tuples without adding it to the sketch.

If input sequence is DNA and this is a protein, dayhoff, or hp MinHash, translate the DNA appropriately before hashing.

If input sequence is protein, set is_protein=True.

If ‘force’ is True, invalid k-mers will be represented with ‘None’.
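For DNA input, the (k-mer, hashval) pairs can be pictured with the following minimal sketch. It rests on several assumptions that the text above does not confirm: the input is upper-cased, each k-mer is hashed in canonical form (the lexicographically smaller of the k-mer and its reverse complement), and with force=True the hash of an invalid k-mer is reported as None. `toy_hash` is a hypothetical stand-in for the real hash function, and this `kmers_and_hashes` is not the sourmash method.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def toy_hash(kmer: str) -> int:
    """Stand-in 64-bit hash (hypothetical; sourmash itself uses MurmurHash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmers_and_hashes(sequence: str, ksize: int, force: bool = False):
    """Yield (k-mer, hash) pairs without storing anything in a sketch."""
    sequence = sequence.upper()
    for i in range(len(sequence) - ksize + 1):
        kmer = sequence[i:i + ksize]
        if set(kmer) <= set("ACGT"):
            revcomp = kmer.translate(COMPLEMENT)[::-1]
            # Hash the canonical form so both strands give the same value.
            yield kmer, toy_hash(min(kmer, revcomp))
        elif force:
            yield kmer, None  # assumption: invalid k-mers map to None
        else:
            raise ValueError(f"invalid DNA k-mer: {kmer}")


pairs = list(kmers_and_hashes("ATGAGAGACGATAGACAGATGAC", 3))
print(len(pairs))   # 21 k-mers of length 3 in a 23-base sequence
print(pairs[0][0])  # ATG
```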

max_containment(other, downsample=False)[source]

Calculate maximum containment.
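The set-based measures named by jaccard, contained_by, max_containment, and avg_containment can be written out on plain Python sets of hashes. This is a minimal sketch of the usual definitions, not sourmash's code; in particular, treating max containment as the intersection divided by the smaller sketch size is an assumption here.

```python
def jaccard(a: set, b: set) -> float:
    """Shared hashes divided by all distinct hashes in either sketch."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Fraction of a's hashes that also occur in b (a contained by b)."""
    return len(a & b) / len(a) if a else 0.0


def max_containment(a: set, b: set) -> float:
    """Containment measured against the smaller of the two sketches."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def avg_containment(a: set, b: set) -> float:
    """Mean of the two directional containments."""
    return (containment(a, b) + containment(b, a)) / 2


a = {1, 2, 3, 4}
b = {3, 4, 5, 6, 7, 8, 9, 10}
print(jaccard(a, b))          # 2 / 10 = 0.2
print(containment(a, b))      # 2 / 4 = 0.5
print(containment(b, a))      # 2 / 8 = 0.25
print(max_containment(a, b))  # 2 / 4 = 0.5
print(avg_containment(a, b))  # (0.5 + 0.25) / 2 = 0.375
```

Containment is asymmetric, which is why the maximum and the average of the two directions are offered as separate measures.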

max_containment_ani(other, *, downsample=False, max_containment=None, confidence=0.95, estimate_ci=False, prob_threshold=0.001)[source]

Use max_containment to estimate ANI between two MinHash objects.

property max_hash

Deprecated since version 3.5: This will be removed in 5.0. Use scaled instead.

remove_many(hashes)[source]

Remove many hashes from a sketch at once.

hashes can be either an iterable (list, set, etc.), or another MinHash object.

seq_to_hashes(sequence, *, force=False, bad_kmers_as_zeroes=False, is_protein=False)[source]

Convert sequence to hashes without adding to the sketch.

If input sequence is DNA and this is a protein, dayhoff, or hp MinHash, translate the DNA appropriately before hashing.

If input sequence is protein, set is_protein=True.

If force=True and bad_kmers_as_zeroes=True, the hashes of invalid k-mers will be represented as 0.

set_abundances(values, clear=True)[source]

Set abundances for hashes from values, where values[hash] = abund

If abund value is set to zero, the hash will be removed from the sketch. abund cannot be set to a negative value.

similarity(other, ignore_abundance=False, downsample=False)[source]

Calculate similarity of two sketches.

If the sketches are not abundance weighted, or ignore_abundance=True, compute Jaccard similarity.

If the sketches are abundance weighted, calculate the angular similarity, a distance metric based on the cosine similarity.

Note that because the term frequencies (tf-idf weights) cannot be negative, the angle will never be less than 0 degrees or greater than 90 degrees.

See https://en.wikipedia.org/wiki/Cosine_similarity
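One common way to turn cosine similarity into an angular similarity between 0 and 1 is sketched below for abundance-weighted sketches stored as dicts of hash to count. The scaling `1 - 2 * angle / pi` is an assumption about the exact formula, and `angular_similarity` here is a hypothetical pure-Python helper rather than the sourmash method.

```python
import math


def angular_similarity(a: dict, b: dict) -> float:
    """Angular similarity of two {hash: abundance} mappings."""
    dot = sum(abund * b.get(h, 0) for h, abund in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    cosine = min(1.0, dot / (norm_a * norm_b))  # guard against rounding above 1
    angle = math.acos(cosine)  # 0 to pi/2 because abundances are non-negative
    return 1 - 2 * angle / math.pi


same = {1: 3, 2: 1}
disjoint = {7: 4, 8: 2}
print(round(angular_similarity(same, same), 6))      # 1.0
print(round(angular_similarity(same, disjoint), 6))  # 0.0
```

Because the angle is confined to the range from 0 to 90 degrees, dividing by pi/2 maps it onto the unit interval.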

size_is_accurate(relative_error=0.2, confidence=0.95)[source]

Checks whether sketch_size * scaled is an accurate estimate of the true set_size. This relies on the fact that the sketch size is binomially distributed with parameters set_size and 1/scaled, so the two-sided Chernoff bounds give the probability that the estimate lies within relative_error of the true set_size. Returns True if that probability is greater than or equal to the desired confidence.
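The Chernoff-bound check can be sketched in a few lines. This is a minimal illustration under stated assumptions, not sourmash's code: the expected sketch size (set_size / scaled) is approximated by the observed sketch size, and the two-sided bound P(|X - mu| >= delta * mu) <= 2 * exp(-delta^2 * mu / 3) is used. Both function names are hypothetical.

```python
import math


def prob_within_relative_error(sketch_size: int, relative_error: float = 0.2) -> float:
    """Lower bound on the probability that the size estimate is within relative_error."""
    expected = sketch_size  # assumption: observed size approximates set_size / scaled
    tail = 2 * math.exp(-(relative_error ** 2) * expected / 3)
    return max(0.0, 1 - tail)


def size_is_accurate(sketch_size: int, relative_error: float = 0.2,
                     confidence: float = 0.95) -> bool:
    """True if the bound guarantees the requested confidence."""
    return prob_within_relative_error(sketch_size, relative_error) >= confidence


print(size_is_accurate(1000))  # True: large sketches give tight estimates
print(size_is_accurate(10))    # False: too few hashes for a 20% error bound
```

With the default arguments, the bound under these assumptions needs a sketch of roughly 277 hashes or more before it reports True.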

to_frozen()[source]

Return a frozen copy of this MinHash that cannot be changed.

to_mutable()[source]

Return a copy of this MinHash that can be changed.

property unique_dataset_hashes

Approximate total number of hashes (num_hashes * scaled).

SourmashSignature: save and load MinHash sketches in JSON

Save and load MinHash sketches in a JSON format, along with some metadata.

class sourmash.signature.FrozenSourmashSignature(minhash, name='', filename='')[source]

Frozen (immutable) signature class.

into_frozen()[source]

Freeze this signature, preventing attribute changes.

to_frozen()[source]

Return a frozen copy of this signature.

to_mutable()[source]

Turn this object into a mutable object.

update()[source]

Make a mutable copy of this signature for modification, then freeze.

This is a context manager that implements:

new_sig = this_sig.copy()
new_sig.to_mutable()
# modify new_sig
new_sig.into_frozen()

This could be made more efficient by _not_ copying the signature, but that is non-intuitive and leads to hard-to-find bugs.
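The copy, modify, then freeze pattern behind update() can be sketched with a toy class and contextlib. `ToySignature` is a hypothetical stand-in used only to show the shape of the pattern; it is not the sourmash signature class, and its freezing mechanism is an assumption.

```python
import copy
from contextlib import contextmanager


class ToySignature:
    """Hypothetical frozen object; attribute assignment fails while frozen."""

    def __init__(self, name):
        object.__setattr__(self, "name", name)
        object.__setattr__(self, "_frozen", True)

    def __setattr__(self, key, value):
        if self._frozen:
            raise ValueError("signature is frozen")
        object.__setattr__(self, key, value)

    @contextmanager
    def update(self):
        new_sig = copy.copy(self)                       # work on a copy
        object.__setattr__(new_sig, "_frozen", False)   # make the copy mutable
        try:
            yield new_sig                               # caller modifies the copy
        finally:
            object.__setattr__(new_sig, "_frozen", True)  # freeze it again


sig = ToySignature("original")
with sig.update() as new_sig:
    new_sig.name = "renamed"

print(sig.name, new_sig.name)  # original renamed
```

Working on a copy means the original object is never observed in a half-modified state.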

class sourmash.signature.SigInput(value)[source]

An enumeration.

class sourmash.signature.SourmashSignature(minhash, name='', filename='')[source]

Main class for signature information.

avg_containment(other, downsample=False)[source]

Calculate average containment. Note: this is the average of the two containments, not count_common / avg_denom.

avg_containment_ani(other, *, downsample=False)[source]

Calculate average containment ANI. Note: this is the average of the two containment ANI values, not the ANI computed from count_common / avg_denom.

contained_by(other, downsample=False)[source]

Compute containment by the other signature. Note: ignores abundance.

containment_ani(other, *, downsample=False, containment=None, confidence=0.95, estimate_ci=False)[source]

Use containment to estimate ANI between two FracMinHash signatures.

into_frozen()[source]

Freeze this signature, preventing attribute changes.

jaccard(other)[source]

Compute Jaccard similarity with the other MinHash signature.

jaccard_ani(other, *, downsample=False, jaccard=None, prob_threshold=0.001, err_threshold=0.0001)[source]

Use jaccard to estimate ANI between two FracMinHash signatures.

max_containment(other, downsample=False)[source]

Compute max containment with the other signature. Note: ignores abundance.

max_containment_ani(other, *, downsample=False, max_containment=None, confidence=0.95, estimate_ci=False)[source]

Use max containment to estimate ANI between two FracMinHash signatures.

md5sum()[source]

Calculate md5 hash of the bottom sketch, specifically.

similarity(other, ignore_abundance=False, downsample=False)[source]

Compute similarity with the other signature.

to_frozen()[source]

Return a frozen copy of this signature.

to_mutable()[source]

Return a mutable copy of this signature.

sourmash.signature.load_signatures(data, ksize=None, select_moltype=None, ignore_md5sum=False, do_raise=False)[source]

Load a JSON string with signatures into classes.

Returns iterator over SourmashSignature objects.

Note, the order is not necessarily the same as what is in the source file.

sourmash.signature.save_signatures(siglist, fp=None, compression=0)[source]

Save multiple signatures into a JSON string (or into file handle ‘fp’)

SBT: save and load Sequence Bloom Trees in JSON

An implementation of sequence bloom trees, Solomon & Kingsford, 2015.

class sourmash.sbt.GraphFactory(ksize, starting_size, n_tables)[source]

Build new nodegraphs (Bloom filters) of a specific (fixed) size.

Parameters

ksize: int

k-mer size.

starting_size: int

size (in bytes) for each nodegraph table.

n_tables: int

number of nodegraph tables to be used.
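The idea of a factory that hands out fresh, identically sized Bloom filters can be sketched as follows. `ToyBloomFilter` and `ToyGraphFactory` are hypothetical classes written for illustration; they use one byte per slot and salted BLAKE2 hashes, neither of which reflects how sourmash nodegraphs are actually implemented.

```python
import hashlib


class ToyBloomFilter:
    """Bloom filter with n_tables tables of table_size slots (one byte per slot)."""

    def __init__(self, ksize, table_size, n_tables):
        self.ksize = ksize
        self.table_size = table_size
        self.tables = [bytearray(table_size) for _ in range(n_tables)]

    def _positions(self, kmer):
        # One independent hash per table, derived by salting with the table index.
        for i in range(len(self.tables)):
            digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                                     salt=bytes([i])).digest()
            yield i, int.from_bytes(digest, "big") % self.table_size

    def add(self, kmer):
        for i, pos in self._positions(kmer):
            self.tables[i][pos] = 1

    def __contains__(self, kmer):
        return all(self.tables[i][pos] for i, pos in self._positions(kmer))


class ToyGraphFactory:
    """Callable that builds new, empty filters with fixed parameters."""

    def __init__(self, ksize, starting_size, n_tables):
        self.ksize = ksize
        self.starting_size = starting_size
        self.n_tables = n_tables

    def __call__(self):
        return ToyBloomFilter(self.ksize, self.starting_size, self.n_tables)


factory = ToyGraphFactory(ksize=3, starting_size=1024, n_tables=4)
node = factory()
node.add("ATG")
print("ATG" in node)       # True
print("ATG" in factory())  # False: each call returns a new, empty filter
```

Fixing the size in the factory keeps every internal node comparable, which is what allows filters to be combined bit by bit.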

init_args()[source]

class sourmash.sbt.Leaf(metadata, data=None, name=None, storage=None, path=None)[source]

property data

classmethod load(info, storage=None)[source]

make_manifest_row(location)[source]

save(path)[source]

unload()[source]

update(parent)[source]

class sourmash.sbt.Node(factory, name=None, path=None, storage=None)[source]

Internal node of SBT.

property data

static load(info, storage=None)[source]

save(path)[source]

unload()[source]

update(parent)[source]

class sourmash.sbt.NodePos(pos, node)

node

Alias for field number 1

pos

Alias for field number 0

class sourmash.sbt.SBT(factory, *, d=2, storage=None, cache_size=None)[source]

A Sequence Bloom Tree implementation allowing generic internal nodes and leaves.

The default node and leaf format is a Bloom filter (like the original implementation), but we also provide a MinHash leaf class (sourmash.sbtmh.SigLeaf).

Parameters

factory: Factory

Callable for generating new datastores for internal nodes.

d: int

Number of children for each internal node. Defaults to 2 (a binary tree)

storage: Storage, default: None

A Storage is any place where we can save and load data for the nodes. If set to None, will use a FSStorage.

cache_size: int, default None

Number of internal nodes to cache in memory. If set to None, will not remove any nodes from memory (cache grows without bounds).

Notes

We use two dicts to store the tree structure: One for the internal nodes, and another for the leaves (datasets).

add_node(node)[source]

child(parent, pos)[source]

Return a child node at position pos under the parent node.

Parameters

parent: int

Parent node position in the tree.

pos: int

Position of the child under the parent. Ranges over [0, arity - 1], where arity is the arity of the SBT (usually 2, a binary tree).

Returns

NodePos

A NodePos namedtuple with the position and content of the child node.

children(pos)[source]

Return all child nodes of the node at position pos.

Parameters

pos: int

Position of the node in the tree.

Returns

list of NodePos

A list of NodePos namedtuples with the position and content of all children nodes.

combine(other)[source]

find(search_fn, query, **kwargs)[source]

Do a Jaccard similarity or containment search, yield results.

Here ‘search_fn’ should be an instance of ‘JaccardSearch’.

Queries with higher scaled values than the database can still be used for containment search, but not for similarity search. See SBT.select(…) for details.

insert(signature)[source]

Add a new SourmashSignature into the SBT.

is_database = True

leaves(with_pos=False, unload_data=True)[source]

classmethod load(location, *, leaf_loader=None, storage=None, print_version_warning=True, cache_size=None)[source]

Load an SBT description from a file.

Parameters

location: str

path to the SBT description.

leaf_loader: function, optional

function to load leaf nodes. Defaults to Leaf.load.

storage: Storage, optional

Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path)

Returns

SBT

the SBT tree built from the description.

property location

Return a resolvable location for this index, if possible.

new_node_pos(node)[source]

parent(pos)[source]

Return the parent of the node at position pos.

If it is the root node (position 0), returns None.

Parameters

pos: int

Position of the node in the tree.

Returns

NodePos

A NodePos namedtuple with the position and content of the parent node.
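The relationship between child, children, and parent positions can be sketched with the usual array layout for a complete d-ary tree, where the root sits at position 0. The formulas below are an assumption about the layout (the text above only states that the root is at position 0 and that child offsets run from 0 to arity - 1), and the function names are hypothetical.

```python
def child_pos(parent: int, pos: int, d: int = 2) -> int:
    """Position of the child at offset `pos` (0 .. d-1) under `parent`."""
    return d * parent + pos + 1


def children_pos(parent: int, d: int = 2) -> list:
    """Positions of all d children of `parent`."""
    return [child_pos(parent, offset, d) for offset in range(d)]


def parent_pos(pos: int, d: int = 2):
    """Position of the parent, or None for the root at position 0."""
    if pos == 0:
        return None
    return (pos - 1) // d


print(children_pos(0))       # [1, 2]
print(children_pos(1))       # [3, 4]
print(parent_pos(4))         # 1
print(children_pos(1, d=3))  # [4, 5, 6]
```

With this layout no pointers are needed: positions alone determine the tree shape, so nodes can be stored in dicts keyed by position.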

print()[source]

print_dot()[source]

save(path, storage=None, sparseness=0.0, structure_only=False)[source]

Saves an SBT description locally and node data to a storage.

Parameters

path: str

path to where the SBT description should be saved.

storage: Storage, optional

Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path)

sparseness: float

How much of the internal nodes should be saved. Defaults to 0.0 (save all internal nodes data), can go up to 1.0 (don’t save any internal nodes data)

structure_only: boolean

Write only the index schema and metadata, but not the data. Defaults to False (save data too)

Returns

str

full path to the new SBT description

select(ksize=None, moltype=None, num=0, scaled=0, containment=False, abund=None, picklist=None)[source]

Make sure this database matches the requested requirements.

Will always raise ValueError if a requirement cannot be met.

The only tricky bit here is around downsampling: if the scaled value being requested is higher than that of the signatures in the SBT, we can use the SBT for containment but not for similarity. This is because:

  • if we are doing containment searches, the intermediate nodes can still be used for calculating containment of signatures with higher scaled values. This is because only hashes that match in the higher range are used for containment scores.

  • however, for similarity, _all_ hashes are used, and we cannot implicitly downsample or necessarily estimate similarity if the scaled values differ.
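The asymmetry between containment and similarity under mismatched scaled values can be checked on toy hash sets. This is a minimal sketch assuming a 64-bit hash space and a cutoff of 2**64 // scaled; `downsample`, `containment`, and `jaccard` are hypothetical helpers, not sourmash functions.

```python
MAX_HASH_SPACE = 2**64  # assumption: hashes are 64-bit unsigned integers


def downsample(hashes: set, scaled: int) -> set:
    """Keep only hashes below the cutoff for the given scaled value."""
    cutoff = MAX_HASH_SPACE // scaled
    return {h for h in hashes if h < cutoff}


def containment(query: set, subject: set) -> float:
    return len(query & subject) / len(query) if query else 0.0


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


db_full = {1, 2, 3, 2**62 + 1, 2**62 + 2, 2**63 + 5}
query_full = {1, 2, 9, 2**62 + 1, 2**63 + 5}

db = downsample(db_full, 2)        # database stored at scaled=2
query = downsample(query_full, 4)  # query sketched at the higher scaled=4

# Containment is unchanged whether or not the database is downsampled first,
# because every query hash already lies in the lower hash range.
print(containment(query, db) == containment(query, downsample(db, 4)))  # True

# Jaccard uses all hashes of both sketches, so the mismatch changes the answer.
print(jaccard(query, db), jaccard(query, downsample(db, 4)))
```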

signatures()[source]

Return an iterator over all signatures in the Index object.

sourmash.fig: make plots and figures

Make plots using the distance matrix+labels output by sourmash compare.

sourmash.fig.load_matrix_and_labels(basefile)[source]

Load the comparison matrix and associated labels.

Returns a square numpy matrix & list of labels.

sourmash.fig.plot_composite_matrix(D, labeltext, show_labels=True, vmax=1.0, vmin=0.0, force=False)[source]

Build a composite plot showing dendrogram + distance matrix/heatmap.

Returns a matplotlib figure.

If show_labels is True, display labels. Otherwise, no labels are shown on the plot.