`sourmash` Python API

The primary programmatic way of interacting with sourmash is via its Python API. Please also see examples of using the API.

`MinHash`: basic MinHash sketch functionality

class sourmash.MinHash(n, ksize, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]¶

The core sketch object for sourmash.

MinHash objects store and provide functionality for subsampled hash values from DNA, RNA, and amino acid sequences. MinHash supports both the standard MinHash behavior (a bounded-size sketch, or num) and a non-standard “modulo hash” behavior (scaled). Please see the API examples at

https://sourmash.readthedocs.io/en/latest/api-example.html#sourmash-minhash-objects-and-manipulations

for more information.
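The difference between the two behaviors can be sketched in plain Python. This is only an illustration of the idea, not sourmash's implementation: the helper names are hypothetical, `blake2b` stands in for the MurmurHash function that sourmash uses, the threshold `2**64 // scaled` is an assumed approximation, and reverse-complement handling is left out.

```python
import hashlib

MAX_HASH_SPACE = 2**64  # hashes are treated as 64-bit unsigned integers


def kmer_hashes(sequence, ksize):
    """Hash every k-mer of `sequence` to a 64-bit integer.

    Assumption: blake2b is a stand-in hash function, so these values
    will not match real sourmash hashes.
    """
    hashes = set()
    for start in range(len(sequence) - ksize + 1):
        kmer = sequence[start:start + ksize].upper()
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def num_sketch(hashes, n):
    """Bounded-size (num) sketch: keep the n smallest hash values."""
    return sorted(hashes)[:n]


def scaled_sketch(hashes, scaled):
    """Scaled sketch: keep every hash value below a fixed threshold."""
    threshold = MAX_HASH_SPACE // scaled
    return sorted(h for h in hashes if h < threshold)


all_hashes = kmer_hashes("ATGAGAGACGATAGACAGATGAC", ksize=3)
print(len(num_sketch(all_hashes, n=5)))          # never more than 5
print(len(scaled_sketch(all_hashes, scaled=2)))  # depends on the input
```

A num sketch has a fixed maximum size regardless of input length, while a scaled sketch keeps a roughly constant fraction of all distinct hashes, so it grows with the input.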

Basic usage:

>>> from sourmash import MinHash
>>> mh1 = MinHash(n=20, ksize=3)
>>> mh1.add_sequence('ATGAGAGACGATAGACAGATGAC')

>>> mh2 = MinHash(n=20, ksize=3)
>>> mh2.add_sequence('ATGAGActCGATAGaCAGATGAC')

>>> round(mh1.similarity(mh2), 2)
0.85

__init__(n, ksize, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]¶

Create a sourmash.MinHash object.

To create a standard (num) MinHash, use:

MinHash(<num>, <ksize>, ...)

To create a scaled MinHash, use

MinHash(0, <ksize>, scaled=<int>, ...)

Optional arguments:

is_protein (default False) - aa k-mers
dayhoff (default False) - dayhoff encoding
hp (default False) - hydrophilic/hydrophobic aa
track_abundance (default False) - track hash multiplicity
mins (default None) - list of hashvals, or (hashval, abund) pairs
seed (default 42) - murmurhash seed
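As a rough illustration of what a reduced amino acid alphabet such as the Dayhoff encoding does, the sketch below maps each residue to one of the six classic Dayhoff groups. The group letters (a to f) and the function name are arbitrary choices for this example and are not taken from sourmash.

```python
# Six Dayhoff groups of amino acids. The one-letter group codes are
# arbitrary labels chosen for this sketch.
DAYHOFF_GROUPS = {
    "a": "C",
    "b": "AGPST",
    "c": "DENQ",
    "d": "HKR",
    "e": "ILMV",
    "f": "FWY",
}

# Invert the table: amino acid -> group code.
AA_TO_DAYHOFF = {
    aa: code for code, residues in DAYHOFF_GROUPS.items() for aa in residues
}


def dayhoff_encode(protein):
    """Replace every amino acid with the code of its Dayhoff group."""
    return "".join(AA_TO_DAYHOFF[aa] for aa in protein.upper())


print(dayhoff_encode("MKV"))  # ede
```

Proteins that differ only by substitutions within a group encode to the same string, so their k-mers hash identically.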

Deprecated:

max_hash=<int>; use scaled instead.

add(kmer)[source]¶: Add a kmer into the sketch.

add_hash(h)[source]¶: Add a single hash value.

add_hash_with_abundance(h, a)[source]¶: Add a single hash value with an abundance.

add_many(hashes)[source]¶

Add many hashes to the sketch at once.

hashes can be either an iterable (list, set, etc.), or another MinHash object.

add_protein(sequence)[source]¶: Add a protein sequence.

add_sequence(sequence, force=False)[source]¶: Add a sequence into the sketch.

angular_similarity(other)[source]¶: Calculate the angular similarity.

clear()[source]¶: Clears all hashes and abundances.

compare(other, downsample=False)[source]¶: Calculate Jaccard similarity of two sketches.

Deprecated since version 3.3: This will be removed in 4.0. Use ‘similarity’ instead of compare.

contained_by(other, downsample=False)[source]¶: Calculate how much of self is contained by other.
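Containment is asymmetric, unlike Jaccard similarity. A minimal sketch on plain Python sets of hash values, assuming the usual set-based definitions (the function names are hypothetical, and the special handling that bounded-size sketches need is ignored):

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of the hashes in `a` that also occur in `b`."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


small = {1, 2, 3}
large = {1, 2, 3, 4, 5, 6}
print(jaccard(small, large))      # 0.5
print(containment(small, large))  # 1.0
print(containment(large, small))  # 0.5
```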

containment_ignore_maxhash(other)[source]¶: Calculate contained_by, with downsampling.

Deprecated since version 3.3: This will be removed in 4.0. Use ‘contained_by’ with downsample=True instead.

copy_and_clear()[source]¶: Create an empty copy of this MinHash.

count_common(other, downsample=False)[source]¶

Return the number of hashes in common between self and other.

Optionally downsample scaled objects to highest scaled value.
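Downsampling to the highest scaled value can be pictured as follows. This is a sketch on plain sets under the assumption that a scaled sketch keeps hashes below roughly `2**64 / scaled`; the helper names and signatures are hypothetical and do not mirror the MinHash methods.

```python
MAX_HASH_SPACE = 2**64  # 64-bit hash space


def downsample(hashes, new_scaled):
    """Keep only the hashes that a sketch at `new_scaled` would retain."""
    threshold = MAX_HASH_SPACE // new_scaled
    return {h for h in hashes if h < threshold}


def count_common(hashes_a, scaled_a, hashes_b, scaled_b):
    """Bring both sketches to the higher scaled value, then intersect."""
    target = max(scaled_a, scaled_b)
    return len(downsample(hashes_a, target) & downsample(hashes_b, target))


sketch_a = {10, 2**62 + 5, 2**63 - 1}  # built with scaled=2
sketch_b = {10, 20}                    # built with scaled=4
print(downsample(sketch_a, 4))                   # {10}
print(count_common(sketch_a, 2, sketch_b, 4))    # 1
```

A higher scaled value means a lower threshold, so downsampling can only discard hashes; it never needs access to the original sequence.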

downsample_max_hash(*others)[source]¶

Copy this object and downsample new object to min of *others.

Here, *others is one or more MinHash objects.

downsample_n(new_num)[source]¶: Copy this object and downsample new object to num=``new_num``.

downsample_scaled(new_scaled)[source]¶: Copy this object and downsample new object to scaled=``new_scaled``.

get_hashes()[source]¶: Return the list of hashes.

get_mins(with_abundance=False)[source]¶: Return a list of hashes or, if with_abundance is True, a list of (hash, abund).

intersection(other, in_common=False)[source]¶

Calculate the intersection between self and other, and return (mins, size) where mins are the hashes in common, and size is the number of hashes.

if in_common, return the actual hashes. Otherwise, mins will be empty.

Deprecated since version 3.3: This will be removed in 4.0. Use count_common or set methods instead.

is_molecule_type(molecule)[source]¶

Check if this MinHash is a particular human-readable molecule type.

Supports ‘protein’, ‘dayhoff’, ‘hp’, ‘DNA’.

jaccard(other, downsample=False)[source]¶: Calculate Jaccard similarity of two MinHash objects.

remove_many(hashes)[source]¶: Remove many hashes at once; hashes must be an iterable.

set_abundances(values, clear=True)[source]¶: Set abundances for hashes from values, where values[hash] = abund

similarity(other, ignore_abundance=False, downsample=False)[source]¶

Calculate similarity of two sketches.

If the sketches are not abundance weighted, or ignore_abundance=True, compute Jaccard similarity.

If the sketches are abundance weighted, calculate the angular similarity, a distance metric based on the cosine similarity.

Note that because the term frequencies (tf-idf weights) cannot be negative, the angle always lies between 0° and 90°.

See https://en.wikipedia.org/wiki/Cosine_similarity
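The abundance-weighted case can be sketched with plain dictionaries mapping hash to abundance. The formula used here, one minus the angle divided by 90°, is one common definition of angular similarity and is an assumption about the exact scaling; the function name and dictionary representation are for illustration only.

```python
import math


def angular_similarity(abund_a, abund_b):
    """Angular similarity of two {hash: abundance} dictionaries.

    Computes the cosine of the angle between the abundance vectors,
    converts it to an angle, and rescales so that 0 degrees gives 1.0
    and 90 degrees gives 0.0.
    """
    common = set(abund_a) & set(abund_b)
    dot = sum(abund_a[h] * abund_b[h] for h in common)
    norm_a = math.sqrt(sum(v * v for v in abund_a.values()))
    norm_b = math.sqrt(sum(v * v for v in abund_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Clamp to guard against rounding slightly above 1.0.
    cosine = min(1.0, dot / (norm_a * norm_b))
    angle = math.acos(cosine)  # radians, between 0 and pi/2
    return 1.0 - 2.0 * angle / math.pi


same = {101: 2, 202: 3}
print(angular_similarity(same, same))              # very close to 1.0
print(angular_similarity({101: 2}, {303: 5}))      # no shared hashes
```

Because abundances are non-negative, the cosine is never negative, which is why the angle stays within the first quadrant.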

subtract_mins(other)[source]¶: Get the list of mins in this MinHash, after removing the ones in other.

translate_codon(codon)[source]¶: Translate a codon into an amino acid.

update(other)[source]¶: Update this sketch from all the hashes in the other.

`SourmashSignature`: save and load MinHash sketches in JSON

Save and load MinHash sketches in a JSON format, along with some metadata.

class sourmash.signature.SigInput(value)[source]¶: An enumeration.

class sourmash.signature.SourmashSignature(minhash, name='', filename='')[source]¶

Main class for signature information.

contained_by(other, downsample=False)[source]¶: Compute containment by the other signature. Note: ignores abundance.

jaccard(other)[source]¶: Compute Jaccard similarity with the other MinHash signature.

md5sum()[source]¶: Calculate md5 hash of the bottom sketch, specifically.

name()[source]¶: Return as nice a name as possible, defaulting to md5 prefix.

similarity(other, ignore_abundance=False, downsample=False)[source]¶: Compute similarity with the other signature.

sourmash.signature.load_signatures(data, ksize=None, select_moltype=None, ignore_md5sum=False, do_raise=False, quiet=False)[source]¶

Load a JSON string with signatures into classes.

Returns list of SourmashSignature objects.

Note that the order is not necessarily the same as in the source file.

sourmash.signature.save_signatures(siglist, fp=None, compression=0)[source]¶: Save multiple signatures into a JSON string (or into file handle ‘fp’)

`SBT`: save and load Sequence Bloom Trees in JSON

An implementation of sequence bloom trees, Solomon & Kingsford, 2015.

To try it out, do:

factory = GraphFactory(ksize, tablesizes, n_tables)
root = Node(factory)

graph1 = factory()
# ... add stuff to graph1 ...
leaf1 = Leaf("a", graph1)
root.insert(leaf1)

For example,

# filenames: list of fa/fq files
# ksize: k-mer size
# tablesizes: Bloom filter table sizes
# n_tables: Number of tables

factory = GraphFactory(ksize, tablesizes, n_tables)
root = Node(factory)

for filename in filenames:
    graph = factory()
    graph.consume_fasta(filename)
    leaf = Leaf(filename, graph)
    root.insert(leaf)

then define a search function,

def kmers(k, seq):
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]

def search_transcript(node, seq, threshold):
    presence = [ node.data.get(kmer) for kmer in kmers(ksize, seq) ]
    if sum(presence) >= int(threshold * len(seq)):
        return 1
    return 0
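To experiment with the search function without building real Bloom filters, the following self-contained sketch swaps in an exact Python set for the node data. `SetNode` is a hypothetical stand-in, not a sourmash class, and this version compares against the number of query k-mers rather than the sequence length.

```python
def kmers(k, seq):
    """Yield every k-mer of seq, left to right."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


class SetNode:
    """Hypothetical node whose data is an exact set, not a Bloom filter."""

    def __init__(self, sequences, k):
        self.k = k
        self.data = set()
        for seq in sequences:
            self.data.update(kmers(k, seq))


def search_transcript(node, seq, threshold):
    """Return 1 if at least `threshold` of the query k-mers are in the node."""
    query = list(kmers(node.k, seq))
    if not query:
        return 0
    present = sum(1 for kmer in query if kmer in node.data)
    return 1 if present >= threshold * len(query) else 0


node = SetNode(["ATGAGAGACG"], k=3)
print(search_transcript(node, "ATGAGA", 0.9))  # 1
print(search_transcript(node, "TTTTTT", 0.5))  # 0
```

A real Bloom filter can report false positives, so a search over actual nodegraphs may return 1 in cases where this exact-set version returns 0.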

class sourmash.sbt.GraphFactory(ksize, starting_size, n_tables)[source]¶

Build new nodegraphs (Bloom filters) of a specific (fixed) size.

Parameters

ksize (int) – k-mer size.
starting_size (int) – size (in bytes) for each nodegraph table.
n_tables (int) – number of nodegraph tables to be used.

init_args()[source]¶

class sourmash.sbt.Leaf(metadata, data=None, name=None, storage=None, path=None)[source]¶

property data¶

classmethod load(info, storage=None)[source]¶

save(path)[source]¶

unload()[source]¶

update(parent)[source]¶

class sourmash.sbt.Node(factory, name=None, path=None, storage=None)[source]¶

Internal node of SBT.

property data¶

static load(info, storage=None)[source]¶

save(path)[source]¶

unload()[source]¶

update(parent)[source]¶

class sourmash.sbt.NodePos(pos, node)¶

property node¶: Alias for field number 1

property pos¶: Alias for field number 0

class sourmash.sbt.SBT(factory, d=2, storage=None)[source]¶

A Sequence Bloom Tree implementation allowing generic internal nodes and leaves.

The default node and leaf format is a Bloom Filter (like the original implementation), but we also provide a MinHash leaf class (sourmash.sbtmh.SigLeaf).

Parameters

factory (Factory) – Callable for generating new datastores for internal nodes.
d (int) – Number of children for each internal node. Defaults to 2 (a binary tree)
storage (Storage, default: None) – A Storage is any place where we can save and load data for the nodes. If set to None, will use a FSStorage.

Notes

We use two dicts to store the tree structure: One for the internal nodes, and another for the leaves (datasets).

add_node(node)[source]¶

child(parent, pos)[source]¶

Return a child node at position pos under the parent node.

Parameters

parent (int) – Parent node position in the tree.
pos (int) – Position of the child under the parent, in the range [0, arity - 1], where arity is the arity of the SBT (usually 2, a binary tree).

Returns

A NodePos namedtuple with the position and content of the child node.

Return type

NodePos

children(pos)[source]¶

Return all children nodes for node at position pos.

Parameters: pos (int) – Position of the node in the tree.
Returns: A list of NodePos namedtuples with the position and content of all children nodes.
Return type: list of NodePos

combine(other)[source]¶

find(search_fn, *args, **kwargs)[source]¶: Search the tree using search_fn.

gather(query, *args, **kwargs)[source]¶: Return the match with the best Jaccard containment in the database.

insert(signature)[source]¶: Add a new SourmashSignature into the SBT.

leaves(with_pos=False)[source]¶

classmethod load(location, leaf_loader=None, storage=None, print_version_warning=True)[source]¶

Load an SBT description from a file.

Parameters

location (str) – path to the SBT description.
leaf_loader (function, optional) – function to load leaf nodes. Defaults to Leaf.load.
storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path)

Returns

the SBT tree built from the description.

Return type

SBT

new_node_pos(node)[source]¶

parent(pos)[source]¶

Return the parent of the node at position pos.

If it is the root node (position 0), returns None.

Parameters: pos (int) – Position of the node in the tree.
Returns: A NodePos namedtuple with the position and content of the parent node.
Return type: NodePos
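The relationship between parent and child positions can be sketched with heap-style arithmetic. This assumes nodes are numbered level by level starting from the root at position 0, as in an array-backed heap; the text above confirms only the root position and the child index range, so treat the formulas and function names as illustrative.

```python
def child_pos(parent, pos, d=2):
    """Position of child number `pos` (0 .. d-1) under `parent`."""
    return d * parent + pos + 1


def children_pos(parent, d=2):
    """Positions of all d children of `parent`."""
    return [child_pos(parent, i, d) for i in range(d)]


def parent_pos(pos, d=2):
    """Position of the parent of `pos`, or None for the root."""
    if pos == 0:
        return None
    return (pos - 1) // d


print(children_pos(0))       # [1, 2]
print(children_pos(1))       # [3, 4]
print(parent_pos(4))         # 1
print(parent_pos(0))         # None
print(children_pos(1, d=3))  # [4, 5, 6]
```

With this numbering the tree structure needs no explicit pointers, which fits storing nodes in dicts keyed by position.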

print()[source]¶

print_dot()[source]¶

save(path, storage=None, sparseness=0.0, structure_only=False)[source]¶

Saves an SBT description locally and node data to a storage.

Parameters

path (str) – path to where the SBT description should be saved.
storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path)
sparseness (float) – How much of the internal nodes should be saved. Defaults to 0.0 (save all internal nodes data), can go up to 1.0 (don’t save any internal nodes data)
structure_only (boolean) – Write only the index schema and metadata, but not the data. Defaults to False (save data too)

Returns

full path to the new SBT description

Return type

str

search(query, *args, **kwargs)[source]¶

Return set of matches with similarity above ‘threshold’.

Results will be sorted by similarity, highest to lowest.

Optional arguments:

do_containment: default False. If True, use Jaccard containment.
best_only: default False. If True, allow optimizations that may discard matches better than threshold, but first match is guaranteed to be best.
ignore_abundance: default False. If True, and query signature and database support k-mer abundances, ignore those abundances.
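The result contract (matches at or above a threshold, sorted from highest to lowest) can be illustrated with a plain linear scan over sets of hashes. This sketch does not use the tree at all and is not how the SBT prunes its search; the function name, the dictionary database, and the tuple result format are assumptions for the example.

```python
def linear_search(query, database, threshold, do_containment=False):
    """Score every entry in `database` against `query`.

    database maps a name to a collection of hashes. Returns a list of
    (score, name) tuples with score >= threshold, best match first.
    """
    query = set(query)
    results = []
    for name, hashes in database.items():
        hashes = set(hashes)
        shared = len(query & hashes)
        # Containment divides by the query size, Jaccard by the union size.
        denom = len(query) if do_containment else len(query | hashes)
        score = shared / denom if denom else 0.0
        if score >= threshold:
            results.append((score, name))
    results.sort(key=lambda item: item[0], reverse=True)
    return results


database = {"a": {1, 2, 3, 4}, "b": {1, 2, 5, 6}, "c": {7, 8}}
query = {1, 2, 3, 4}
print([name for _, name in linear_search(query, database, 0.3)])
# ['a', 'b']
print([name for _, name in linear_search(query, database, 0.5)])
# ['a']
print([name for _, name in linear_search(query, database, 0.5, do_containment=True)])
# ['a', 'b']
```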

select(ksize=None, moltype=None)[source]¶

signatures()[source]¶: Return an iterator over all signatures in the Index object.

`sourmash.fig`: make plots and figures

Make plots using the distance matrix+labels output by sourmash compare.

sourmash.fig.load_matrix_and_labels(basefile)[source]¶

Load the comparison matrix and associated labels.

Returns a square numpy matrix & list of labels.

sourmash.fig.plot_composite_matrix(D, labeltext, show_labels=True, show_indices=True, vmax=1.0, vmin=0.0, force=False)[source]¶

Build a composite plot showing dendrogram + distance matrix/heatmap.

Returns a matplotlib figure.
