sourmash Python API

The primary programmatic way of interacting with sourmash is via its Python API. Please also see examples of using the API.

MinHash: basic MinHash sketch functionality

class sourmash.MinHash(n, ksize, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]

The core sketch object for sourmash.

MinHash objects store and provide functionality for subsampled hash values from DNA, RNA, and amino acid sequences. MinHash supports both the standard MinHash behavior (a bounded-size sketch, or num) and a non-standard “modulo hash” behavior (scaled). Please see the API examples for more information.
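The difference between the num and scaled behaviors can be illustrated with a minimal pure-Python sketch. This is not sourmash's implementation: hash_kmer is a hypothetical stand-in built on hashlib.blake2b rather than MurmurHash, the 2**64 hash range and the exact threshold rounding are assumptions, and reverse-complement handling for DNA is ignored.

```python
import hashlib

MAX_HASH = 2**64  # assumption: hash values are treated as 64-bit integers


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


def hash_kmer(kmer):
    """Hypothetical stand-in hash (not MurmurHash): 8-byte BLAKE2b digest."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def num_sketch(seq, k, num):
    """Bounded-size ("num") sketch: keep the `num` smallest distinct hashes."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:num]


def scaled_sketch(seq, k, scaled):
    """Modulo-hash ("scaled") sketch: keep every hash below MAX_HASH / scaled."""
    threshold = MAX_HASH // scaled
    return sorted({h for h in map(hash_kmer, kmers(seq, k)) if h < threshold})
```

In this sketch a num sketch never holds more than num hashes, however long the input is, while a scaled sketch grows with the number of distinct k-mers in the input.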

Basic usage:

>>> from sourmash import MinHash
>>> mh1 = MinHash(n=20, ksize=3)
>>> mh1.add_sequence('ATGAGAGACGATAGACAGATGAC')
>>> mh2 = MinHash(n=20, ksize=3)
>>> mh2.add_sequence('ATGAGActCGATAGaCAGATGAC')
>>> round(mh1.similarity(mh2), 2)
0.85

__init__(n, ksize, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)[source]

Create a sourmash.MinHash object.

To create a standard (num) MinHash, use:

MinHash(<num>, <ksize>, ...)

To create a scaled MinHash, use

MinHash(0, <ksize>, scaled=<int>, ...)

Optional arguments:
  • is_protein (default False) - aa k-mers

  • dayhoff (default False) - dayhoff encoding

  • hp (default False) - hydrophilic/hydrophobic aa

  • track_abundance (default False) - track hash multiplicity

  • mins (default None) - list of hashvals, or (hashval, abund) pairs

  • seed (default 42) - murmurhash seed

Deprecated:
  • max_hash=<int>; use scaled instead.

add(kmer)[source]

Add a kmer into the sketch.

add_hash(h)[source]

Add a single hash value.

add_many(hashes)[source]

Add many hashes to the sketch at once.

hashes can be either an iterable (list, set, etc.), or another MinHash object.

add_protein(sequence)[source]

Add a protein sequence.

add_sequence(sequence, force=False)[source]

Add a sequence into the sketch.

angular_similarity(other)[source]

Calculate the angular similarity.

compare(other, downsample=False)[source]

Calculate Jaccard similarity of two sketches.

contained_by(other, downsample=False)[source]

Calculate how much of self is contained by other.
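As an illustration of the idea only, containment can be sketched on plain Python sets of hash values. The function below is a hypothetical helper, not the sourmash method, and it ignores downsampling.

```python
def containment(a, b):
    """Fraction of the hashes in `a` that are also present in `b`."""
    a, b = set(a), set(b)
    if not a:
        return 0.0  # assumption: an empty sketch is contained by nothing
    return len(a & b) / len(a)
```

Unlike Jaccard similarity, this measure is asymmetric: containment(a, b) and containment(b, a) generally differ.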

containment_ignore_maxhash(other)[source]

Calculate contained_by, with downsampling.

copy_and_clear()[source]

Create an empty copy of this MinHash.

count_common(other, downsample=False)[source]

Return the number of hashes in common between self and other.

Optionally downsample scaled objects to highest scaled value.

downsample_max_hash(*others)[source]

Copy this object and downsample new object to min of *others.

Here, *others is one or more MinHash objects.

downsample_n(new_num)[source]

Copy this object and downsample new object to num=``new_num``.

downsample_scaled(new_scaled)[source]

Copy this object and downsample new object to scaled=``new_scaled``.
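Downsampling a scaled sketch can be pictured as discarding every hash above a lower threshold. The sketch below works on plain lists of integers and assumes a 64-bit hash range with a threshold of 2**64 // scaled; the exact rounding and the error raised for a lower scaled value are assumptions, not taken from the sourmash source.

```python
MAX_HASH = 2**64  # assumption: 64-bit hash range


def downsample_scaled(hashes, old_scaled, new_scaled):
    """Return a new sorted list holding only hashes valid at `new_scaled`."""
    if new_scaled < old_scaled:
        # Hashes dropped at the old scaled value cannot be recovered.
        raise ValueError("new scaled value is lower than the current one")
    threshold = MAX_HASH // new_scaled
    return sorted(h for h in set(hashes) if h < threshold)
```

The input is left untouched and a new list is returned, mirroring the "copy, then downsample" wording above.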

get_hashes()[source]

Return the list of hashes.

get_mins(with_abundance=False)[source]

Return a list of hashes or, if with_abundance is True, a list of (hash, abund) pairs.

intersection(other, in_common=False)[source]

Calculate the intersection between self and other, and return (mins, size), where mins are the hashes in common and size is the number of hashes.

If in_common is True, return the actual hashes; otherwise, mins will be empty.

is_molecule_type(molecule)[source]

Check if this MinHash is a particular human-readable molecule type.

Supports ‘protein’, ‘dayhoff’, ‘hp’, ‘DNA’.

jaccard(other, downsample=False)[source]

Calculate Jaccard similarity of two MinHash objects.
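For reference, the plain set form of Jaccard similarity is the size of the intersection divided by the size of the union. The helper below is a hypothetical illustration on Python sets; it does not reproduce how sourmash estimates the value from num or scaled sketches, and it ignores downsampling.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections of hashes."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sketches have similarity 0
    return len(a & b) / len(union)
```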

remove_many(hashes)[source]

Remove many hashes at once; hashes must be an iterable.

set_abundances(values)[source]

Set abundances for hashes from values, where values[hash] = abund.

similarity(other, ignore_abundance=False, downsample=False)[source]

Calculate similarity of two sketches.

If the sketches are not abundance weighted, or ignore_abundance=True, compute Jaccard similarity.

If the sketches are abundance weighted, calculate the angular similarity, a distance metric based on the cosine similarity.

Note that, because the term frequencies (tf-idf weights) cannot be negative, the angle is always between 0 and 90 degrees.

See https://en.wikipedia.org/wiki/Cosine_similarity
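One common way to turn cosine similarity into an angular similarity for non-negative vectors is 1 - 2θ/π, where θ is the angle between the two abundance vectors. The sketch below assumes that definition and represents each sketch as a hypothetical dict mapping hash to abundance; it is an illustration, not the sourmash code.

```python
import math


def angular_similarity(a, b):
    """Angular similarity of two {hash: abundance} dicts, in [0, 1]."""
    dot = sum(abund * b[h] for h, abund in a.items() if h in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty sketch is similar to nothing
    # Clamp to 1.0 so floating-point rounding cannot push acos out of range.
    cosine = min(1.0, dot / (norm_a * norm_b))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi
```

With non-negative abundances the cosine lies in [0, 1], so the angle stays between 0 and 90 degrees and the result stays between 0 and 1.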

subtract_mins(other)[source]

Get the list of mins in this MinHash, after removing the ones in other.

translate_codon(codon)[source]

Translate a codon into an amino acid.
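As an illustration, codon translation under the standard genetic code can be sketched with a lookup table. The names below are hypothetical, and raising ValueError for unknown or ambiguous codons is an assumption of this sketch; the sourmash method may handle such input differently.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order with the first base varying slowest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Translate a three-letter DNA or RNA codon; '*' marks a stop codon."""
    key = codon.upper().replace("U", "T")
    if key not in CODON_TABLE:
        raise ValueError(f"unrecognized codon: {codon!r}")
    return CODON_TABLE[key]
```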

update(other)[source]

Update this sketch from all the hashes in the other.

SourmashSignature: save and load MinHash sketches in JSON

Save and load MinHash sketches in a JSON format, along with some metadata.

class sourmash.signature.SourmashSignature(minhash, name='', filename='')[source]

Main class for signature information.

contained_by(other, downsample=False)[source]

Compute containment by the other signature. Note: ignores abundance.

jaccard(other)[source]

Compute Jaccard similarity with the other MinHash signature.

md5sum()[source]

Calculate md5 hash of the bottom sketch, specifically.

name()[source]

Return as nice a name as possible, defaulting to md5 prefix.

similarity(other, ignore_abundance=False, downsample=False)[source]

Compute similarity with the other signature.

sourmash.signature.load_signatures(data, ksize=None, select_moltype=None, ignore_md5sum=False, do_raise=False, quiet=False)[source]

Load a JSON string with signatures into classes.

Returns list of SourmashSignature objects.

Note, the order is not necessarily the same as what is in the source file.

sourmash.signature.save_signatures(siglist, fp=None)[source]

Save multiple signatures into a JSON string (or into file handle ‘fp’).

SBT: save and load Sequence Bloom Trees in JSON

An implementation of sequence bloom trees, Solomon & Kingsford, 2015.

To try it out, do:

from sourmash.sbt import GraphFactory, Node, Leaf

factory = GraphFactory(ksize, tablesizes, n_tables)
root = Node(factory)

graph1 = factory()
# ... add stuff to graph1 ...
leaf1 = Leaf("a", graph1)
root.insert(leaf1)

For example,

# filenames: list of fa/fq files
# ksize: k-mer size
# tablesizes: Bloom filter table sizes
# n_tables: Number of tables

factory = GraphFactory(ksize, tablesizes, n_tables)
root = Node(factory)

for filename in filenames:
    graph = factory()
    graph.consume_fasta(filename)
    leaf = Leaf(filename, graph)
    root.insert(leaf)

then define a search function,

def kmers(k, seq):
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]

def search_transcript(node, seq, threshold):
    presence = [ node.data.get(kmer) for kmer in kmers(ksize, seq) ]
    if sum(presence) >= int(threshold * len(seq)):
        return 1
    return 0
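The search function above decides whether a single node passes the threshold. How that test prunes a whole tree can be shown with a toy, self-contained sketch in which plain Python sets stand in for Bloom filters. ToyNode, build_tree and search are hypothetical names for this illustration and are not part of sourmash.

```python
def kmers(k, seq):
    """Yield every overlapping k-mer of seq."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


class ToyNode:
    """Tree node holding a set of k-mers (a stand-in for a Bloom filter)."""

    def __init__(self, name=None):
        self.name = name      # set only on leaves
        self.kmers = set()
        self.children = []


def build_tree(datasets, k):
    """Build a binary tree bottom-up; each parent holds the union of its children."""
    level = []
    for name, seq in datasets.items():
        leaf = ToyNode(name)
        leaf.kmers.update(kmers(k, seq))
        level.append(leaf)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            parent = ToyNode()
            for child in level[i:i + 2]:
                parent.children.append(child)
                parent.kmers |= child.kmers
            next_level.append(parent)
        level = next_level
    return level[0]


def search(node, query_kmers, threshold):
    """Return names of leaves holding at least `threshold` of the query k-mers."""
    present = sum(1 for km in query_kmers if km in node.kmers)
    if present < threshold * len(query_kmers):
        return []             # prune: no leaf below this node can pass
    if not node.children:
        return [node.name]
    hits = []
    for child in node.children:
        hits.extend(search(child, query_kmers, threshold))
    return hits
```

Because every internal node contains the union of its children, a node that fails the threshold rules out its entire subtree. With real Bloom filters the membership test can return false positives, so a search may visit extra nodes.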

class sourmash.sbt.GraphFactory(ksize, starting_size, n_tables)[source]

Build new nodegraphs (Bloom filters) of a specific (fixed) size.

Parameters
  • ksize (int) – k-mer size.

  • starting_size (int) – size (in bytes) for each nodegraph table.

  • n_tables (int) – number of nodegraph tables to be used.

init_args()[source]

class sourmash.sbt.Node(factory, name=None, path=None, storage=None)[source]

Internal node of SBT.

property data
static load(info, storage=None)[source]
save(path)[source]
unload()[source]
update(parent)[source]

class sourmash.sbt.NodePos(pos, node)

property node

Alias for field number 1

property pos

Alias for field number 0

class sourmash.sbt.SBT(factory, d=2, storage=None)[source]

A Sequence Bloom Tree implementation allowing generic internal nodes and leaves.

The default node and leaf format is a Bloom Filter (like the original implementation), but we also provide a MinHash leaf class (in the sourmash.sbtmh.SigLeaf class).

Parameters
  • factory (Factory) – Callable for generating new datastores for internal nodes.

  • d (int) – Number of children for each internal node. Defaults to 2 (a binary tree).

  • storage (Storage, default: None) – A Storage is any place where we can save and load data for the nodes. If set to None, will use a FSStorage.

Notes

We use two dicts to store the tree structure: One for the internal nodes, and another for the leaves (datasets).

add_node(node)[source]
child(parent, pos)[source]

Return a child node at position pos under the parent node.

Parameters
  • parent (int) – Parent node position in the tree.

  • pos (int) – Position of the child under the parent. Ranges over [0, arity - 1], where arity is the arity of the SBT (usually 2, a binary tree).

Returns

A NodePos namedtuple with the position and content of the child node.

Return type

NodePos

children(pos)[source]

Return all children nodes for node at position pos.

Parameters

pos (int) – Position of the node in the tree.

Returns

A list of NodePos namedtuples with the position and content of all children nodes.

Return type

list of NodePos

combine(other)[source]
find(search_fn, *args, **kwargs)[source]

Search the tree using search_fn.

gather(query, *args, **kwargs)[source]

Return the match with the best Jaccard containment in the Index.

insert(signature)[source]

Add a new SourmashSignature into the SBT.

leaves(with_pos=False)[source]
classmethod load(location, leaf_loader=None, storage=None, print_version_warning=True)[source]

Load an SBT description from a file.

Parameters
  • location (str) – path to the SBT description.

  • leaf_loader (function, optional) – function to load leaf nodes. Defaults to Leaf.load.

  • storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level as path).

Returns

the SBT tree built from the description.

Return type

SBT

new_node_pos(node)[source]
parent(pos)[source]

Return the parent of the node at position pos.

If it is the root node (position 0), returns None.

Parameters

pos (int) – Position of the node in the tree.

Returns

A NodePos namedtuple with the position and content of the parent node.

Return type

NodePos
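The position arithmetic behind child, children and parent can be sketched with the usual implicit layout for a complete d-ary tree, where the root sits at position 0. That the SBT numbers its nodes exactly this way is an assumption of this sketch; the function names are hypothetical, and the sketch returns bare positions rather than NodePos tuples.

```python
def child_position(parent, pos, d=2):
    """Position of child number `pos` (0-based) under `parent`."""
    if not 0 <= pos < d:
        raise ValueError("pos must be in the range [0, d - 1]")
    return d * parent + pos + 1


def children_positions(pos, d=2):
    """Positions of all `d` children of the node at `pos`."""
    return [child_position(pos, i, d) for i in range(d)]


def parent_position(pos, d=2):
    """Position of the parent of `pos`, or None for the root (position 0)."""
    if pos == 0:
        return None
    return (pos - 1) // d
```

For a binary tree this gives children 1 and 2 for the root, 3 and 4 for node 1, and so on, so no explicit parent or child pointers are needed.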

print()[source]
print_dot()[source]
save(path, storage=None, sparseness=0.0, structure_only=False)[source]

Saves an SBT description locally and node data to a storage.

Parameters
  • path (str) – path to where the SBT description should be saved.

  • storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level as path).

  • sparseness (float) – Fraction of internal nodes whose data should not be saved. Defaults to 0.0 (save all internal node data); can go up to 1.0 (don’t save any internal node data).

  • structure_only (boolean) – Write only the index schema and metadata, but not the data. Defaults to False (save data too).

Returns

full path to the new SBT description

Return type

str

search(query, *args, **kwargs)[source]

Return set of matches with similarity above ‘threshold’.

Results will be sorted by similarity, highest to lowest.

Optional arguments accepted by all Index subclasses:
  • do_containment: default False. If True, use Jaccard containment.

  • best_only: default False. If True, allow optimizations that may discard matches better than threshold, but the first match is guaranteed to be the best.

  • ignore_abundance: default False. If True, and query signature and database support k-mer abundances, ignore those abundances.

Note, the “best only” hint is ignored by LinearIndex.
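The search contract described above (keep matches at or above a threshold, sort from highest to lowest, optionally score by containment) can be sketched as a plain linear scan over sets of hashes. linear_search is a hypothetical helper for illustration; it is not the sourmash Index API, returns a list of (score, name) pairs, and ignores abundances and the best_only hint.

```python
def linear_search(query, database, threshold, do_containment=False):
    """Return (score, name) pairs scoring >= threshold, best first.

    `database` maps a name to a collection of hash values.
    """
    query = set(query)
    results = []
    for name, hashes in database.items():
        hashes = set(hashes)
        shared = len(query & hashes)
        # Containment divides by the query size, Jaccard by the union size.
        denom = len(query) if do_containment else len(query | hashes)
        score = shared / denom if denom else 0.0
        if score >= threshold:
            results.append((score, name))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results
```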

signatures()[source]

Return an iterator over all signatures in the Index object.

class sourmash.sbt.Leaf(metadata, data=None, name=None, storage=None, path=None)[source]
property data
classmethod load(info, storage=None)[source]
save(path)[source]
unload()[source]
update(parent)[source]

sourmash.fig: make plots and figures

Make plots using the distance matrix and labels output by sourmash compare.

sourmash.fig.load_matrix_and_labels(basefile)[source]

Load the comparison matrix and associated labels.

Returns a square numpy matrix & list of labels.

sourmash.fig.plot_composite_matrix(D, labeltext, show_labels=True, show_indices=True, vmax=1.0, vmin=0.0, force=False)[source]

Build a composite plot showing dendrogram + distance matrix/heatmap.

Returns a matplotlib figure.