sourmash Python API

The primary programmatic way of interacting with sourmash is via its Python API. Please also see examples of using the API.

MinHash: basic MinHash sketch functionality

An implementation of a MinHash bottom sketch, applied to k-mers in DNA.
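To make the idea of a bottom sketch concrete, here is a minimal pure-Python sketch of the technique. It is not sourmash's implementation: the hash function (`hashlib.sha1` truncated to 64 bits) and the helper names `kmers`, `hash_kmer`, and `bottom_sketch` are assumptions chosen for illustration, and reverse-complement handling is left out for brevity.

```python
import hashlib

def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]

def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (illustrative hash, not sourmash's)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")

def bottom_sketch(seq, k, num):
    """Return the num smallest distinct k-mer hashes of seq, in sorted order."""
    hashes = {hash_kmer(kmer) for kmer in kmers(seq, k)}
    return sorted(hashes)[:num]

sketch = bottom_sketch("ATGGCATTAGGCATTACCGA", k=4, num=5)
```

The sketch keeps only the `num` smallest hash values, so its size stays fixed no matter how long the input sequence is.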

SourmashSignature: save and load MinHash sketches in JSON

Save and load MinHash sketches in a JSON format, along with some metadata.

class sourmash.signature.SourmashSignature(minhash, name='', filename='')[source]

Main class for signature information.

contained_by(other, downsample=False)[source]

Compute containment by the other signature. Note: ignores abundance.

jaccard(other)[source]

Compute Jaccard similarity with the other MinHash signature.
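The two comparisons above, containment and Jaccard similarity, reduce to simple set arithmetic over hash values. The following is a minimal sketch on plain Python sets, ignoring abundance; the function names mirror the methods but are standalone illustrations, not the library's code. Estimating Jaccard from fixed-size bottom sketches additionally restricts the comparison to the smallest hashes of the union, which this sketch leaves out.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def contained_by(a, b):
    """Fraction of the hashes in a that also occur in b."""
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

query = [1, 2, 3, 4]
subject = [3, 4, 5, 6, 7, 8]

similarity = jaccard(query, subject)
containment = contained_by(query, subject)
```

Jaccard is symmetric, while containment is not: `contained_by(query, subject)` and `contained_by(subject, query)` generally differ.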

md5sum()[source]

Calculate md5 hash of the bottom sketch, specifically.

name()[source]

Return as nice a name as possible, defaulting to md5 prefix.
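The naming fallback can be sketched as follows. This is a hedged illustration only: the exact bytes fed to md5, the fallback to `filename`, and the prefix length of 8 are assumptions, and `sketch_md5` and `display_name` are hypothetical helper names rather than part of the API.

```python
import hashlib

def sketch_md5(ksize, hashes):
    """md5 over the k-mer size and the sketch's hash values (assumed layout)."""
    m = hashlib.md5()
    m.update(str(ksize).encode("ascii"))
    for h in hashes:
        m.update(str(h).encode("ascii"))
    return m.hexdigest()

def display_name(name, filename, ksize, hashes, prefix_len=8):
    """Prefer an explicit name, then a filename, then an md5 prefix."""
    if name:
        return name
    if filename:
        return filename
    return sketch_md5(ksize, hashes)[:prefix_len]

label = display_name("", "", 31, [11, 42, 97])
```

Because the digest depends only on the sketch contents, two signatures with identical sketches and no name get the same label.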

similarity(other, ignore_abundance=False, downsample=False)[source]

Compute similarity with the other MinHash signature.

sourmash.signature.load_signatures(data, ksize=None, select_moltype=None, ignore_md5sum=False, do_raise=False, quiet=False)[source]

Load a JSON string with signatures into classes.

Returns list of SourmashSignature objects.

Note, the order is not necessarily the same as what is in the source file.

sourmash.signature.save_signatures(siglist, fp=None)[source]

Save multiple signatures to a JSON string, or write them to the file handle ‘fp’ if one is given.

SBT: save and load Sequence Bloom Trees in JSON

An implementation of Sequence Bloom Trees (Solomon & Kingsford, 2015).

To try it out, do:

from sourmash.sbt import SBT, GraphFactory, Leaf

factory = GraphFactory(ksize, tablesizes, n_tables)
root = SBT(factory)  # add_node() is a method of SBT

graph1 = factory()
# ... add stuff to graph1 ...
leaf1 = Leaf("a", graph1)
root.add_node(leaf1)

For example,

from sourmash.sbt import SBT, GraphFactory, Leaf

# filenames: list of fa/fq files
# ksize: k-mer size
# tablesizes: Bloom filter table sizes
# n_tables: Number of tables

factory = GraphFactory(ksize, tablesizes, n_tables)
root = SBT(factory)  # add_node() is a method of SBT

for filename in filenames:
    graph = factory()
    graph.consume_fasta(filename)
    leaf = Leaf(filename, graph)
    root.add_node(leaf)

then define a search function,

def kmers(k, seq):
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]

def search_transcript(node, seq, threshold):
    presence = [ node.data.get(kmer) for kmer in kmers(ksize, seq) ]
    if sum(presence) >= int(threshold * len(seq)):
        return 1
    return 0
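To see the search function above in action without building real nodegraphs, the sketch below runs it against a stand-in node. `SetBackedData` and `FakeNode` are hypothetical helpers invented for this example: they provide only the `data.get(kmer)` interface the search function relies on, backed by an exact Python set instead of a Bloom filter.

```python
ksize = 4

def kmers(k, seq):
    """Yield every k-length substring of seq."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]

class SetBackedData:
    """Hypothetical stand-in for a nodegraph: exact k-mer membership."""
    def __init__(self, sequences):
        self.kmers = set()
        for seq in sequences:
            self.kmers.update(kmers(ksize, seq))

    def get(self, kmer):
        return 1 if kmer in self.kmers else 0

class FakeNode:
    """Hypothetical node exposing only the .data attribute."""
    def __init__(self, data):
        self.data = data

def search_transcript(node, seq, threshold):
    # Same logic as the search function defined above.
    presence = [node.data.get(kmer) for kmer in kmers(ksize, seq)]
    if sum(presence) >= int(threshold * len(seq)):
        return 1
    return 0

node = FakeNode(SetBackedData(["ATGGCATTAGGC"]))
hit = search_transcript(node, "GGCATTAG", 0.5)
miss = search_transcript(node, "CCCCCCCC", 0.5)
```

Note that the threshold is scaled by `len(seq)`, not by the number of k-mers (`len(seq) - k + 1`), so for k > 1 a threshold of 1.0 can never be met, even when every k-mer is present.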
class sourmash.sbt.GraphFactory(ksize, starting_size, n_tables)[source]

Build new nodegraphs (Bloom filters) of a specific (fixed) size.

Parameters:
  • ksize (int) – k-mer size.
  • starting_size (int) – size (in bytes) for each nodegraph table.
  • n_tables (int) – number of nodegraph tables to be used.
init_args()[source]
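The factory pattern described here can be illustrated with a toy Bloom filter. This is a sketch under stated assumptions, not the nodegraph implementation: `ToyBloomFilter` and `ToyGraphFactory` are hypothetical classes, the hashing scheme is invented for the example, and each table slot occupies one byte so that `starting_size` reads as a size in bytes.

```python
import hashlib

class ToyBloomFilter:
    """Toy Bloom filter with n_tables tables of table_size one-byte slots."""
    def __init__(self, ksize, table_size, n_tables):
        self.ksize = ksize
        self.table_size = table_size
        self.tables = [bytearray(table_size) for _ in range(n_tables)]

    def _slots(self, kmer):
        # One slot per table, derived from a table-specific hash.
        for i in range(len(self.tables)):
            digest = hashlib.sha1(f"{i}:{kmer}".encode("ascii")).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.table_size

    def add(self, kmer):
        for i, slot in self._slots(kmer):
            self.tables[i][slot] = 1

    def get(self, kmer):
        # Present only if the slot is set in every table.
        return int(all(self.tables[i][slot] for i, slot in self._slots(kmer)))

class ToyGraphFactory:
    """Callable that builds empty filters with one fixed configuration."""
    def __init__(self, ksize, starting_size, n_tables):
        self.ksize = ksize
        self.starting_size = starting_size
        self.n_tables = n_tables

    def __call__(self):
        return ToyBloomFilter(self.ksize, self.starting_size, self.n_tables)

factory = ToyGraphFactory(ksize=4, starting_size=1024, n_tables=3)
graph1 = factory()
graph2 = factory()
graph1.add("ATGG")
```

Every call to the factory returns a fresh, empty filter of the same shape, which is what lets all nodes of a tree share one configuration.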
class sourmash.sbt.Node(factory, name=None, path=None, storage=None)[source]

Internal node of SBT.

data
static load(info, storage=None)[source]
save(path)[source]
update(parent)[source]
class sourmash.sbt.NodePos(pos, node)
node

Alias for field number 1

pos

Alias for field number 0

class sourmash.sbt.SBT(factory, d=2, storage=None)[source]

A Sequence Bloom Tree implementation allowing generic internal nodes and leaves.

The default node and leaf format is a Bloom Filter (like the original implementation), but we also provide a MinHash leaf class (in the sourmash.sbtmh.Leaf class).

Parameters:
  • factory (Factory) – Callable for generating new datastores for internal nodes.
  • d (int) – Number of children for each internal node. Defaults to 2 (a binary tree)
  • storage (Storage, optional) – Storage to be used for node data. Defaults to None.

Notes

We use a defaultdict to store the tree structure. Nodes are numbered by their position in the tree, starting from the root at position 0.

add_node(node)[source]
child(parent, pos)[source]

Return a child node at position pos under the parent node.

Parameters:
  • parent (int) – Parent node position in the tree.
  • pos (int) – Position of the child under the parent. Ranges over [0, arity - 1], where arity is the number of children per internal node of the SBT (d, usually 2 for a binary tree).
Returns:

A NodePos namedtuple with the position and content of the child node.

Return type:

NodePos

children(pos)[source]

Return all children nodes for node at position pos.

Parameters:
  • pos (int) – Position of the node in the tree.
Returns:

A list of NodePos namedtuples with the position and content of all children nodes.

Return type:

list of NodePos

combine(other)[source]
find(search_fn, *args, **kwargs)[source]

Search the tree using search_fn.
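The general idea of searching a tree with a user-supplied function can be sketched as a breadth-first traversal that prunes any subtree whose root fails the test. The code below is an illustration on plain dictionaries and sets, not the SBT implementation; the standalone `find` and `contains` functions and the assumed position layout (children of `pos` at `d*pos + 1` through `d*pos + d`) are assumptions of this example.

```python
from collections import deque

def find(tree, leaf_positions, search_fn, *args, d=2):
    """Breadth-first search that skips subtrees whose root fails search_fn."""
    matches = []
    queue = deque([0])  # start at the root, position 0
    while queue:
        pos = queue.popleft()
        if pos not in tree:
            continue  # no node stored at this position
        if not search_fn(tree[pos], *args):
            continue  # prune: nothing below this node can match
        if pos in leaf_positions:
            matches.append(pos)
        else:
            # Assumed layout: children of pos sit at d*pos + 1 ... d*pos + d.
            queue.extend(d * pos + i + 1 for i in range(d))
    return matches

def contains(data, kmer):
    """Example search function: is the k-mer in this node's set?"""
    return kmer in data

# Leaves hold k-mer sets; each internal node holds the union of its children.
leaves = {2: {"TTAG", "ATGG"}, 3: {"ATGG", "TGGC"}, 4: {"GGCA"}}
tree = dict(leaves)
tree[1] = leaves[3] | leaves[4]
tree[0] = tree[1] | leaves[2]

hits = find(tree, set(leaves), contains, "ATGG")
```

Because every internal node summarizes everything beneath it, a failed test at an internal node rules out all of its descendants at once.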

leaves()[source]
classmethod load(location, leaf_loader=None, storage=None, print_version_warning=True)[source]

Load an SBT description from a file.

Parameters:
  • location (str) – path to the SBT description.
  • leaf_loader (function, optional) – function to load leaf nodes. Defaults to Leaf.load.
  • storage (Storage, optional) – Storage from which node data is loaded. Defaults to FSStorage (a hidden directory at the same level as location)
Returns:

the SBT tree built from the description.

Return type:

SBT

new_node_pos(node)[source]
parent(pos)[source]

Return the parent of the node at position pos.

If it is the root node (position 0), returns None.

Parameters:
  • pos (int) – Position of the node in the tree.
Returns:

A NodePos namedtuple with the position and content of the parent node.

Return type:

NodePos

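The position arithmetic behind child, children, and parent can be sketched with plain integers. This assumes the usual level-order numbering of a complete d-ary tree with the root at position 0; this page does not confirm that the SBT uses exactly this layout, and the functions below are standalone illustrations that return bare positions rather than NodePos tuples.

```python
def child(parent, pos, d=2):
    """Position of the pos-th child (0-based) of the node at parent."""
    return d * parent + pos + 1

def children(pos, d=2):
    """Positions of all d children of the node at pos."""
    return [child(pos, i, d) for i in range(d)]

def parent(pos, d=2):
    """Position of the parent of the node at pos, or None for the root."""
    if pos == 0:
        return None
    return (pos - 1) // d

binary_kids = children(1)
ternary_kids = children(1, d=3)
```

With this numbering no explicit pointers are needed: a dictionary keyed by position is enough to walk the tree in either direction.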
print()[source]
print_dot()[source]
save(path, storage=None, sparseness=0.0, structure_only=False)[source]

Saves an SBT description locally and node data to a storage.

Parameters:
  • path (str) – path to where the SBT description should be saved.
  • storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level as path)
  • sparseness (float) – Fraction of internal node data to skip when saving. Defaults to 0.0 (save all internal node data); can go up to 1.0 (save no internal node data)
  • structure_only (boolean) – Write only the index schema and metadata, but not the data. Defaults to False (save data too)
Returns:

full path to the new SBT description

Return type:

str

class sourmash.sbt.Leaf(metadata, data=None, name=None, storage=None, path=None)[source]
data
classmethod load(info, storage=None)[source]
save(path)[source]
update(parent)[source]

sourmash.fig: make plots and figures

Make plots using the distance matrix and labels output by sourmash compare.

sourmash.fig.load_matrix_and_labels(basefile)[source]

Load the comparison matrix and associated labels.

Returns a square numpy matrix and a list of labels.

sourmash.fig.plot_composite_matrix(D, labeltext, show_labels=True, show_indices=True, vmax=1.0, vmin=0.0, force=False)[source]

Build a composite plot showing dendrogram + distance matrix/heatmap.

Returns a matplotlib figure.