sourmash Python API
The primary programmatic way of interacting with sourmash is via its Python API. Please also see examples of using the API.
MinHash: basic MinHash sketch functionality
An implementation of a MinHash bottom sketch, applied to k-mers in DNA.
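To make the idea of a bottom sketch concrete, here is a minimal pure-Python sketch. It is not sourmash's implementation: the helper names (`hash_kmer`, `bottom_sketch`, `jaccard_estimate`) are hypothetical, SHA-1 is used only because it ships with the standard library, and reverse complements are ignored.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_sketch(seq, k, num):
    """Return the `num` smallest distinct k-mer hashes of seq, sorted."""
    hashes = {hash_kmer(kmer) for kmer in kmers(seq, k)}
    return sorted(hashes)[:num]


def jaccard_estimate(sketch_a, sketch_b, num):
    """Estimate Jaccard similarity from two bottom sketches.

    Take the `num` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:num]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


seq_a = "ATGGACCAGATATAGGGAGAGCCAGGTAGGACA"
seq_b = "ATGGACCAGATATTGGGAGAGCCAGGTAGGACA"  # one substitution
sk_a = bottom_sketch(seq_a, k=5, num=10)
sk_b = bottom_sketch(seq_b, k=5, num=10)
print(jaccard_estimate(sk_a, sk_a, 10))  # identical sketches give 1.0
print(jaccard_estimate(sk_a, sk_b, 10))
```

Keeping only the smallest hashes gives a fixed-size summary, so two sequences can be compared without storing all of their k-mers.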
SourmashSignature: save and load MinHash sketches in JSON
Save and load MinHash sketches in a JSON format, along with some metadata.
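As a rough illustration of storing a sketch together with metadata, the snippet below round-trips a list of hashes through JSON. The record layout and the helper names (`save_sketch`, `load_sketch`) are hypothetical and are not sourmash's actual signature format; only the `name` and `filename` fields mirror the `SourmashSignature` constructor arguments.

```python
import json


def save_sketch(hashes, ksize, name="", filename=""):
    """Serialize hashes plus metadata to a JSON string (hypothetical layout)."""
    record = {
        "name": name,
        "filename": filename,
        "ksize": ksize,
        "mins": sorted(hashes),
    }
    return json.dumps(record)


def load_sketch(text):
    """Parse a JSON string written by save_sketch back into a dict."""
    record = json.loads(text)
    # Fail early if a required field is missing.
    for key in ("name", "filename", "ksize", "mins"):
        if key not in record:
            raise ValueError(f"missing field: {key}")
    return record


text = save_sketch([42, 7, 19], ksize=31, name="sample", filename="sample.fa")
restored = load_sketch(text)
print(restored["mins"])
```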
class sourmash.signature.SourmashSignature(minhash, name='', filename='')
Main class for signature information.
SBT: save and load Sequence Bloom Trees in JSON
An implementation of Sequence Bloom Trees (Solomon & Kingsford, 2015).
To try it out, do:
factory = GraphFactory(ksize, tablesizes, n_tables)
root = Node(factory)
graph1 = factory()
# ... add stuff to graph1 ...
leaf1 = Leaf("a", graph1)
root.add_node(leaf1)
For example,
# filenames: list of fa/fq files
# ksize: k-mer size
# tablesizes: Bloom filter table sizes
# n_tables: Number of tables
factory = GraphFactory(ksize, tablesizes, n_tables)
root = Node(factory)
for filename in filenames:
    graph = factory()
    graph.consume_fasta(filename)
    leaf = Leaf(filename, graph)
    root.add_node(leaf)
then define a search function,
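def kmers(k, seq):
    # Yield every k-mer (substring of length k) in seq.
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]

def search_transcript(node, seq, threshold):
    # ksize is the k-mer size defined above; node.data is the node's Bloom filter.
    presence = [node.data.get(kmer) for kmer in kmers(ksize, seq)]
    # Match if at least a `threshold` fraction of the query k-mers is present.
    if sum(presence) >= int(threshold * len(presence)):
        return 1
    return 0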
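The way such a search function prunes the tree can be sketched with a toy stand-in. Everything below is an assumption for illustration: `ToyNode` and `find` are hypothetical names, plain Python sets replace Bloom filters, `ksize` is passed explicitly, and the threshold is counted against the number of query k-mers. It does not use the sourmash classes.

```python
class ToyNode:
    """Tree node whose `data` is a set of k-mers (stand-in for a Bloom filter)."""

    def __init__(self, name, data, children=()):
        self.name = name
        self.data = data
        self.children = list(children)


def kmers(k, seq):
    """Yield every k-mer (substring of length k) in seq."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


def search_transcript(node, seq, threshold, ksize):
    """Return True if enough of the query k-mers are present in node.data."""
    presence = [kmer in node.data for kmer in kmers(ksize, seq)]
    return sum(presence) >= int(threshold * len(presence))


def find(node, seq, threshold, ksize):
    """Collect names of matching leaves, skipping subtrees that cannot match."""
    if not search_transcript(node, seq, threshold, ksize):
        return []  # prune: no leaf below this node can pass the threshold
    if not node.children:
        return [node.name]
    matches = []
    for child in node.children:
        matches.extend(find(child, seq, threshold, ksize))
    return matches


ksize = 4
leaf_a = ToyNode("a", set(kmers(ksize, "ACGTACGTAC")))
leaf_b = ToyNode("b", set(kmers(ksize, "TTTTGGGGCC")))
# An internal node holds the union of the k-mers of its children.
root = ToyNode("root", leaf_a.data | leaf_b.data, [leaf_a, leaf_b])

print(find(root, "ACGTACG", 0.9, ksize))
```

Because an internal node contains every k-mer of the leaves below it, a failed test at that node rules out the whole subtree.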
class sourmash.sbt.GraphFactory(ksize, starting_size, n_tables)
Build new nodegraphs (Bloom filters) of a specific (fixed) size.
Parameters:
- ksize (int) – k-mer size.
- starting_size (int) – size (in bytes) for each nodegraph table.
- n_tables (int) – number of nodegraph tables to be used.
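The factory pattern described here (a callable that hands out fresh, identically sized Bloom filters) can be sketched in pure Python. `ToyBloomFilter` and `ToyGraphFactory` are hypothetical names, the hashing scheme is an arbitrary choice, and each slot uses a whole byte for simplicity; this is not the nodegraph implementation that sourmash relies on.

```python
import hashlib


class ToyBloomFilter:
    """A small Bloom filter with `n_tables` tables of `table_size` slots each."""

    def __init__(self, ksize, table_size, n_tables):
        self.ksize = ksize
        self.table_size = table_size
        self.tables = [bytearray(table_size) for _ in range(n_tables)]

    def _positions(self, kmer):
        # One slot per table, derived from a hash salted with the table index.
        for i in range(len(self.tables)):
            digest = hashlib.sha1(f"{i}:{kmer}".encode("ascii")).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.table_size

    def add(self, kmer):
        """Mark the k-mer as present in every table."""
        for i, pos in self._positions(kmer):
            self.tables[i][pos] = 1

    def get(self, kmer):
        """Return 1 if the k-mer may be present, 0 if it is definitely absent."""
        return int(all(self.tables[i][pos] for i, pos in self._positions(kmer)))


class ToyGraphFactory:
    """Callable that builds new, empty filters of one fixed size."""

    def __init__(self, ksize, starting_size, n_tables):
        self.ksize = ksize
        self.starting_size = starting_size
        self.n_tables = n_tables

    def __call__(self):
        return ToyBloomFilter(self.ksize, self.starting_size, self.n_tables)


factory = ToyGraphFactory(ksize=4, starting_size=1024, n_tables=3)
graph1 = factory()
graph2 = factory()
graph1.add("ACGT")
print(graph1.get("ACGT"), graph2.get("ACGT"))
```

Fixing the size in the factory means every filter in a tree has the same dimensions.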
class sourmash.sbt.Node(factory, name=None, path=None, storage=None)
Internal node of SBT.
data
class
sourmash.sbt.
SBT
(factory, d=2, storage=None)[source]¶ A Sequence Bloom Tree implementation allowing generic internal nodes and leaves.
The default node and leaf format is a Bloom filter (like the original implementation), but we also provide a MinHash leaf class (in sourmash.sbtmh.Leaf).
Parameters:
- factory (Factory) – Callable for generating new datastores for internal nodes.
- d (int) – Number of children for each internal node. Defaults to 2 (a binary tree).
- n_tables (int) – number of nodegraph tables to be used.
Notes
We use a defaultdict to store the tree structure. Nodes are identified by their integer position in the tree, with the root at position 0.
child(parent, pos)
Return a child node at position pos under the parent node.
Parameters:
- parent (int) – Parent node position in the tree.
- pos (int) – Position of the child under the parent. Ranges from 0 to arity - 1, where arity is the arity of the SBT (usually 2, a binary tree).
Returns: A NodePos namedtuple with the position and content of the child node.
Return type: NodePos
children(pos)
Return all children nodes for node at position pos.
Parameters:
- pos (int) – Position of the node in the tree.
Returns: A list of NodePos namedtuples with the position and content of all children nodes.
Return type: list of NodePos
classmethod load(location, leaf_loader=None, storage=None, print_version_warning=True)
Load an SBT description from a file.
Parameters:
- location (str) – path to the SBT description.
- leaf_loader (function, optional) – function to load leaf nodes. Defaults to Leaf.load.
- storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path).
Returns: the SBT tree built from the description.
Return type: SBT
parent(pos)
Return the parent of the node at position pos. If it is the root node (position 0), returns None.
Parameters:
- pos (int) – Position of the node in the tree.
Returns: A NodePos namedtuple with the position and content of the parent node.
Return type: NodePos
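The position arithmetic behind child, children and parent can be sketched as follows, assuming the common heap-style numbering in which the children of node p sit at positions d*p + 1 through d*p + d. That numbering, the `ToyTree` class and the `NodePos` field names (`pos`, `node`) are assumptions made for illustration; the source only states that positions are integers and that the root is at position 0.

```python
from collections import namedtuple

# Field names are assumed for this sketch.
NodePos = namedtuple("NodePos", ["pos", "node"])


class ToyTree:
    """A d-ary tree stored as a mapping from integer position to content."""

    def __init__(self, d=2):
        self.d = d
        self.nodes = {}  # position -> node content

    def child(self, parent, pos):
        """Return the child in slot `pos` (0 to d - 1) under `parent`."""
        child_pos = self.d * parent + pos + 1
        return NodePos(child_pos, self.nodes.get(child_pos))

    def children(self, pos):
        """Return all d child slots of the node at `pos`."""
        return [self.child(pos, i) for i in range(self.d)]

    def parent(self, pos):
        """Return the parent of the node at `pos`, or None for the root."""
        if pos == 0:
            return None
        parent_pos = (pos - 1) // self.d
        return NodePos(parent_pos, self.nodes.get(parent_pos))


tree = ToyTree(d=2)
tree.nodes = {0: "root", 1: "left", 2: "right"}
print(tree.child(0, 1))
print(tree.parent(2))
print(tree.children(1))  # empty slots report None as content
```

With this numbering no explicit parent or child pointers are needed: both directions follow from integer arithmetic on the position.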
save(path, storage=None, sparseness=0.0, structure_only=False)
Saves an SBT description locally and node data to a storage.
Parameters:
- path (str) – path to where the SBT description should be saved.
- storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path).
- sparseness (float) – Fraction of internal node data to leave unsaved. Defaults to 0.0 (save all internal node data) and can go up to 1.0 (save no internal node data).
- structure_only (boolean) – Write only the index schema and metadata, but not the data. Defaults to False (save data too).
Returns: full path to the new SBT description.
Return type: str
sourmash.fig: make plots and figures
Make plots using the distance matrix+labels output by sourmash compare.