sourmash Python API
The primary programmatic way of interacting with sourmash is via its Python API. Please also see examples of using the API.
MinHash: basic MinHash sketch functionality
- class sourmash.MinHash(n, ksize, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)
The core sketch object for sourmash.
MinHash objects store and provide functionality for subsampled hash values from DNA, RNA, and amino acid sequences. MinHash supports both the standard MinHash behavior (bounded size, or num) and a non-standard "modulo hash" behavior (scaled). Please see the API examples for more information.
Basic usage:
>>> from sourmash import MinHash
>>> mh1 = MinHash(n=20, ksize=3)
>>> mh1.add_sequence('ATGAGAGACGATAGACAGATGAC')
>>> mh2 = MinHash(n=20, ksize=3)
>>> mh2.add_sequence('ATGAGActCGATAGaCAGATGAC')
>>> round(mh1.similarity(mh2), 2)
0.85
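The difference between num and scaled sketches can be illustrated without sourmash itself. The following is a minimal sketch under stated assumptions: it hashes k-mers with hashlib.blake2b as a stand-in for the MurmurHash function sourmash uses, it skips reverse-complement handling, and the helper names (kmers, hash_kmer, num_sketch, scaled_sketch) are hypothetical and not part of the sourmash API. It also assumes that a scaled sketch keeps hashes below roughly 2**64 / scaled.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space assumed in this sketch


def kmers(seq, k):
    """Yield every k-mer of an uppercased copy of seq."""
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in for MurmurHash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def num_sketch(seq, k, n):
    """Bounded-size sketch: keep the n smallest distinct hash values."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:n]


def scaled_sketch(seq, k, scaled):
    """Scaled sketch: keep every distinct hash below MAX_HASH // scaled."""
    max_hash = MAX_HASH // scaled
    distinct = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(h for h in distinct if h < max_hash)
```

A num sketch has a fixed maximum size regardless of input length, while a scaled sketch grows with the number of distinct k-mers and keeps, on average, about 1/scaled of them.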
- __init__(n, ksize, is_protein=False, dayhoff=False, hp=False, track_abundance=False, seed=42, max_hash=0, mins=None, scaled=0)
Create a sourmash.MinHash object.
- To create a standard (num) MinHash, use: MinHash(<num>, <ksize>, ...)
- To create a scaled MinHash, use: MinHash(0, <ksize>, scaled=<int>, ...)
- Optional arguments:
is_protein (default False) - aa k-mers
dayhoff (default False) - dayhoff encoding
hp (default False) - hydrophilic/hydrophobic aa
track_abundance (default False) - track hash multiplicity
mins (default None) - list of hashvals, or (hashval, abund) pairs
seed (default 42) - murmurhash seed
- Deprecated: @CTB max_hash=<int>; use scaled instead.
- add_many(hashes)
Add many hashes to the sketch at once.
hashes can be either an iterable (list, set, etc.), or another MinHash object.
- compare(other, downsample=False)
Calculate Jaccard similarity of two sketches.
Deprecated since version 3.3: This will be removed in 4.0. Use ‘similarity’ instead of compare.
- containment_ignore_maxhash(other)
Calculate contained_by, with downsampling.
Deprecated since version 3.3: This will be removed in 4.0. Use ‘contained_by’ with downsample=True instead.
- count_common(other, downsample=False)
Return the number of hashes in common between self and other.
Optionally downsample scaled objects to the highest scaled value.
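The downsampling step described here can be sketched in plain Python. This is an illustration only, not sourmash's implementation: sketches are modeled as plain sets of integer hashes, the function names and the downsample_ok parameter are hypothetical, and the cutoff of 2**64 // scaled is an assumption about how scaled maps onto the hash space.

```python
MAX_HASH = 2**64  # assumed size of the hash space


def downsample(hashes, new_scaled):
    """Keep only the hashes that a sketch with scaled=new_scaled would retain."""
    max_hash = MAX_HASH // new_scaled
    return {h for h in hashes if h < max_hash}


def count_common(hashes_a, scaled_a, hashes_b, scaled_b, downsample_ok=False):
    """Count shared hashes, optionally downsampling both to the highest scaled."""
    if scaled_a != scaled_b:
        if not downsample_ok:
            raise ValueError("mismatched scaled values; pass downsample_ok=True")
        target = max(scaled_a, scaled_b)  # highest scaled = smallest sketch
        hashes_a = downsample(hashes_a, target)
        hashes_b = downsample(hashes_b, target)
    return len(set(hashes_a) & set(hashes_b))
```

Downsampling to the higher scaled value works because the sketch with the higher value already holds a subset of the hash range covered by the other one, so no information has to be invented.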
- downsample_max_hash(*others)
Copy this object and downsample new object to min of *others.
Here, *others is one or more MinHash objects.
- downsample_scaled(new_scaled)
Copy this object and downsample new object to scaled=new_scaled.
- get_mins(with_abundance=False)
Return a list of hashes or, if with_abundance is set, a list of (hash, abund) pairs.
- intersection(other, in_common=False)
Calculate the intersection between self and other, and return (mins, size), where mins are the hashes in common and size is the number of hashes.
If in_common, return the actual hashes. Otherwise, mins will be empty.
Deprecated since version 3.3: This will be removed in 4.0. Use count_common or set methods instead.
- is_molecule_type(molecule)
Check if this MinHash is a particular human-readable molecule type.
Supports ‘protein’, ‘dayhoff’, ‘hp’, ‘DNA’. @CTB deprecate for 4.0?
- set_abundances(values, clear=True)
Set abundances for hashes from values, where values[hash] = abund
- similarity(other, ignore_abundance=False, downsample=False)
Calculate similarity of two sketches.
If the sketches are not abundance weighted, or ignore_abundance=True, compute Jaccard similarity.
If the sketches are abundance weighted, calculate the angular similarity, a distance metric based on the cosine similarity.
Note, because the term frequencies (tf-idf weights) cannot be negative, the angle will never be < 0deg or > 90deg.
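The two measures can be written out in a few lines of plain Python. This is a sketch under assumptions, not sourmash's code: sketches are modeled as sets of hashes (for Jaccard) or as dicts mapping hash to abundance (for the angular measure), the function names are hypothetical, and the angular similarity is taken to be 1 - 2θ/π, where θ is the angle whose cosine is the cosine similarity of the two abundance vectors.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two collections of hashes: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def angular_similarity(abund_a, abund_b):
    """Angular similarity of two {hash: abundance} dicts, in [0, 1]."""
    shared = abund_a.keys() & abund_b.keys()
    dot = sum(abund_a[h] * abund_b[h] for h in shared)
    norm_a = math.sqrt(sum(v * v for v in abund_a.values()))
    norm_b = math.sqrt(sum(v * v for v in abund_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Clamp to guard against floating-point values slightly above 1.
    cosine = min(1.0, dot / (norm_a * norm_b))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi
```

Because abundances are non-negative, the cosine lies in [0, 1], the angle lies between 0 and 90 degrees, and the result stays between 0 (no shared hashes) and 1 (identical abundance profiles up to scaling).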
SourmashSignature: save and load MinHash sketches in JSON
Save and load MinHash sketches in a JSON format, along with some metadata.
- class sourmash.signature.SourmashSignature(minhash, name='', filename='')
Main class for signature information.
SBT: save and load Sequence Bloom Trees in JSON
An implementation of sequence bloom trees, Solomon & Kingsford, 2015.
To try it out, do:
factory = GraphFactory(ksize, tablesizes, n_tables)
root = Node(factory)
graph1 = factory()
# ... add stuff to graph1 ...
leaf1 = Leaf("a", graph1)
root.insert(leaf1)
For example,
# filenames: list of fa/fq files
# ksize: k-mer size
# tablesizes: Bloom filter table sizes
# n_tables: Number of tables
factory = GraphFactory(ksize, tablesizes, n_tables)
root = Node(factory)
for filename in filenames:
    graph = factory()
    graph.consume_fasta(filename)
    leaf = Leaf(filename, graph)
    root.insert(leaf)
then define a search function,
def kmers(k, seq):
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]
def search_transcript(node, seq, threshold):
    presence = [ node.data.get(kmer) for kmer in kmers(ksize, seq) ]
    if sum(presence) >= int(threshold * len(seq)):
        return 1
    return 0
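To see the search function run without building a real tree, the snippet below swaps in toy stand-ins. ToyGraph and ToyNode are hypothetical classes invented for this illustration (a plain set instead of a Bloom filter), and ksize is passed explicitly rather than read from an enclosing scope; otherwise the logic mirrors the search function above.

```python
class ToyGraph:
    """Hypothetical stand-in for a nodegraph: exact set membership."""

    def __init__(self):
        self._kmers = set()

    def add(self, kmer):
        self._kmers.add(kmer)

    def get(self, kmer):
        # Return 1 if the k-mer was added, 0 otherwise.
        return 1 if kmer in self._kmers else 0


class ToyNode:
    """Hypothetical stand-in for a tree node that exposes a .data attribute."""

    def __init__(self, data):
        self.data = data


def kmers(k, seq):
    """Yield every k-mer of seq."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


def search_transcript(node, seq, threshold, ksize):
    """Return 1 if enough k-mers of seq are present in node.data, else 0."""
    presence = [node.data.get(kmer) for kmer in kmers(ksize, seq)]
    if sum(presence) >= int(threshold * len(seq)):
        return 1
    return 0


# Build a toy node holding every 3-mer of one sequence, then query it.
graph = ToyGraph()
for kmer in kmers(3, "ATGAGAGAC"):
    graph.add(kmer)
node = ToyNode(graph)
hit = search_transcript(node, "ATGAGAGAC", 0.5, ksize=3)
miss = search_transcript(node, "TTTTTTTTT", 0.5, ksize=3)
```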
- class sourmash.sbt.GraphFactory(ksize, starting_size, n_tables)
Build new nodegraphs (Bloom filters) of a specific (fixed) size.
- Parameters
ksize (int) – k-mer size.
starting_size (int) – size (in bytes) for each nodegraph table.
n_tables (int) – number of nodegraph tables to be used.
- class sourmash.sbt.Leaf(metadata, data=None, name=None, storage=None, path=None)
- property data
- class sourmash.sbt.Node(factory, name=None, path=None, storage=None)
Internal node of SBT.
- property data
- class sourmash.sbt.NodePos(pos, node)
- property node
Alias for field number 1
- property pos
Alias for field number 0
- class sourmash.sbt.SBT(factory, d=2, storage=None)
A Sequence Bloom Tree implementation allowing generic internal nodes and leaves.
The default node and leaf format is a Bloom Filter (like the original implementation), but we also provide a MinHash leaf class (in the sourmash.sbtmh.SigLeaf class).
- Parameters
factory (Factory) – Callable for generating new datastores for internal nodes.
d (int) – Number of children for each internal node. Defaults to 2 (a binary tree)
storage (Storage, default: None) – A Storage is any place where we can save and load data for the nodes. If set to None, will use a FSStorage.
Notes
We use two dicts to store the tree structure: One for the internal nodes, and another for the leaves (datasets).
- child(parent, pos)
Return a child node at position pos under the parent node.
- Parameters
parent (int) – Parent node position in the tree.
pos (int) – Position of the child node under the parent. Ranges over [0, arity - 1], where arity is the arity of the SBT (usually 2, a binary tree).
- Returns
A NodePos namedtuple with the position and content of the child node.
- Return type
- children(pos)
Return all child nodes of the node at position pos.
- Parameters
pos (int) – Position of the node in the tree.
- Returns
A list of NodePos namedtuples with the position and content of all children nodes.
- Return type
list of NodePos
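Addressing nodes by integer position, with the root at position 0, is commonly done with heap-style arithmetic. The sketch below assumes that layout for a tree of arity d; whether sourmash's SBT uses exactly these formulas is an assumption here, and the function names are hypothetical.

```python
def child_position(parent, pos, d=2):
    """Position of the pos-th child (0 <= pos < d) of the node at `parent`."""
    if not 0 <= pos < d:
        raise ValueError("pos must be in [0, d - 1]")
    return d * parent + pos + 1


def children_positions(pos, d=2):
    """Positions of all d children of the node at `pos`."""
    return [child_position(pos, i, d) for i in range(d)]


def parent_position(pos, d=2):
    """Position of the parent of the node at `pos`, or None for the root."""
    if pos == 0:
        return None
    return (pos - 1) // d
```

With d=2 the root's children sit at positions 1 and 2, and node 1's children at 3 and 4; the parent lookup simply inverts the child formula.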
- gather(query, *args, **kwargs)
Return the match with the best Jaccard containment in the database.
- classmethod load(location, leaf_loader=None, storage=None, print_version_warning=True)
Load an SBT description from a file.
- Parameters
location (str) – path to the SBT description.
leaf_loader (function, optional) – function to load leaf nodes. Defaults to Leaf.load.
storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path)
- Returns
the SBT tree built from the description.
- Return type
- parent(pos)
Return the parent of the node at position pos.
If it is the root node (position 0), returns None.
- Parameters
pos (int) – Position of the node in the tree.
- Returns
A NodePos namedtuple with the position and content of the parent node.
- Return type
- save(path, storage=None, sparseness=0.0, structure_only=False)
Saves an SBT description locally and node data to a storage.
- Parameters
path (str) – path to where the SBT description should be saved.
storage (Storage, optional) – Storage to be used for saving node data. Defaults to FSStorage (a hidden directory at the same level of path)
sparseness (float) – How much of the internal nodes should be saved. Defaults to 0.0 (save all internal nodes data), can go up to 1.0 (don’t save any internal nodes data)
structure_only (boolean) – Write only the index schema and metadata, but not the data. Defaults to False (save data too)
- Returns
full path to the new SBT description
- Return type
str
- search(query, *args, **kwargs)
Return set of matches with similarity above ‘threshold’.
Results will be sorted by similarity, highest to lowest.
- Optional arguments:
do_containment: default False. If True, use Jaccard containment.
best_only: default False. If True, allow optimizations that may discard matches better than threshold, but first match is guaranteed to be best.
ignore_abundance: default False. If True, and query signature and database support k-mer abundances, ignore those abundances.
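The thresholding and ordering behavior described above can be sketched with a linear scan over plain sets. This is an illustration only: it does not use the tree structure that makes SBT searches fast, the database is modeled as a hypothetical dict of name to hash set, and treating "above threshold" as greater than or equal is an assumption.

```python
def jaccard(query, match):
    """Jaccard similarity: shared hashes over the union."""
    union = query | match
    return len(query & match) / len(union) if union else 0.0


def containment(query, match):
    """Jaccard containment: fraction of the query found in the match."""
    return len(query & match) / len(query) if query else 0.0


def search(query, database, threshold, do_containment=False):
    """Return (score, name) pairs at or above threshold, best first."""
    score = containment if do_containment else jaccard
    results = []
    for name, hashes in database.items():
        value = score(query, hashes)
        if value >= threshold:
            results.append((value, name))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results
```

Containment is asymmetric: a small query fully contained in a large match scores 1.0, whereas its Jaccard similarity with that match can be low.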
sourmash.fig: make plots and figures
Make plots using the distance matrix+labels output by sourmash compare.