sourmash Python examples

A very simple example: two k-mers

Define two sequences:

>>> seq1 = "ATGGCA"
>>> seq2 = "AGAGCA"

Create two estimators using 3-mers, and add the sequences:

>>> import sourmash_lib
>>> E1 = sourmash_lib.Estimators(n=20, ksize=3)
>>> E1.add_sequence(seq1)
>>> E2 = sourmash_lib.Estimators(n=20, ksize=3)
>>> E2.add_sequence(seq2)

One of the 3-mers (out of 4) overlaps, so Jaccard index is 1/4:

>>> E1.jaccard(E2)
0.25

and of course the estimators match themselves:

>>> E1.jaccard(E1)
1.0

We can add sequences and query at any time –

>>> E1.add_sequence(seq2)
>>> x = E1.jaccard(E2)
>>> round(x, 3)
0.571

Consuming files

Suppose we want to create MinHash sketches from genomes –

>>> import glob, pprint
>>> genomes = glob.glob('data/GCF*.fna.gz')
>>> genomes = list(sorted(genomes))
>>> pprint.pprint(genomes)
['data/GCF_000005845.2_ASM584v2_genomic.fna.gz',
 'data/GCF_000006945.1_ASM694v1_genomic.fna.gz',
 'data/GCF_000783305.1_ASM78330v1_genomic.fna.gz']

We have to read them in (here using screed), but then they can be fed into ‘add_sequence’ directly; here we set ‘force=True’ in add_sequence to ignore non-ACTGN characters.

(Note, just for speed reasons, we’ll truncate the sequences to 50kb in length.)

>>> import screed
>>> estimators = []
>>> for g in genomes:
...     E = sourmash_lib.Estimators(n=500, ksize=31)
...     for record in screed.open(g):
...         E.add_sequence(record.sequence[:50000], True)
...     estimators.append(E)

And now the estimators can be compared against each other:

>>> import sys
>>> for i, e in enumerate(estimators):
...    _ = sys.stdout.write(genomes[i][:20] + ' ')
...    for j, e2 in enumerate(estimators):
...       x = e.jaccard(estimators[j])
...       _ = sys.stdout.write(str(round(x, 3)) + ' ')
...    _= sys.stdout.write('\n')
data/GCF_000005845.2 1.0 0.0 0.0
data/GCF_000006945.1 0.0 1.0 0.0
data/GCF_000783305.1 0.0 0.0 1.0

Note that the comparisons are quite quick; most of the time is spent in making the estimators, which can be saved and loaded easily.

Saving and loading signature files

>>> from sourmash_lib import signature
>>> sig1 = signature.SourmashSignature('titus@idyll.org', estimators[0],
...                                    name=genomes[0][:20])
>>> with open('/tmp/genome1.sig', 'wt') as fp:
...   signature.save_signatures([sig1], fp)

Here, /tmp/genome1.sig is a YAML file that can now be loaded and compared – first, load:

>>> sigdata = open('/tmp/genome1.sig', 'rt').read()
>>> siglist = signature.load_signatures(sigdata)
>>> loaded_sig = siglist[0]

then compare:

>>> loaded_sig.estimator.jaccard(sig1.estimator)
1.0
>>> sig1.estimator.jaccard(loaded_sig.estimator)
1.0