============================ ``sourmash`` Python examples ============================ A very simple example: two k-mers --------------------------------- Define two sequences: >>> seq1 = "ATGGCA" >>> seq2 = "AGAGCA" Create two estimators using 3-mers, and add the sequences: >>> import sourmash_lib >>> E1 = sourmash_lib.Estimators(n=20, ksize=3) >>> E1.add_sequence(seq1) >>> E2 = sourmash_lib.Estimators(n=20, ksize=3) >>> E2.add_sequence(seq2) One of the 3-mers (out of 4) overlaps, so Jaccard index is 1/4: >>> E1.jaccard(E2) 0.25 and of course the estimators match themselves: >>> E1.jaccard(E1) 1.0 We can add sequences and query at any time -- >>> E1.add_sequence(seq2) >>> x = E1.jaccard(E2) >>> round(x, 3) 0.571 Consuming files --------------- Suppose we want to create MinHash sketches from genomes -- >>> import glob, pprint >>> genomes = glob.glob('data/GCF*.fna.gz') >>> genomes = list(sorted(genomes)) >>> pprint.pprint(genomes) ['data/GCF_000005845.2_ASM584v2_genomic.fna.gz', 'data/GCF_000006945.1_ASM694v1_genomic.fna.gz', 'data/GCF_000783305.1_ASM78330v1_genomic.fna.gz'] We have to read them in (here using screed), but then they can be fed into 'add_sequence' directly; here we set 'force=True' in ``add_sequence`` to ignore non-ACTGN characters. (Note, just for speed reasons, we'll truncate the sequences to 50kb in length.) >>> import screed >>> estimators = [] >>> for g in genomes: ... E = sourmash_lib.Estimators(n=500, ksize=31) ... for record in screed.open(g): ... E.add_sequence(record.sequence[:50000], True) ... estimators.append(E) And now the estimators can be compared against each other: >>> import sys >>> for i, e in enumerate(estimators): ... _ = sys.stdout.write(genomes[i][:20] + ' ') ... for j, e2 in enumerate(estimators): ... x = e.jaccard(estimators[j]) ... _ = sys.stdout.write(str(round(x, 3)) + ' ') ... _= sys.stdout.write('\n') data/GCF_000005845.2 1.0 0.0 0.0 data/GCF_000006945.1 0.0 1.0 0.0 data/GCF_000783305.1 0.0 0.0 1.0 Note that the comparisons are quite quick; most of the time is spent in making the estimators, which can be saved and loaded easily. Saving and loading signature files ---------------------------------- >>> from sourmash_lib import signature >>> sig1 = signature.SourmashSignature('titus@idyll.org', estimators[0], ... name=genomes[0][:20]) >>> with open('/tmp/genome1.sig', 'wt') as fp: ... signature.save_signatures([sig1], fp) Here, ``/tmp/genome1.sig`` is a YAML file that can now be loaded and compared -- first, load: >>> sigdata = open('/tmp/genome1.sig', 'rt').read() >>> siglist = signature.load_signatures(sigdata) >>> loaded_sig = siglist[0] then compare: >>> loaded_sig.estimator.jaccard(sig1.estimator) 1.0 >>> sig1.estimator.jaccard(loaded_sig.estimator) 1.0