Building plots from sourmash compare output

Running this notebook.

You can run this notebook interactively via mybinder; click on this button: Binder

A rendered version of this notebook is available at sourmash.readthedocs.io under “Tutorials and notebooks”.

You can also get this notebook from the doc/ subdirectory of the sourmash github repository. See binder/environment.yaml for installation dependencies.

What is this?

This is a Jupyter Notebook using Python 3. If you are running this via binder, you can use Shift-ENTER to run cells, and double click on code cells to edit them.

Contact: C. Titus Brown, ctbrown@ucdavis.edu. Please file issues on GitHub if you have any questions or comments!

Running sourmash compare and generating figures in Python

First, we need to generate a similarity matrix with compare. (If you want to generate this programmatically, it’s just a numpy matrix.)

[1]:
!sourmash compare ../tests/test-data/demo/*.sig -o compare-demo

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 7 signatures total.

0-SRR2060939_1.fa...    [1.    0.356 0.078 0.086 0.    0.    0.   ]
1-SRR2060939_2.fa...    [0.356 1.    0.072 0.078 0.    0.    0.   ]
2-SRR2241509_1.fa...    [0.078 0.072 1.    0.074 0.    0.    0.   ]
3-SRR2255622_1.fa...    [0.086 0.078 0.074 1.    0.    0.    0.   ]
4-SRR453566_1.fas...    [0.    0.    0.    0.    1.    0.382 0.364]
5-SRR453569_1.fas...    [0.    0.    0.    0.    0.382 1.    0.386]
6-SRR453570_1.fas...    [0.    0.    0.    0.    0.364 0.386 1.   ]
min similarity in matrix: 0.000
saving labels to: compare-demo.labels.txt
saving comparison matrix to: compare-demo
[2]:
%pylab inline
# import the `fig` module from sourmash:
from sourmash import fig
%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib

The sourmash.fig module contains code to load the similarity matrix and associated labels:

[3]:
matrix, labels = fig.load_matrix_and_labels("compare-demo")

Here, matrix is a numpy matrix and labels is a list of labels (by default, filenames).

[4]:
print("matrix:\n", matrix)
print("labels:", labels)
matrix:
 [[1.    0.356 0.078 0.086 0.    0.    0.   ]
 [0.356 1.    0.072 0.078 0.    0.    0.   ]
 [0.078 0.072 1.    0.074 0.    0.    0.   ]
 [0.086 0.078 0.074 1.    0.    0.    0.   ]
 [0.    0.    0.    0.    1.    0.382 0.364]
 [0.    0.    0.    0.    0.382 1.    0.386]
 [0.    0.    0.    0.    0.364 0.386 1.   ]]
labels: ['SRR2060939_1.fastq.gz', 'SRR2060939_2.fastq.gz', 'SRR2241509_1.fastq.gz', 'SRR2255622_1.fastq.gz', 'SRR453566_1.fastq.gz', 'SRR453569_1.fastq.gz', 'SRR453570_1.fastq.gz']

The plot_composite_matrix function returns a generated plot, along with the labels and matrix as re-ordered by the clustering:

[5]:
f, reordered_labels, reordered_matrix = fig.plot_composite_matrix(matrix, labels)
_images/plotting-compare_11_0.png
[6]:
print("reordered matrix:\n", reordered_matrix)
print("reordered labels:", reordered_labels)
reordered matrix:
 [[1.    0.382 0.364 0.    0.    0.    0.   ]
 [0.382 1.    0.386 0.    0.    0.    0.   ]
 [0.364 0.386 1.    0.    0.    0.    0.   ]
 [0.    0.    0.    1.    0.356 0.078 0.086]
 [0.    0.    0.    0.356 1.    0.072 0.078]
 [0.    0.    0.    0.078 0.072 1.    0.074]
 [0.    0.    0.    0.086 0.078 0.074 1.   ]]
reordered labels: ['SRR453566_1.fastq.gz', 'SRR453569_1.fastq.gz', 'SRR453570_1.fastq.gz', 'SRR2060939_1.fastq.gz', 'SRR2060939_2.fastq.gz', 'SRR2241509_1.fastq.gz', 'SRR2255622_1.fastq.gz']

Customizing plots

If you want to customize the plots, please see the code for plot_composite_matrix in sourmash/fig.py, which is reproduced below; you can modify the code in place to (for example) use custom dendrogram colors.

[7]:
import scipy.cluster.hierarchy as sch


def plot_composite_matrix(
    D, labeltext, show_labels=True, vmax=1.0, vmin=0.0, force=False
):
    """Build a composite plot showing dendrogram + distance matrix/heatmap.

    Returns a matplotlib figure.

    If show_labels is True, display labels. Otherwise, no labels are
    shown on the plot.
    """
    if D.max() > 1.0 or D.min() < 0.0:
        error(
            "This matrix doesn't look like a distance matrix - min value {}, max value {}",
            D.min(),
            D.max(),
        )
        if not force:
            raise ValueError("not a distance matrix")
        else:
            notify("force is set; scaling to [0, 1]")
            D -= D.min()
            D /= D.max()

    if show_labels:
        pass

    fig = pylab.figure(figsize=(11, 8))
    ax1 = fig.add_axes([0.09, 0.1, 0.2, 0.6])

    # plot dendrogram
    Y = sch.linkage(D, method="single")  # centroid

    Z1 = sch.dendrogram(
        Y,
        orientation="left",
        labels=labeltext,
        no_labels=not show_labels,
        get_leaves=True,
    )
    ax1.set_xticks([])

    xstart = 0.45
    width = 0.45
    if not show_labels:
        xstart = 0.315
    scale_xstart = xstart + width + 0.01

    # re-order labels along rows, top to bottom
    idx1 = Z1["leaves"]
    reordered_labels = [labeltext[i] for i in idx1]

    # reorder D by the clustering in the dendrogram
    D = D[idx1, :]
    D = D[:, idx1]

    # show matrix
    axmatrix = fig.add_axes([xstart, 0.1, width, 0.6])

    im = axmatrix.matshow(
        D, aspect="auto", origin="lower", cmap=pylab.cm.YlGnBu, vmin=vmin, vmax=vmax
    )
    axmatrix.set_xticks([])
    axmatrix.set_yticks([])

    # Plot colorbar.
    axcolor = fig.add_axes([scale_xstart, 0.1, 0.02, 0.6])
    pylab.colorbar(im, cax=axcolor)

    return fig, reordered_labels, D
[8]:
_ = plot_composite_matrix(matrix, labels)
_images/plotting-compare_16_0.png
[ ]: