Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2005/04/24/interactive_html

Interactive HTML

I want to make an interactive web page based on the graph output used in the previous essay. When the mouse is over one of the points on the plot I want it to show the compound identifier (from the PUBCHEM_COMPOUND_CID field) and when clicked I want to see the structure. Here's an example of the result in action, though with only 20 data points instead of the 100 I use in a bit.

To make it work I'll need three things. I need an HTML page which shows the scatter plot and the 2D depiction. The scatter plot will be an image with an associated client-side image map to support the mouse actions. I'll use OpenEye's mol2img to make GIFs of the 2D depictions. I'll put them in the imgs subdirectory and use a filename based on the structure's id.

I'll use Javascript for the interactivity instead of using CGI on the server. This make the page more responsive and the code simpler because I can do everything with URLs and data files. I don't even need a web server. It works because user events, like mouseovers and clicks on specific elements, can be configured to call a Javascript function. The Javascript function can in turn reach into the HTML document (through a data structure called a DOM, for Document Object Model) and change it.

I need a 2D depiction for each compound in my data set. OGHAM doesn't yet have public Python bindings and I don't want to write my own extension using OGHAM's C++ API. Instead I'll write a wrapper to $OE_DIR/bin/mol2gif and use it to make a 200x200 pixel GIF given the molecule's SMILES string for input. There's a chance that mol2gif will fail; most likely because the output file could not be created. I'll do some simple error checking but not as extensive as my smi2name wrapper.

import subprocess, os

def make_gif(smiles, filename, width = 200, height = 200):
    p = subprocess.Popen([os.environ["OE_DIR"] + "/bin/mol2gif",
                          "-width", str(width), "-height", str(height),
                          "-gif", "-", filename],
                         stdin = subprocess.PIPE,
                         stderr = subprocess.PIPE)
    p.stdin.write(smiles + "\n")
    p.stdin.close()
    errmsg = p.stderr.read()
    errcode = p.wait()
    if errcode:
        raise AssertionError("Could not save %r as an image to %r:\n%s" %
	                     (smiles, filename, errmsg))
This is not a general purpose function. For example, I don't check for strange characters in the SMILES because they will be created be OECreateCanSmiString(). Nor is it a coprocess-based implementation because mol2gif can only be used to generate a single image at a time.

I need to call this function for each compound in the data set. I have a function to read the compound id, XLogP and molecular weight of the first N compounds in the data set that have an XLogP. I want the SMILES string as well. I don't have access to the actual molecule read so I'll need to change the function, read the data file twice (once for the SD fields and once for the SMILES), or consider a new approach. I decided on the last because it shows a more general solution to this sort of task.

The existing read_data() function does two things; it selects the molecules to use and it extracts some data from each molecule. I can break that function into two parts; one to select the molecule and the other to extract data from a given molecule. The result might look like this:

ifs = oemolistream(filename)
cids = []; weights = []; xlogps = []
for i, mol in select_compounds_with_xlogp(ifs.GetOEGraphMols()):
    if i == 100:
        break
    cid, weight, xlogp = get_data(mol)
    cids.append(cid)
    weights.append(weight)
    xlogps.append(xlogp)
    smiles = OECreateCanSmiString(mol)
    ... make the image given the CID and SMILES ...
This is a bit more complicated than the original function because I need to make the list of cids, weights and xlopgs myself. On the other hand it has access to the molecule object, which makes it possible to get the SMILES strings. It's also more appropriate for cases where you don't need the list of values. Suppose you just wanted the compound id which has the largest XLogP. The following would work:
ifs = oemolistream(filename)
max_xlogp = None
max_cid = None
for mol in select_compounds_with_xlogp(ifs.GetOEGraphMols()):
    cid, weight, xlogp = get_data(mol)
    if xlogp > max_xlogp:
        max_xlogp = xlogp
        max_cid = cid
print max_cid, "has the largest XLogP:", xlogp

This is an iterator based approach. You can think of it as a pipeline or stream with different components: a source, transformers, filters, and a destination. The source here is a file, referenced by ifs. ("ifs" is short for "input file stream"). The GetOEGraphMols() converts the input stream into a stream of OEGraphMol objects. The select_compounds_with_xlogp() is a filter, and the destination is the mol used in the body of the loop.

I need to write two functions; one to filter a stream and return a stream of only those compounds that contain XLogP values, and the other to extract the CID, XLogP and MW from a molecule. I'll write the latter function first because it is the easier of the two:

def get_data(mol):
    cid = OEGetSDData(mol, "PUBCHEM_COMPOUND_CID")
    weight = OEGetSDData(mol, "PUBCHEM_OPENEYE_MW")
    xlogp = OEGetSDData(mol, "PUBCHEM_CACTVS_XLOGP")
    if (cid == "" or weight == "" or xlogp == ""):
        raise AssertionError( (cid, weight, xlogp) )

    return cid, float(weight), float(xlogp)
Writing a filter is easy because of Python's yield keyword. Here's one implementation:
def select_compounds_with_xlogp(mols):
    for mol in mols:
        if OEHasSDData(mol, "PUBCHEM_CACTVS_XLOGP"):
            yield mol
Iterators in Python have proved both easy and powerful. Some common patterns have emerged from experience. Many filters, for example, have a form similar to the previous function:
def ...name...(stream):
    for item in stream:
        if ..some.test..(item):
            yield item
The itertools module implements many of the common patterns and is very helpful when manipulating iterator streams. This specific pattern is in there and called ifilter() for "iterator filter". To work it needs to know how to test if an item in the stream should be used. This is done by passing it a function that returns True or False when passed an item. This class of functions are called predicates; OpenEye uses the same terminology for some of their test functions.

To use itertools I need to make a new predicate function that tests if a molecule has the XLogP SD tag. Predicate function names usually start with "is" or "has" and this is no exception.

# True only for those molecules with an XLOGP field
def has_xlogp(mol):
    return OEHasSDData(mol, "PUBCHEM_CACTVS_XLOGP")
I can then define the select_compounds_with_xlogp() function like this:
def select_compounds_with_xlogp(mols):
    return itertools.ifilter(has_xlogp, mols)

But I don't think that's the right way to write this code. Anyone trying to understand it needs to track down several very short functions definitions. It's unlikely the code will be reused elsewhere and having the function name doesn't make things significantly clearer so I prefer writing the itertools call directly in the code. It is somewhat of a stylism point though.

I only want to use the first 100 compounds with XLogP. There's an itertools function to help out with that called islice. Combining these and the sketch of the result looks like this

from itertools import *
...
# True only for those molecules with an XLOGP field
def has_xlogp(mol):
    return OEHasSDData(mol, "PUBCHEM_CACTVS_XLOGP")
...

def main():
    ...
    # Get the first 100 compounds that have an XLogP field
    for mol in islice(ifilter(has_xlogp, ifs.GetOEGraphMols()),
                      0, 100):
        
        cid, weight, xlogp = get_data(mol)
        smiles = OECreateCanSmiString(mol)
        ...

Now that I can get all the needed data I'll add the code to make an image file for each compound. I can do that now that I have the SMILES. I'll use the make_gif() function and save it to the directory "imgs". If the directory doesn't exist when the function starts I'll create it. The details are pretty understandable so I'll simply show it, with the image creation parts in bold.

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    ifs = oemolistream(filename)

    imgdir = "imgs"
    if not os.path.isdir(imgdir):
        os.mkdir(imgdir)

    # Width and height for each compound image, in pixels
    cmpd_width = cmpd_height = 320

    cids = []
    weights = []
    xlogps = []
    imgnames = []
    # Get the first 100 compounds that have an XLogP field
    for mol in islice(ifilter(has_xlogp, ifs.GetOEGraphMols()),
                      0, 100):
        
        cid, weight, xlogp = get_data(mol)

        imgname = os.path.join(imgdir, "%s.gif" % (cid,))
        make_gif(OECreateCanSmiString(mol), imgname,
                 cmpd_width, cmpd_height)

        cids.append(cid)
        weights.append(weight)
        xlogps.append(xlogp)
        imgnames.append(imgname)
Next is the code to create and save the scatter plot. To make the image map I'll need to convert from data coordinates to the scatter plot's image coordinates. The transformation is stored in the scatter plot object so I'll keep a reference to it for later. I'll also need the figures image width and height.
    fig = Figure(figsize=(4,4))
    ax = fig.add_subplot(111)

    sc = ax.scatter(weights, xlogps)
    
    ax.set_xlabel("Atomic weight")
    ax.set_ylabel("CACTVS XLogP")

    # Make the PNG and get the scatter plot image size
    canvas = FigureCanvasAgg(fig)
    canvas.print_figure("mw_v_xlogp.png", dpi=80)
    img_width = fig.get_figwidth() * 80
    img_height = fig.get_figheight() * 80

The last part is to make the HTML page with the client-side image map. The image map needs to be in image coordinates not data coordinates. The scatter plot has a Transform object that stores the transformation information. Its seq_x_y_ method converts the coordinates in the needed direction.

    # Convert the data set points into screen space coordinates
    trans = sc.get_transform()
    xcoords, ycoords = trans.seq_x_y(weights, xlogps)

The image map uses "circle" elements for each point, centered at the data point and with a 5 pixel radius. (The 5 comes from testing. I could use the marker size from matplotlib but figured that was too complicated for this example.) The image map coordinates have y=0 at the top of the image with increasing y values going down. The matplotlib coordinates have y=0 at the bottom of the image with increasing y values going up. So I'll need to convert between these two.

Each image map circle has two Javascript actions, one for when the mouse goes over the area and the other for mouse clicks. On mouseovers I want to display the compound name at the top of the web page. The function does this by finding the HMTL element I want to modify (in this case a '<SPAN>' with the id of "cid") and changing its content – its innerHtml – to the new name. To make this work I need to make sure that element exists with that id.

The other Javascript function is called when the image map circle is clicked on. It is called with the URL for the 2D depiction. In this case it's a relative URL so works with both http and file URLs. The function finds the document element with the id cmpd_img, which is an '<IMG>' element. Setting its src property tells the element to load the new image from the given source. To the user, clicking on the element brings up the corresponding structure.

That all said, here's the code to generate the HTML file:

    f = open("mw_v_xlogp.html", "w")
    f.write('''<HTML>
<HEAD>
 <TITLE>MW vs. XLogP</TITLE>
</HEAD>
<BODY>
<SCRIPT>
function mouseover(name) {
  var cid = document.getElementById("cid");
  cid.innerHTML = name;
}
function show(filename) {
  var cmpd_img = document.getElementById("cmpd_img");
  cmpd_img.src = filename;
}

</SCRIPT>
Mouse is over: <SPAN id="cid"></SPAN><BR>
Pick a point to see the depiction<BR>
<IMG SRC="mw_v_xlogp.png" ismap usemap="#points"
  WIDTH="%d" HEIGHT="%d">
<IMG ID="cmpd_img" WIDTH="%d" HEIGHT="%d">
<MAP name="points">
''' % (img_height, img_width, cmpd_height, cmpd_width))

    # Figure height in pixels.
    # HTML image coordinates have y=0 on the top.  Matplotlib
    # has y=0 on the bottom.  We'll need to flip the numbers
    for cid, x, y, imgname in zip(cids, xcoords, ycoords, imgnames):
        f.write('''<AREA shape="circle" coords="%d,%d,5"
onmouseover="javascript:mouseover('%s');"
href="javascript:show('%s');">\n''' %
                (x, img_height-y, cid, imgname))

    f.write("</MAP>\n</BODY></HTML>\n")
It's pretty messy because it's mixing HTML and Python code, with the Python code in charge. In the next essay I'll show how to use a templating system to seperate these two.

Putting it all together, here's the complete code:

import subprocess, os
from itertools import *

from openeye.oechem import *

from matplotlib.figure import Figure
from matplotlib.patches import Polygon
from matplotlib.backends.backend_agg import FigureCanvasAgg
import matplotlib.numerix as nx

def make_gif(smiles, filename, width = 200, height = 200):
    p = subprocess.Popen([os.environ["OE_DIR"] + "/bin/mol2gif",
                          "-width", str(width), "-height", str(height),
                          "-gif", "-", filename],
                         stdin = subprocess.PIPE,
                         stderr = subprocess.PIPE)
    p.stdin.write(smiles + "\n")
    p.stdin.close()
    errmsg = p.stderr.read()
    errcode = p.wait()
    if errcode:
        raise AssertionError("Could not save %r as an image to %r:\n%s" %
	                     (smiles, filename, errmsg))
                     

# True only for those molecules with an XLOGP field
def has_xlogp(mol):
    return OEHasSDData(mol, "PUBCHEM_CACTVS_XLOGP")

def get_data(mol):
    cid = OEGetSDData(mol, "PUBCHEM_COMPOUND_CID")
    weight = OEGetSDData(mol, "PUBCHEM_OPENEYE_MW")
    xlogp = OEGetSDData(mol, "PUBCHEM_CACTVS_XLOGP")
    if (cid == "" or weight == "" or xlogp == ""):
        raise AssertionError( (cid, weight, xlogp) )

    return cid, float(weight), float(xlogp)

def main():
    filename = "/Users/dalke/databases/compounds_500001_510000.sdf.gz"
    ifs = oemolistream(filename)

    imgdir = "imgs"
    if not os.path.isdir(imgdir):
        os.mkdir(imgdir)

    # Width and height for each compound image, in pixels
    cmpd_width = cmpd_height = 320

    cids = []
    weights = []
    xlogps = []
    imgnames = []
    # Get the first 100 compounds that have an XLogP field
    for mol in islice(ifilter(has_xlogp, ifs.GetOEGraphMols()),
                      0, 100):
        
        cid, weight, xlogp = get_data(mol)

        imgname = os.path.join(imgdir, "%s.gif" % (cid,))
        make_gif(OECreateCanSmiString(mol), imgname,
                 cmpd_width, cmpd_height)

        cids.append(cid)
        weights.append(weight)
        xlogps.append(xlogp)
        imgnames.append(imgname)
    
    fig = Figure(figsize=(4,4))
    ax = fig.add_subplot(111)

    sc = ax.scatter(weights, xlogps)
    
    ax.set_xlabel("Atomic weight")
    ax.set_ylabel("CACTVS XLogP")

    # Make the PNG and get the scatter plot image size
    canvas = FigureCanvasAgg(fig)
    canvas.print_figure("mw_v_xlogp.png", dpi=80)
    img_width = fig.get_figwidth() * 80
    img_height = fig.get_figheight() * 80

    # Convert the data set points into screen space coordinates
    trans = sc.get_transform()
    xcoords, ycoords = trans.seq_x_y(weights, xlogps)

    f = open("mw_v_xlogp.html", "w")
    f.write('''<HTML>
<HEAD>
 <TITLE>MW vs. XLogP</TITLE>
</HEAD>
<BODY>
<SCRIPT>
function mouseover(name) {
  var cid = document.getElementById("cid");
  cid.innerHTML = name;
}
function show(filename) {
  var cmpd_img = document.getElementById("cmpd_img");
  cmpd_img.src = filename;
}

</SCRIPT>
Mouse is over: <SPAN id="cid"></SPAN><BR>
Pick a point to see the depiction<BR>
<IMG SRC="mw_v_xlogp.png" ismap usemap="#points"
  WIDTH="%d" HEIGHT="%d">
<IMG ID="cmpd_img" WIDTH="%d" HEIGHT="%d">
<MAP name="points">
''' % (img_height, img_width, cmpd_height, cmpd_width))

    # Figure height in pixels.
    # HTML image coordinates have y=0 on the top.  Matplotlib
    # has y=0 on the bottom.  We'll need to flip the numbers
    for cid, x, y, imgname in zip(cids, xcoords, ycoords, imgnames):
        f.write('''<AREA shape="circle" coords="%d,%d,5"
onmouseover="javascript:mouseover('%s');"
href="javascript:show('%s');">\n''' %
                (x, img_height-y, cid, imgname))

    f.write("</MAP>\n</BODY></HTML>\n")
    
if __name__ == "__main__":
    OEThrow.SetLevel(OEErrorLevel_Error)
    main()


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB