Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2005/04/26/extending_python

Extending Python

Another way to extend Python is to write an interface to an existing C or C++ library. Earlier I showed in detail how to write a wrapper to the command-line version of OpenEye's OGHAM program mol2nam. I used it as a teaching example because it was simple code that showed some of the problems of working with external programs.

There are better ways to call the IUPAC naming code from Python. OpenEye doesn't provide Python bindings to it but they do provide the needed C++ libraries and example code for compiling it (in $OECHEM/examples/ogham.cpp). That's enough information to write my own extension for just that one function. At the end of this essay I point to a few alternative ways to write an extension.

The Python documentation has detailed examples of how to write a Python extension and others have written additional documentation, so I won't go into details. If you need examples you can easily look at existing extensions to see how they are done.

I need to lay out my new C++ code in the way that Python expects. Its filename is smi2name.cpp and the name of the new shared library will be "_smi2name". (Often C extensions used by module "X" are given a leading underscore, as in "_X".)

/* This file is named smi2name.cpp */

#include "Python.h"

#include "openeye.h"
#include "oesystem.h"
#include "oechem.h"
#include "oeiupac.h"

using namespace OESystem;
using namespace OEChem;

static PyObject *
smi2name(PyObject *self, PyObject *args) {
  const char *smiles;
  OEMol mol;

  if (!PyArg_ParseTuple(args, "s", &smiles)) {
    return NULL;
  }

  /* If there is a failure, simply return None. */
  /* I could raise an exception instead but this is easier */
  if (!OEParseSmiles(mol, smiles)) {
    /* None is reference counted like any other object */
    Py_INCREF(Py_None);
    return Py_None;
  }

  /* Compute the IUPAC name and return it to Python */
  std::string name = OEIUPAC::OECreateIUPACName(mol);
  return Py_BuildValue("s", name.c_str());
}

/* Set up the method table. */
static PyMethodDef _smi2name_methods[] = {
  {"smi2name", smi2name, METH_VARARGS, "convert a SMILES to an IUPAC name"},
  {NULL, NULL, 0, NULL}   /* Sentinel */
};

/* This function must be named "init" + <modulename> */
/* Because the module is "_smi2name" the function is "init_smi2name" */
PyMODINIT_FUNC
init_smi2name(void) {
  (void) Py_InitModule("_smi2name", _smi2name_methods);
}

Again, the details of things like reference counting are explained in the Python documentation. They are not complicated but do require close attention to detail.

The next step is to build the shared library. The easiest way to do this is to use the distutils package, whose documentation has a section on building extensions. I just need to make a setup.py file with the needed configuration in it and distutils does the rest.

import os
from distutils.core import setup, Extension

OE_INCLUDE = os.path.join(os.environ["OE_DIR"], "include")
OE_LIB = os.path.join(os.environ["OE_DIR"], "lib")

# Check that we're pointed at roughly the right place
oechem_h = os.path.join(OE_INCLUDE, "oechem.h")
if not os.path.exists(oechem_h):
    raise AssertionError("Cannot find oechem.h at %r" % (oechem_h,))

setup(name="smi2name",  # package name is a placeholder
      ext_modules=[Extension('_smi2name', ['smi2name.cpp'],
                             include_dirs = [OE_INCLUDE],
                             library_dirs = [OE_LIB],
                             libraries = ["oeiupac", "oechem", "oesystem",
                                          "oeplatform", "z", "m"])
                   ])

This setup.py does a very basic check to test if the OE_DIR environment variable is correct by looking for the oechem.h file in the include directory. If the file isn't there, setup.py raises an AssertionError and stops. This check is a bit strict for real use, but fine for now.
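For real use the check could report problems more gently. The following is only a sketch of one way to do that: the helper name find_oe_dirs is hypothetical, and it is written for a current Python 3 rather than the Python 2.3 used elsewhere in this essay. It exits with a one-line message when OE_DIR is unset or doesn't contain oechem.h, instead of showing a traceback.

```python
import os


def find_oe_dirs(environ=os.environ):
    """Return (include_dir, lib_dir) for the directory named by OE_DIR.

    Hypothetical helper: exits with a readable message instead of a
    KeyError or AssertionError traceback.
    """
    oe_dir = environ.get("OE_DIR")
    if not oe_dir:
        raise SystemExit("Set the OE_DIR environment variable first")
    include_dir = os.path.join(oe_dir, "include")
    lib_dir = os.path.join(oe_dir, "lib")
    # Same sanity check as in setup.py: the header must be present
    oechem_h = os.path.join(include_dir, "oechem.h")
    if not os.path.exists(oechem_h):
        raise SystemExit("Cannot find oechem.h at %r" % (oechem_h,))
    return include_dir, lib_dir
```

In setup.py the two module-level assignments could then become OE_INCLUDE, OE_LIB = find_oe_dirs().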

To compile the extension run python setup.py build in the shell. Here's what it looks like for me (with line wraps included for clarity):

% python setup.py build
running build
running build_ext
building '_smi2name' extension
creating build
creating build/temp.darwin-7.9.0-Power_Macintosh-2.3
gcc -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd
 -fno-common -dynamic -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
 -c smi2name.cpp -o build/temp.darwin-7.9.0-Power_Macintosh-2.3/smi2name.o
creating build/lib.darwin-7.9.0-Power_Macintosh-2.3
c++ -Wl,-F. -Wl,-F. -bundle
 -framework Python build/temp.darwin-7.9.0-Power_Macintosh-2.3/smi2name.o
 -L/usr/local/openeye/lib -loeiupac -loechem -loesystem -loeplatform -lz
 -lm -o build/lib.darwin-7.9.0-Power_Macintosh-2.3/_smi2name.so

The result is put under the build/ directory. The location is different for different machines. To test it out you can make a symbolic link from the created .so file to the current directory:

ln -s build/lib.darwin-7.9.0-Power_Macintosh-2.3/_smi2name.so .
          #    ^^^^^ change as appropriate ^^^^^
(Or set your PYTHONPATH, but this is easier.)
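A third option is to extend sys.path from inside Python before importing the module. This is a rough sketch, not part of distutils: the function name add_build_dir is made up for illustration, it uses Python 3 syntax, and it assumes the build/lib.* directory layout shown in the build log above.

```python
import glob
import os
import sys


def add_build_dir(top="."):
    """Put the first build/lib.* directory under `top` on sys.path.

    Returns the directory that was added, or None if there isn't one.
    """
    matches = sorted(glob.glob(os.path.join(top, "build", "lib.*")))
    if not matches:
        return None
    build_dir = os.path.abspath(matches[0])
    if build_dir not in sys.path:
        sys.path.insert(0, build_dir)
    return build_dir
```

After calling add_build_dir() from the directory containing setup.py, a plain import of _smi2name should find the freshly built shared library without the symbolic link.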

It's built; time to test it.

% python
Python 2.3 (#1, Sep 13 2003, 00:49:11) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import _smi2name
>>> _smi2name.smi2name("c1ccccc1O")
>>> print _smi2name.smi2name.__doc__
convert a SMILES to an IUPAC name
>>> result = _smi2name.smi2name("C1CCC")
Warning: Error parsing SMILES:
Warning: Unclosed ring.
Warning: C1CCC
Warning:     ^

>>> result is None
True

In this example I printed the docstring to show that the text from the C++ code is available in Python. I also showed that OEChem still writes its error messages to stderr when it can't parse a SMILES string. That parse failure is why the second smi2name call returned None.
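Because those warnings come from compiled code, replacing sys.stderr in Python would not catch them. One general technique is to temporarily point file descriptor 2 at a temporary file. The sketch below shows the idea in Python 3. The name capture_stderr_fd is hypothetical, and whether it sees every OEChem message depends on how the library buffers its output, which is an assumption here and not something tested against OEChem.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def capture_stderr_fd():
    """Collect everything written to file descriptor 2 inside the block.

    Yields a list; on exit the captured text is appended to it.
    """
    captured = []
    sys.stderr.flush()
    saved_fd = os.dup(2)                  # remember the real stderr
    with tempfile.TemporaryFile() as tmp:
        os.dup2(tmp.fileno(), 2)          # point fd 2 at the temp file
        try:
            yield captured
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)          # restore the real stderr
            os.close(saved_fd)
            tmp.seek(0)
            captured.append(tmp.read().decode("utf-8", "replace"))
```

Calling _smi2name.smi2name("C1CCC") inside "with capture_stderr_fd() as messages:" would then leave any text written to descriptor 2 in messages[0] once the block ends.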

The next step would be to make a smi2name.py module that provides the primary interface to the rest of Python. It should implement the previous API. Sadly, that isn't as easy as it seems. The previous code could extract the SMILES parsing error messages, and this version can't because it's tricky to get that data.
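As a starting point, here is a sketch of the part of such a module that is easy: turning the None return value into an exception. It does not reproduce the earlier API. SmilesError and make_smi2name are hypothetical names, the code uses Python 3 syntax, and the backend is passed in as a parameter so the wrapper can be tried without the compiled extension.

```python
class SmilesError(ValueError):
    """Hypothetical exception for a SMILES string that could not be parsed."""


def make_smi2name(backend):
    """Wrap a function that returns None on failure so it raises instead.

    `backend` would normally be _smi2name.smi2name.
    """
    def smi2name(smiles):
        name = backend(smiles)
        if name is None:
            # The message cannot say what went wrong; that detail only
            # went to stderr inside the C++ library.
            raise SmilesError("Cannot parse SMILES %r" % (smiles,))
        return name
    return smi2name
```

Inside smi2name.py this could be used as smi2name = make_smi2name(_smi2name.smi2name) after importing _smi2name.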

Nevertheless, it does show that writing a C++ extension for Python isn't too hard.

There are other approaches for writing an extension. I wrote the interface by hand. If the library has more than a few tens of functions that gets boring really fast. Much of the work is rote and repetitive and can be done by machine. The best known of the tools for this is SWIG, which generates interfaces to C and C++ code for Python, Tcl, Perl, Ruby, and several other languages. I use SWIG in PyDaylight.

SWIG builds the interface from the header file, but if the header is too complicated then a human has to write an interface file. The basic problem is that C++ code is very complicated to parse. A more recent approach is Pyste, which is part of the Boost project. Pyste uses gcc to parse the header files into an XML format and then reads the XML to generate the actual interface code.

Even if you don't use Pyste, Boost includes a library, Boost.Python, to simplify writing a C++/Python interface.

The approaches described above all require a compiler. The ctypes package uses a different approach, often called an FFI, for "foreign function interface". It can load a shared library and call its functions directly. The downside is that a mistake in describing how to make the call can take the whole program down.
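To give a feel for the ctypes style, here is a small sketch that calls strlen from the C runtime library. It is not OEChem code, since a C++ library can't be called this simply. It assumes a Unix-like system and a current Python 3, where ctypes is part of the standard library. The wrapper name c_strlen is made up for this example.

```python
import ctypes
import ctypes.util

# Load the C runtime library. If find_library() returns None then
# CDLL(None) uses the symbols already loaded into the process, which
# works on Unix-like systems.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Describe the C signature: size_t strlen(const char *s)
# Getting this description wrong is the kind of mistake that can
# crash the interpreter instead of raising an exception.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Call the C strlen() on a Python byte string."""
    return libc.strlen(text)
```

No compiler is involved: the argtypes and restype lines play the role that the hand-written C++ wrapper played earlier.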

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.

Copyright © 2001-2013 Andrew Dalke Scientific AB