Diary RSS | All of Andrew's writings | Diary archive
Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.Leipzig outline and flyer #
I'm leaving tomorrow for EuroCUP in Germany. I will be looking for
more students for my upcoming training course in cheminformatics web
development. To the left is a copy of the flyer I'll be handing
out, where you can see that it's on 18-20 May in Leipzig.
I spent some more time working on the details of what I'll be teaching. The "Hello, world" of cheminformatics is canonical SMILES generation, so do that first on the command-line then show how to get that working inside of Django, along with depiction.
I'll build on this to do SMARTS matching, displaying first the number of matches and building up to depicting a table of all substructure matches.
But wait! Who types in SMILES when there's compound ids in the corporate database? I'll walk through how to connect to a SQL database to get the SMILES and other data fields.From that, I'll go through various ways to calculate molecular properties, from molecular weight to fingerprint similarity to calling InChI through the command-line.
SQL databases are easy to search, so we'll set up a more complex query and display the list of results, with the option of downloading the results as SD, CSV, or Excel files.
This will lead to some user personalization, to show how to save the last few results for a user, so they can quickly be retrieved later.
The last part of the course will cover using Javascript to get structure data from JMol, to display tabular results in a sortable table, and to use AJAX to make more interactive pages.
All this, and more, in only three days.
If you want to sign up for the course, let me know!
How to compile openCHORD on a Mac #
In my upcoming training course I will teach how to develop web applications for cheminformatics. A few people have signed up already but there are still slots open for those interested. It will be in Leipzig in the latter part of May.
One topic will be connecting to a remote relational database to do chemistry queries. I decided to base it on TJ O'Donnell's openCHORD, which is a chemistry extension on top of PostgreSQL. TJ quite literally wrote the book on the Design and Use of Relational Database in Chemistry, which uses CHORD for its examples.
I am not a database person. I know enough to be able to use it, and design schemas. Most databases are designed with a keeper in mind (called a DBA, for "DataBase Administrator"), who knows and cares a lot about how the database works. I'm not one of those people. I dread each time I have to work with a database on my own.
(One exception is SQLite. It's designed to have zero administration, and because it's an in-process database it doesn't have the hassles of connecting to a database process that MySQL, PostgreSQL, Oracle and others have. But it doesn't make sense for pharmas to keep their corporate data in SQLite.)
TJ is a database person and has been using PostgreSQL for many years. He made his openCHORD code available. It's designed for someone like him who knows how to make the pieces go together. I don't have that experience so I struggled for a while. Here's my step-by-step process. I hope it helps me next time I need to do this, and you, if you want to try it out.
Prerequisites
openCHORD uses C and Python code to extend PostgreSQL functionality. It uses OpenBabel for the chemistry. This means you'll need to have Python installed (which is always the case on a Mac) as well as gcc (available in the developer tools). You'll also need to install OpenBabel, including the Python interface.
Install PostgreSQL with server- and client-side Python support
First I needed to install PostgreSQL, along with support for server-extensions written in Python ("plpythonu") and the Python client code.
I'm using a Mac. I have MacPorts installed. That's a package management system for a lot of the freely available Unix packages. My first step was to ask for the Python client library for PostgreSQL and see what happens:
% sudo port install py-pygresql Password: ---> Computing dependencies for py-pygresql ---> Fetching postgresql83 ---> Attempting to fetch postgresql-8.3.8.tar.bz2 from http://arn.se.distfiles.macports.org/postgresql83 ---> Attempting to fetch postgresql-8.3.8.tar.bz2 from http://ftp7.de.postgresql.org/ftp.postgresql.org/source/v8.3.8/ ---> Verifying checksum(s) for postgresql83 ---> Extracting postgresql83 ...
You see it installed version 8.3.8. The most recent version is 8.4.3 and the most recent in the 8.3 series is 8.3.10. But it should be plenty good enough for my needs.
plpython
However! After working through TJ's instructions I got a failure trying to execute:
sudo -u $owner createdb openchordwhere "$owner" was superuser database owner, in my case "dalke" but it's often "postgres".
It gave me an error which included the message
could not access file "$libdir/plpython"That's because PostgreSQL needs to be compiled with support for Python extensions in the server. The default ports installation does not do that. Instead, I need to request the "+python" install option
% sudo port install -d postgresql83-server +python ---> Computing dependencies for postgresql83-server ---> Fetching postgresql83-server ---> Verifying checksum(s) for postgresql83-server ---> Extracting postgresql83-server ---> Configuring postgresql83-server ---> Building postgresql83-server ---> Staging postgresql83-server into destroot ---> Creating launchd control script ########################################################### # A startup item has been generated that will aid in # starting postgresql83-server with launchd. It is disabled # by default. Execute the following command to start it, # and to cause it to launch at startup: # # sudo launchctl load -w /Library/LaunchDaemons/org.macports.postgresql83-server.plist ########################################################### ---> Installing postgresql83-server @8.3.8_0 To create a database instance, after install do sudo mkdir -p /opt/local/var/db/postgresql83/defaultdb sudo chown postgres:postgres /opt/local/var/db/postgresql83/defaultdb sudo su postgres -c '/opt/local/lib/postgresql83/bin/initdb -D /opt/local/var/db/postgresql83/defaultdb' To tweak your DBMS, consider increasing kern.sysv.shmmax by adding an increased kern.sysv.shmmax .. to /etc/sysctl.conf ---> Activating postgresql83-server @8.3.8_0 ---> Cleaning postgresql83-server
PostreSQL is now installed. Make sure its installation directory is on your shell path. If you used Mac ports and installed version 8.3 like me (and use the same shell I do) then that's:
set path= ( $path /opt/local/lib/postgresql85/bin )
Initialize database storage with initdb
The next step is to initialized storage for the database. I'm going to put them under ~/teaching/pg .
% initdb -D ~/teaching/pg
The files belonging to this database system will be owned by user "dalke".
This user must also own the server process.
The database cluster will be initialized with locale en_US.UTF-8.
The default database encoding has accordingly been set to UTF8.
The default text search configuration will be set to "english".
creating directory /Users/dalke/teaching/pg ... ok
creating subdirectories ... ok
selecting default max_connections ... 20
selecting default shared_buffers/max_fsm_pages ... 1600kB/20000
creating configuration files ... ok
creating template1 database in /Users/dalke/teaching/pg/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.
Success. You can now start the database server using:
postgres -D /Users/dalke/teaching/pg
or
pg_ctl -D /Users/dalke/teaching/pg -l logfile start
All this did was initialize storage space. It did not start the
database for you. I'll do that after I install the openCHORD
extension.
Installing openCHORD
Now that you know the base PostgreSQL installation works, the next step is to install the openCHORD extension. The source file is named openchord.tgz. It unpacks to the local directory so I did:
% curl -O http://www.gnova.com/demos/openchord.tgz % mkdir openchord % cd openchord % tar -xvzf ../openchord.tgz x amw.sql x bits.sql x frowns-core.sql x makedb x Makefile x makenci x openbabel-core.sql x openchord.py x perlmol-core.sql x public166keys.sql x sdfloader x similarity.sql x splitter.sql x tpsa.sql x varbit.c x varbit.so x varbit.sql x vla4.sdf
There are three installation phases:
- build and install a shared library
- execute a set of SQL functions which define the extension functions and database schemas
- install a Python module
Build and install the shared library
The shared library "varbit.so" defines a few functions to help work with bitsets, which are a built-in datatype in PostgreSQL. openCHORD comes with a Makefile but it only works for Linux and is hard-coded for PostgreSQL 8.2 installed under /usr. I have a Mac, a different database version, and installed in a different locations. The easiest solution was to make a platform-specific file named "Makefile.mac" which contains the following
# Makefile.mac # Compile openCHORD for a Mac INCS = -I/opt/local/include/postgresql83/server LIBS = -L/opt/local/lib/postgresql83/ -lpq varbit.so: varbit.c $(CC) -c varbit.c $(INCS) $(CC) -bundle -flat_namespace -undefined suppress -o varbit.so varbit.o
Part of openCHORD fails to compiled because of a change between PostgreSQL 8.2 and 8.3. If you try to compile as-is then you'll see:
% make -f Makefile.mac cc -c varbit.c -I/opt/local/include/postgresql83/server varbit.c: In function 'bit_set': varbit.c:90: error: lvalue required as left operand of assignment make: *** [varbit.so] Error 1This is due to what looks like a change to make a macro look more like an l-value. Here's the basic diff:
% diff varbit.c varbit.c.orig 90c90 < SET_VARSIZE(result, rlen); --- > VARATT_SIZEP(result) = rlen;and here's the unified diff version of the same thing:
% diff -u varbit.c varbit.c.orig
--- varbit.c 2010-04-11 01:56:37.000000000 +0200
+++ varbit.c.orig 2010-04-11 01:55:50.000000000 +0200
@@ -87,7 +87,7 @@
/* create result bitstring and copy input bitstring to result */
int rlen = VARBITTOTALLEN(alen);
result = (VarBit *) palloc0(rlen);
- SET_VARSIZE(result, rlen);
+ VARATT_SIZEP(result) = rlen;
VARBITLEN(result) = alen;
unsigned char *ap = VARBITS(a);
unsigned char *rp = VARBITS(result);
With that in place,
% make -f Makefile.mac cc -c varbit.c -I/opt/local/include/postgresql83/server cc -bundle -flat_namespace -undefined suppress -o varbit.so varbit.o
This shared library must be added to the PostgreSQL package library. Use the helper utility "pg_config" to get that location and install the varbit.so file:
% pg_config --pkglibdir /opt/local/lib/postgresql83 % sudo cp varbit.so `pg_config --pkglibdir`
Start the database server and verify the installation
Now you can start the database. I'll start it in interactive mode so I can see the log messages sent to the terminal, and switch to another terminal window to keep on working
% postgres -D /Users/dalke/teaching/pgThere's also a way to set it up in the background using "pg_ctl" and a way to have the server start automatically when the computer starts up, via launchctl. I won't go into them though.
How can you verify that it installed correctly? If you look in varbit.c you'll see it defined the functions
Datum nbits_set(PG_FUNCTION_ARGS); Datum isbit_set(PG_FUNCTION_ARGS); Datum bit_set(PG_FUNCTION_ARGS); Datum bit_contains(PG_FUNCTION_ARGS);I should be able to call them.
I'll start the interactive PostgreSQL client. The default database name is my username but that database doesn't exist, so I'll tell it to use the "postgres" database. (You can get a list of all databases with "psql -l".)
% psql postgres
Welcome to psql 8.3.8, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
postgres=#
I have to tell PostgreSQL to create
a new function based on the extension module. The actual SQL text is:
CREATE or REPLACE FUNCTION nbits_set(bit) RETURNS integer AS 'varbit', 'nbits_set' LANGUAGE c IMMUTABLE STRICT;which I copied from 'varbits.sql'. I'll do that at the client prompt:
postgres=# CREATE or REPLACE FUNCTION nbits_set(bit) postgres-# RETURNS integer AS 'varbit', 'nbits_set' postgres-# LANGUAGE c IMMUTABLE STRICT; CREATE FUNCTION postgres=#then test it out. The b'1001' uses the special PostgreSQL extension for making a bit array.
postgres-# select nbits_set(b'1001');
nbits_set
-----------
2
(1 row)
openchord=# select nbits_set(b'001');
nbits_set
-----------
1
(1 row)
Go ahead and try other inputs to check that it works.
Now that I know it works, I'll remove the function since I don't actually need it in the "postgres" database. I'll drop it then prove that it's no longer present:
postgres=# DROP FUNCTION nbits_set(bit);
DROP FUNCTION
postgres=# select nbits_set(b'1001');
ERROR: function nbits_set(bit) does not exist
LINE 1: select nbits_set(b'1001');
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
Run the SQL setup scripts
The next step is to run the various SQL scripts which install the extension function and set up some auxillary tables. These are in the script "makedb" but there were some problems with the script. For one, it assumed the "postgres" username exists, which isn't true for my install. I've tweaked it a bit to make it work for me:
# makedb (modified from TJ's original version) owner=$1 if [ "$owner" == "" ] ; then echo "No owner (user name) specified" echo "usage: makedb $owner" exit fi sudo -u $owner createdb openchord sudo -u $owner createlang plpythonu openchord sudo -u $owner psql openchord <varbit.sql sudo -u $owner psql openchord <openbabel-core.sql sudo -u $owner psql openchord <bits.sql sudo -u $owner psql openchord <public166keys.sql sudo -u $owner psql openchord <similarity.sql sudo -u $owner psql openchord <amw.sql sudo -u $owner psql openchord <tpsa.sql
The individual SQL scripts do the following:
- varbit.sql - create SQL functions based on the functions defined in varbits.so
- openbabel-core.sql - define SQL functions based on OpenBabel, called via Python
- bits.sql - additional functions for working with bitsets
- public166keys.sql - creates and populates the "public166keys" table with SMARTS patterns for the 166 bit MACCS keys, and the corresponding fingerprint generation function
- similarity.sql - creates several bitset similarity functions
- amw.sql - creates and populates the "amw" table with patterns for calculating molecular weight, and the corresponding MW calculation function
- tpsa.sql - creates and populates the "tpsa" table with patterns for calculating molecular polar surface area, and the corresponding calculation function
Here I'll run the script, which needs an account name with sufficient privileges:
% bash makedb dalke CREATE FUNCTION COMMENT CREATE FUNCTION COMMENT ... CREATE FUNCTION COMMENT
Did it work? I'll test to see if the nbits_set function was added to the newly created "openchord" database:
% psql openchord
Welcome to psql 8.3.8, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
openchord=# select nbits_set(b'1000');
nbits_set
-----------
1
(1 row)
Success!
What doesn't yet work is the code which calls out to OpenBabel via Python.
openchord=# select amw('c1ccccc1O');
ERROR: plpython: function "count_matches" failed
DETAIL: <type 'exceptions.ImportError'>: No module named openchord
CONTEXT: SQL function "amw" statement 1
To fix that requires the final step.
Install openchord.py
The file "openchord.py" contains a set of support functions used by the functions created by openbabel-core.sql, most notably parsing SMILES and SMARTS strings and caching the resulting molecule and pattern objects. It must be installed on the PYTHONPATH for the PostgreSQL extensions to be able to use it.
It is possible to install the file by hand, but I would rather let Python do it for me. I created the following "setup.py"
# setup.py
from distutils.core import setup
setup(name="openchord",
version='1.0',
py_modules=['openchord'],
)
then did the standard
% sudo python setup.py install
The "import openchord" failure from the previous section probably means the Python runtime in the database server doesn't know that the extension can be loaded available now, so I killed the database server and restarted it.
Does it work?
% psql openchord
...
openchord=# select amw('c1ccccc1O');
amw
-------
94.12
(1 row)
Yes, it does! I now have openCHORD installed.
Importing the NCI data set
openCHORD comes with a script named "makenci" which imports the NCI data set (NCI-Open_09-03.sdf.gz) into the openchord database.
The script doesn't work on Macs because it assumes 'zcat' can handle .gz extensions. I also didn't like how it gets the database username from the variable USERNAME instead of the command-line. Here are the tweaks:
% diff makenci.orig makenci 2c2 < owner=$USERNAME --- > #owner=$USERNAME 11c11 < time zcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord --- > time gzcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchordor in unified format:
--- makenci.orig 2010-04-11 03:16:26.000000000 +0200 +++ makenci 2010-04-11 03:16:44.000000000 +0200 @@ -1,5 +1,5 @@ owner=$1 -owner=$USERNAME +#owner=$USERNAME if [ "$owner" == "" ] ; then echo "no owner (user name) specified" echo "usage: makenci owner" @@ -8,7 +8,7 @@ # iconv -c removes any chars that cannot be represented as utf-8, which postgres uses # see iconv --help to convert chars in another encoding -time zcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord +time gzcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord # you may wish to use some of the following SQL commmands on the new schema and tables #Grant Usage On Schema nci To Public;
If you look at the script it's mostly comments, containing suggestions of how to work with the data. The core code is this single line:
time gzcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchordwhich takes the NCI data set, removes non-ASCII characters (with iconv), then calls "sdfloader" to parse an SD file and generate SQL commands for bulk data loading, which it passes to the psql client for transfer to the database server.
The documentation warns that the data load takes an hour. It took my laptop a bit longer than that. Here's what it looked like in operation:
% bash ./makenci dalke
CREATE SCHEMA
GRANT
CREATE SEQUENCE
CREATE TABLE
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "structure_pkey" for table "structure"
CREATE TABLE
CREATE TABLE
GRANT
GRANT
GRANT
{several minutes with no output}
SET
{much longer period with no output (about 50 minutes)}
INSERT 0 260071
{more waiting}
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
real 68m43.143s
user 2m0.958s
sys 0m10.789s
Tracking down an OpenBabel problem. Not yet successful.
That was unexpected. The database server crashed. The database log says:
LOG: server process (PID 58767) was terminated by signal 11: Segmentation fault LOG: terminating any other active server processes LOG: all server processes terminated; reinitializing LOG: database system was interrupted; last known up at 2010-04-11 16:19:14 CEST LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/4A00E6B8 LOG: unexpected pageaddr 0/45BCA000 in log file 0, segment 76, offset 12361728 LOG: redo done at 0/4CBC9868 LOG: database system is ready to accept connections LOG: autovacuum launcher started
Based on the outputs, this happened either during canonical SMILES generation or during fingerprint generation. Databases are designed to keep the data even in the face of crashes, so figuring out where it crashed shouldn't be too hard. The relevant code from sdfloader is:
Insert Into $schema.structure (id, name, isosmiles, coords, atoms) Select id, (molfile_mol(molfile)).* from $schema.sdf; Update $schema.structure Set cansmiles=cansmiles(isosmiles) Where valid(isosmiles); Update $schema.structure Set fp=fp(cansmiles) Where valid(cansmiles);I'll ask how many records there are, how many have isosmiles set, how many have cansmiles set, and how many have their fingerprints set:
openchord=# select count(*) from nci.structure;
count
--------
260071
(1 row)
openchord=# select count(*) from nci.structure where nci.structure.isosmiles is NULL;
count
-------
0
(1 row)
openchord=# select count(*) from nci.structure where nci.structure.cansmiles is NULL;
count
--------
260071
(1 row)
openchord=# select count(*) from nci.structure where nci.structure.fp is NULL;
count
--------
260071
(1 row)
Huh. It looks like the problem was with canonicalization. Okay, I'll walk though that manually:
openchord=# select count(*) from nci.structure where openbabel.valid(isosmiles); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed.
Wow! Some sort of bug in openbabel.valid()? Investigating ....
I've been working with TJ on trying to track this down. It only happens on my Mac, which is running OS X 10.6 and OpenBabel 2.2.3. I have a reproducible which only uses OpenBabel
import sys
import gzip
import openbabel
f = gzip.open("NCI-Open_09-03.sdf.gz")
f = iter(enumerate(f))
GD = {}
GD['mol'] = dict()
GD['obc'] = openbabel.OBConversion()
GD['obc'].SetInFormat("smi")
GD['nmol'] = 0
GD['maxsmi'] = 10000
n = 0
for lineno, line in f:
if lineno % 10000 == 0:
sys.stdout.write("\r %d / %d" % (n, lineno))
sys.stdout.flush()
if line.startswith("> <E_SMILES>"):
lineno, line = next(f)
smiles = line.strip()
#mol = openchord.parse_smi(GD, smiles)
if GD['nmol'] < GD['maxsmi']:
mol = openbabel.OBMol()
GD['nmol'] += 1
#plpy.notice('new mol for %s' % smiles)
else:
key,mol = GD['mol'].popitem()
#plpy.notice('mol reuse %s for %s' % (key,smiles))
if GD['obc'].ReadString(mol, smiles):
GD['mol'][smiles] = mol
# return copy is slower, but safer?
# return openbabel.OBMol(mol)
n += 1
The errors I get are:
% python tj.py 226288 / 46080000Segmentation faultmeaning it died here after reading over 226,282 SMILES and over 460,000,000 lines into the input file. This takes a long time to fail and has been very irritating to try to track down.
The stack trace one time (through Apple's crash reporter) looked like:
Thread 0 Crashed: Dispatch queue: com.apple.main-thread 0 libstdc++.6.dylib 0x00007fff861e47cd __dynamic_cast + 89 1 smilesformat.so 0x0000000101a0fd8e OpenBabel::SMIBaseFormat::ReadMolecule(OpenBabel::OBBase*, OpenBabel::OBConversion*) + 78 (base.h:254) 2 libopenbabel.3.dylib 0x00000001013dfe2c OpenBabel::OBConversion::Read(OpenBabel::OBBase*, std::istream*) + 220 (obconversion.cpp:745) 3 libopenbabel.3.dylib 0x00000001013e3f4c OpenBabel::OBConversion::ReadString(OpenBabel::OBBase*, std::string) + 508 (obconversion.cpp:893) 4 _openbabel.so 0x0000000101135d6c _wrap_OBConversion_ReadString + 881and another time (through gdb) was
#0 OpenBabel::OBMol::DestroyBond (this=<value temporarily unavailable, due to optimizations>, bond=0x10541efa0) at mol.cpp:1471 #1 0x0000000101a0fd9f in OpenBabel::OBConversion::GetInStream () at /Users/dalke/ftps/openbabel-2.2.3/include/openbabel/obconversion.h:256 #2 0x0000000101a0fd9f in OpenBabel::SMIBaseFormat::ReadMolecule (this=0x10541efa0, pOb=0x102c83990, pConv=0x10027cc10) at base.h:854 #3 0x00000001013dfe2c in OpenBabel::OBConversion::Read (this=0x10027cc10, pOb=0x102c81140, pin=<value temporarily unavailable, due to optimizations>) at obconversion.cpp:745 #4 0x00000001013e3f4c in OpenBabel::OBConversion::ReadString (this=0x10027cc10, pOb=0x102c81140, input=<value temporarily unavailable, due to optimizations>) at obconversion.cpp:893 warning: .o file "/Users/dalke/ftps/openbabel-2.2.3/scripts/python/build/temp.macosx-10.6-universal-2.6/openbabel_python.o" more recent than executable timestamp in "/Library/Python/2.6/site-packages/_openbabel.so" warning: Couldn't open object file '/Users/dalke/ftps/openbabel-2.2.3/scripts/python/build/temp.macosx-10.6-universal-2.6/openbabel_python.o' #5 0x0000000101135d6c in _wrap_OBConversion_ReadString ()(Both times truncating the stack once it got into the Python run-time.)
I'm now running it with memory checks enabled (valgrind does not work on Mac OS X 10.6). That's very slow so I'll let it run overnight. I'll also see if building OpenBabel from version control fixed any problems, and email the OpenBabel list to see if anyone there knows what's going on.
GothPyCon 2010 #
Welcome to the first ever GothPyCon! This is a local conference for
those on the west coast of Sweden conference who are interested in the
Python programming language. We're very open minded, so feel free to
come by if you live anywhere in Sweden, Norway, Denmark ... or
Australia for that matter.
The conference will be on Saturday 29 May in Göteborg from 9.00 to 17.00. We're still searching for a venue, so keep your eye on http://www.meetup.com/GothPy/ for updates when we find one. Afterwards we'll head to a nearby bar or restaurant and talk more about what's going on with Python and programming in general.
If you're planning to come, please sign up to the meeting on the GothPy page so we can plan for meals and other arrangements. The invitation code for the meetup group is "significant_whitespace".
The conference will have a single track with presentations (20-45 minutes each) and 5 minute lightning talks, followed by breakout groups in the afternoon.
Presentations are meant to be informative and hopefully enlightening to the rest of the audience, which will tend to be practitioner oriented and with a range of expertise. Lighting talks are where you inspire people with your project, or let them know about something else that's cool in Python and other software. The breakout groups are meant to give people hands-on experience, for example a code kata, trying out IronPython, or helping others learn PyQt.
Interested in giving a presentation? Send your proposal (max 3 paragraphs) as a plain text email to gothpycon@dalkescientific.com. We'll select the presentations by 15 May and let you know if yours has been chosen.
Interested in giving a lightning talk? Sign up at the conference.
Want to organize a breakout group? Plan for a 1-2 hour session and let us know at gothpycon@dalkescientific.com what you want to do so we can help make it happen.
There will be a nominal fee of about 150 Swedish kronor to cover the room hire, lunch, and coffee/fika. We'll let you know once we figure out the cost, and you can pay at the door when you come. Any left-over funds will go to cover fika for the GothPy monthly meetings. If you or your company would like to sponsor the conference, and possibly cover some of these costs, let us know!
All announcements will be posted to the GothPy Meetup page at http://www.meetup.com/GothPy/ so signup there if you want to receive them.
If you have any questions, comments, or sponsorship offers, email them to us at gothpycon@dalkescientific.com.
Your organizers, Emily Bache and Andrew Dalke
Upcoming cheminformatics programming training courses #
I have scheduled a training session for cheminformatics web development in Leipzig, Germany on 18-20 May, I'm arranging a training course in Boston at the end of July, and tentatively planning a course for the Research Triangle Park area of North Carolina in early August. And as a bonus, I'll be in the San Francisco Bay Area in mid-June and can arrange a course there for those interested.
Cheminformatics Web Development course, Leipzig, 18-20 May
My next cheminformatics training course will be in Leipzig, Germany on 18-20 May. It will have the theme "cheminformatics web development." It's meant for computational chemists who want to set up an in-house cheminformatics server for specific analysis tasks. The topics I'll cover are:
- overview of Python and OEChem,
- Setting up a Django web server,
- Working with a SQL database,
- Structure depictions and entry,
- Fingerprint generation and searching,
- Plotting,
- Introduction to Javascript and JQuery.
While the topics do cover specific technologies like OEChem and Django, the concepts are easily transferable to other choices.
The price for the three day course will be €1,300 and is limited to 8 people. For more details see my training page and if you have any questions then email me directly at dalke@dalkescientific.com. A discount is available for a limited number of academic and non-profit participants.
Possible course in the SF Bay Area in mid-June
I will be in the San Francisco Bay Area in mid-June and will be available to do some training then. At this point I'm only making a call out to see if there's any interest, either in a public training course or something arranged on-site. Are you interested? Contact me.
Boston, late July
I will be doing some training in Boston in late July and have two people already signed up. The final dates will be determined within a few weeks. This will be a public course focused more on scripting development with Python and OEChem. I will also be available for in-house training around that time if there's something specific you want for your company.
Email me if you are interested or would like to know when there's a finalized date for the public training session. Note that it will be open to at most 10 people.
Research Triangle Park, North Carolina, early August
I have been talking to a couple of people in the RTP area about doing training there. This is still in the early planning stages. If you are in the area and would like to attend then let me know as that would help a lot in making this plan a reality.
KNIME and beginners #
I gave a presentation at OpenEye's CUP last week. More precisely, I was assigned a talk with the title "Evils of KNIME." I don't chose that sort of name, but the CUP organizers like to be a bit confrontational with presentation titles. I used my speaking slot as a platform for expressing my views on dataflow/visual languages. I don't like them, and think their effectivity is limited compared to a text language, so I explained why. Other people do like them and enjoy them. I've asked them why, and they have some good reasons. My presentation outlined those responses with some observations of my own, including suggestions for ways to improve the text-based toolkits so they are more accessible to "non-programmers."
The next few posts will be based on parts of that talk. Feel free to leave comments.
Upcoming training classes (pre-announcement)
I ended by pointing out that these are technological solutions. Why not spend some time training computational chemists to be more effective at writing software? I provide that sort of training. If you are interested, email me. I'm pinning down the dates for a course in Leipzig in mid-May (likely 18-20 May), and another in Boston in late July. I'll announce them when the dates are determined. if you want to influence those dates or schedule a course at your site, let me know.
Sample test case for KNIME
I haven't used KNIME for about two years. That experience was with KNIME 1.x. People told me that it's gotten better, so I decided it was well time to take a fresh look. Last time I couldn't get it to work on my Mac. I'm happy to report that things have changed, although there are still some difficulties with it regarding updates.
My test case was the first example from the Chemistry Toolkit Rosetta, specifically, to compute the heavy atom counts from an SD file. The pybel solution is:
import pybel
for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
print mol.OBMol.NumHvyAtoms()
It's not as short as I would like because I had to specify "sdf" twice
and because it had to reach down into the underlying OpenBabel
molecule object. Still, it's a lot more succint than using any of the
base toolkits directly, and a good reference of what a text-based
programming language is capable of when designed for ease of use.
What molecular properties can I compute? And how do I do it?
The first step was to find out if KNIME could compute the number of heavy atoms. When I say "KNIME" I mean "the CDK nodes which come with KNIME" since KNIME is a dataflow-based visual programming language with support for a number of extension packages, including chemistry nodes based on the CDK. Schrodinger, Tripos, ChemAxon and likely other companies provide nodes based on their respective toolkits, but I don't have a license to those tools. In any case the Mac version of KNIME doesn't yet support adding new nodes.
The most likely candidate was "Molecular Properties." The help says:
Create new columns holding molecular properties, computed for each structure. The computations are based on the CDK toolkit and include logP, molecular weight, number of aromatic bonds, and many others.What other properties does it compute? I put the node on the workspace and double clicked on it to bring up the dialog box. The result is:
The dialog cannot be opened for the following reason:
No column in spec compatible to "CDKValue".

A Google search for that error message found the same question from 9 September 2009 although concerning a different node. Bernd Wiswedel answered:
We obviously need to improve on the error messages. You need to process the output of the SD reader with the "Molecule to CDK" node, which will parse the structures into an appropriate format for the Lipinski node. Reason is that the Lipinski node is contributed from the CDK plugin, so it needs its desired input format.What this means is the inputs need to be set up correctly before I can see more details. However, it's more complicated then that. If I set up the nodes as shown:

The dialog cannot be opened for the following reason:Turns out I need to put in a valid SD filename in the "SDF Reader" box (the one with the exclaimation point under it), in order to get the right inputs to "Molcule to CDK", in order to see the "Molecular Properties."
No column in spec compatible to "SdfValue" "SmilesValue" "MolValue" "Mol2Value" or "CMLValue".
How accessible is KNIME to first-time users?
Is that really friendly for first-time users? That is, how is a first-time user supposed to: 1) know which options are available if they can't open an unconnected node, 2) know which inputs are required for a node, or for that matter see what outputs are available, 3) know that the "SDF Reader" needs to be converted from "Molecule to CDK" before it can be used by the CDK nodes?
Of course all those can be explained in the documentation, and perhaps they are explained. I admit I haven't read it, but then again the knime.org documentation doesn't show how to use the CDK nodes. And should someone have to read the documentation in order to do something basic like this task? If so, are dataflow systems really any easier than working with a text-based programming language?
Can't compute the number of heavy atoms?
I looked through the list of properties which could be computed:
- Atomic Polarizabilities
- Aromatic Atoms Count
- Aromatic Bonds Count
- Element Count
- Bond Polarizabilities
- Bond Count
- Carbon connectivity index (order 1)
- Carbon connectivity index (order 0)
- Eccentric Connectivity Index
- Fragment Complexity
- Hydrogen Bond Acceptors
- Hydrogen Bond Donors
- Largest Chain
- Largest Pi Chain
- Petitjean Number
- Rotatable Bonds Count
- Topological Polar Surface Area
- Vertex adjacency information magnitude
- Molecular Weight
- Zagreb Index
No "heavy atom count." Next option is to see if there's a way to specify the counts based on a SMARTS pattern. Nope, didn't find anything.
As far as I can tell, there's no way with the default nodes to do much of anything with KNIME. I assume there are additional packages which I can install, but why aren't there more useful CDK nodes as part of the standard installation? An obvious one to me would be a SMARTS count pattern matcher, where I could specify the SMARTS pattern, the option for unique or non-unique matche counts, and the output column name.
Is my problem because I'm on a Mac? Do Linux users get more nodes? Or is there something else I'm missing? How would you find the number of heavy atoms using KNIME? Is there a solution using the default CDK nodes or do I have to use one of the commercial toolkits?
Leave answers and comments here.
Instrumenting the AST #
The following is a rough retelling of my presentation for the Testing in Python BoF at PyCon 2010, including removing some in-jokes relevant only for that session. On the other hand, I expanded it to include working code. This means you, yes you, could work on this. It's not for the faint of heart. Have fun! And let me know what you think.
The AST module
Code coverage is a good thing. I want to do branch coverage. Last year Ned Batchelder added branch coverage support to coverage.py, which works by analyzing the byte code. I want to see if there's a better solution through an entirely different approach.
Let me introduce you to Python's "ast" module.
>>> import astIt's an interface to Python's internal Python parser so program can convert string containing Python code into an abstract syntax tree (AST).
>>> ast.parse("for i in range(10): print i")
<_ast.Module object at 0x1004d06d0>
>>>
The ast module contains some code to display the contents of the AST
as a string.
>>> ast.dump(ast.parse("for i in range(10): print i"))
"Module(body=[For(target=Name(id='i', ctx=Store()),
iter=Call(func=Name(id='range', ctx=Load()),
args=[Num(n=10)], keywords=[], starargs=None,
kwargs=None), body=[Print(dest=None,
values=[Name(id='i', ctx=Load())], nl=True)],
orelse=[])])"
I've reformatted it to fit my slides as otherwise it's a long
string. I can also ask it to display the position information, which
is the second True in the following.
>>> ast.dump(ast.parse("for i in range(10): print i"), True, True)
"Module(body=[For(target=Name(id='i', ctx=Store(),
lineno=1, col_offset=4), iter=Call(func=Name(
id='range', ctx=Load(), lineno=1, col_offset=9),
args=[Num(n=10, lineno=1, col_offset=15)],
keywords=[], starargs=None, kwargs=None,
lineno=1, col_offset=9), body=[Print(dest=None,
values=[Name(id='i', ctx=Load(), lineno=1, col_offset=26)],
nl=True, lineno=1, col_offset=20)], orelse=[], lineno=1,
col_offset=0)])"
Programmatically building an AST
You can use the ast module to build a tree directly, without parsing a string, then compile and execute that code.
>>> from ast import *
>>> tree = Module([Print(None, [Str("PyCon2010!")], True)])
>>> tree.lineno = 1
>>> tree.col_offset = 1
>>> fix_missing_locations(tree)
<_ast.Module object at 0x1004dff50>
>>> tree = fix_missing_locations(tree)
>>> compile(tree, "<TiP>", "exec")
<code object <module> at 0x1004d38a0, file "<TiP>", line 1>
>>> exec compile(tree, "<TiP>", "exec")
PyCon2010!
>>>
"ast.fix_missing_locations" is a helper function to assign missing
position information the compiler needs when generating byte code. I
end up using it and "ast.copy_location" a lot, which copies the
location information from one node to another.
The mystery of the wrong TypeError
What's the bug with the following?
try:
raise TypeError("blah: %d" % "I said 'PyCon2010'!")
except TypeError:
pass
The code correctly raises a TypeError, but it's the wrong
TypeError. I've made this mistake a few times, which is why I try to
remember to check the contents of the exception during my tests.
Notice that unittest doesn't help here, since assertRaises only checks
the exception type, and not the content.
It is possible to check all of these manually. You could defer the calculation to a "check_mod()" function
try:
raise TypeError(
check_mod("blah: %d", "I said 'PyCon2010'!"))
except TypeError:
pass
A check_mod function might look like
def check_mod(left, right):
try:
return left % right
except Exception, err:
print "Could not interpolate: %s" % (err,)
traceback.print_stack()
raise
Rewriting the AST for fun (and profit?)
The ast module has some support code for creating a new parse tree based on transforming another parse tree. I can transform all "%" binary operations to call a function for the left and right sides. Here's a non-working but mostly complete version of how that might look.
from ast import *
class RewriteInterpolation(NodeTransformer):
def visit_BinOp(self, node):
if isinstance(node.op, Mod):
new_node = Call(func=Name(id='check_string', ctx=Load()),
args=[node.left, node.right,
Num(n=node.lineno),
Num(n=node.col_offset)],
keywords = [], starargs=None, kwargs=None
)
copy_location(new_node, node)
fix_missing_locations(new_node)
return new_node
return node
code = open(filename).read()
tree = parse(code, filename)
tree = RewriteInterpolation().visit(tree)
What's missing is the code to define or import check_string. I'll
leave that for later. For now, just get the idea that you can parse
Python code to an AST, rewrite it in order to instrument certain
parts, then execute the result.
I ran something similar to this against the Python standard library and tests, in the hopes that I could find an bug. It took a lot of hands on fiddling, since some essential Python modules cannot be instrumented because the full path wasn't fully defined. I got it to work, and found no bugs. The closest was this code from difflib.py
try:
linenum = '%d' % linenum
id = ' id="%s%s"' % (self._prefix[side],linenum)
except TypeError:
# handle blank lines where linenum is '>' or ''
id = ''
where the comment is needed because otherwise the reason isn't
immediately obvious. While not a bug, it perhaps does show you that
the test revealed something. (Oh, and it also showed the several
hundred tests that the standard library does to test string
interpolation failures.)
(Well-known) Limitations in coverage.py
Take a look at this program. I've used coverage.py to run the program and annotate the code to display the coverage and highlight the lines which weren't executed.
You can see there are a other few problems which coverage did not test. Line 9 never executes "x=9" and the "raise TypeError" in line 17 is never reached, because of the string interpolation error in the parameter list. I've hacked together something over the last 30 hours to show that something better is possible.
A different approach
I want to generate coverage for this statement:x = 1I'll do that by parsing the code into the AST then rewriting the AST so it's equivalent to:
from ast_report import register_module
ast_enter, ast_leave, ast_reached = \
register_module('spam/testing.py',
{0: (1, 0)}, {} ) # unique identifer -> (lineno, col_offset)
if 1:
ast_enter[0] += 1
x = 3
ast_leave[0] += 1
The "register_module" function is something I'll write in a bit. It will take a filename and two location dictionaries. The first dictionary is for statements like assignment which are supposed to go to completion. That is, code before it and after it are supposed to execute. (Compare this to 'return', which will never allow code after it to run. That's what the second dictionary is used for.)
The key is a unique identifier associated with each statement with a coverage test, and the value is the lineno and col_offset pair which come from the AST.
The ast_enter and ast_leave dictionaries here are default dicts (though that's an implementation point). The "0" is same unique identifier in the location dictionary, and can be used to say that the statment at line 1, column 1 (col_offset starts with 0), was reached and left.
At this point someone in the audience astutely asked why I used an "if 1:" in the above. That's a limitation of how the ast.NodeTransformer works. It lets derived classes tranform a single term to a single other term, which means I need to transform a single statement (the assignment here) into a single other statement, and not three statements. I chose the "if 1:" because it's easy to write, it can contain an arbitrary number of sub-statements, and because Python's byte compiler knows how to optimize away the "if 1:" test.
If you think about this approach, the run-time overhead is pretty low, but it's a lot more than simple assignment. I don't know how it affects real-world code. Remember, it's been 30 hours since I started this, and I'm at conference as well.
Instrumenting the AST for code coverage
The next step is to automate all of this: convert a .py file into an AST, instrument the code to add these coverage checks, implement the reporting mechanism as an atexit hook, and for good measure, add the "%" TypeError check. To see if this is effective, convert the AST to byte code and save it to a .pyc file.
This calls for a horrible hack around a call to compileall.py. I've named the result "ast_compileall.py"
# ast_compileall.py
import __builtin__
from ast import *
import compileall
import sys
import traceback
import itertools
class RewriteInterpolation(NodeTransformer):
def __init__(self, filename):
self.filename = filename
self.enter_linenos = {} # id -> (lineno, col_offset)
self.reach_linenos = {} # id -> (lineno, col_offset)
self.counter = itertools.count()
def visit_Module(self, module_node):
# Need to import and call ast_report.register_module().
# These must occur after the "from __future__ import ..." statements.
# Find where I can insert them.
body_future = []
body_rest = []
for node in module_node.body:
node = self.visit(node)
if (not body_rest and isinstance(node, ImportFrom) and
node.module == "__future__"):
body_future.append(node)
else:
body_rest.append(node)
# It's easier to let Python convert the code to an AST
import_line = parse("from ast_report import register_module, check_string").body[0]
print ("ast_enter, ast_leave, ast_reached = register_module(%r, %r, %r)" %
(self.filename, self.enter_linenos, self.reach_linenos))
register_line = parse(
"ast_enter, ast_leave, ast_reached = register_module(%r, %r, %r)" %
(self.filename, self.enter_linenos, self.reach_linenos)).body[0]
# Assign a reasonable seeming line number.
lineno = 1
if body_future:
lineno = body_future[0].lineno
for new_node in (import_line, register_line):
new_node.col_offset = 1
new_node.lineno = lineno
new_body = body_future + [import_line, register_line] + body_rest
return Module(body=new_body)
# These are statements which should have an enter and leave
# (In retrospect, this isn't always true, eg, for 'if')
def track_enter_leave_lineno(self, node):
node = self.generic_visit(node)
id = next(self.counter)
enter = parse("ast_enter[%d] += 1" % id).body[0]
leave = parse("ast_leave[%d] += 1" % id).body[0]
self.enter_linenos[id] = (node.lineno, node.col_offset)
for new_node in (enter, leave):
copy_location(new_node, node)
# This is the code for "if 1: ..."
n = Num(n=1)
copy_location(n, node)
if_node = If(test=n, body=[enter, node, leave], orelse=[])
copy_location(if_node, node)
return if_node
visit_FunctionDef = track_enter_leave_lineno
visit_ClassDef = track_enter_leave_lineno
visit_Assign = track_enter_leave_lineno
visit_AugAssign = track_enter_leave_lineno
visit_Delete = track_enter_leave_lineno
visit_Print = track_enter_leave_lineno
visit_For = track_enter_leave_lineno
visit_While = track_enter_leave_lineno
visit_If = track_enter_leave_lineno
visit_With = track_enter_leave_lineno
visit_TryExcept = track_enter_leave_lineno
visit_TryFinally = track_enter_leave_lineno
visit_Assert = track_enter_leave_lineno
visit_Import = track_enter_leave_lineno
visit_ImportFrom = track_enter_leave_lineno
visit_Exec = track_enter_leave_lineno
#Global
visit_Expr = track_enter_leave_lineno
visit_Pass = track_enter_leave_lineno
# These statements can be reached, but they change
# control flow and are never exited.
def track_reached_lineno(self, node):
node = self.generic_visit(node)
id = next(self.counter)
reach = parse("ast_reached[%d] += 1" % id).body[0]
self.reach_linenos[id] = (node.lineno, node.col_offset)
copy_location(reach, node)
n = Num(n=1)
copy_location(n, node)
if_node = If(test=n, body=[reach, node], orelse=[])
copy_location(if_node, node)
return if_node
visit_Return = track_reached_lineno
visit_Raise = track_reached_lineno
visit_Break = track_reached_lineno
visit_Continue = track_reached_lineno
# Some code to instrument the run-time and check for '%' failures.
def visit_BinOp(self, node):
if isinstance(node.op, Mod):
new_node = Call(func=Name(id='check_string', ctx=Load()),
args=[node.left, node.right,
Num(n=node.lineno),
Num(n=node.col_offset)],
keywords = [], starargs=None, kwargs=None
)
copy_location(new_node, node)
fix_missing_locations(new_node)
return new_node
return node
old_compile = __builtin__.compile
def compile(source, filename, mode, flags=0): # skipping a few parameters
# My rewrite code uses ast.parse, which ends up calling this
# function with this argument, so pass it back to the real compile.
if flags == PyCF_ONLY_AST:
return old_compile(source, filename, mode, flags)
assert mode == "exec"
#traceback.print_stack()
code = open(filename).read()
tree = parse(code, filename)
tree = RewriteInterpolation(filename).visit(tree)
code = old_compile(tree, filename, "exec")
return code
# Ugly hack so I can force compileall to use my compile function.
__builtin__.compile = compile
exit_status = int(not compileall.main())
sys.exit(exit_status)
I placed this file in "spam/testing.py"
def main():
def f(x):
if x > 0:
return x*x
1/0
for i in range(4, 9):
if f(i) < 0: x=9
if i == 8:
continue
print "Here"
if i == 10:
continue
try:
raise TypeError("Hi! %d" % "sdfa")
except TypeError:
pass
main()
I then compiled all of the .py files in the 'spam' directory with
python ast_compileall.py spamand I made sure the following was on my PYTHONPATH as "ast_report.py"
# ast_report.py
from collections import defaultdict
import traceback
import atexit
import linecache
loaded_modules = []
class FileInfo(object):
def __init__(self, filename, enter_linenos, reach_linenos):
self.filename = filename
self.enter_linenos = enter_linenos
self.reach_linenos = reach_linenos
self.ast_enter = defaultdict(int)
self.ast_leave = defaultdict(int)
self.ast_reach = defaultdict(int)
def register_module(filename, enter_linenos, reach_linenos):
#print filename, enter_linenos, reach_linenos
info = FileInfo(filename, enter_linenos, reach_linenos)
loaded_modules.append(info)
return info.ast_enter, info.ast_leave, info.ast_reach
def check_string(left, right, lineno, col_offset):
if not isinstance(left, basestring):
return left % right
try:
return left % right
except Exception, err:
print "Could not interpolate: %s" % (err,)
traceback.print_stack()
raise
# Basic coverage report
def report_coverage():
for fileinfo in loaded_modules:
# This will contain a list of all results as a 3-ple of
# lineno, col_offset, "text message"
report = []
# These should have both 'enter' and 'leave' counts.
for id, (lineno, col_offset) in fileinfo.enter_linenos.items():
if id not in fileinfo.ast_enter:
report.append( (lineno, col_offset, "not entered") )
elif id not in fileinfo.ast_leave:
report.append( (lineno, col_offset, "enter %d but never left" %
fileinfo.ast_enter[id]) )
else:
delta = fileinfo.ast_leave[id] - fileinfo.ast_enter[id]
report.append( (lineno, col_offset, "enter %d leave %d (diff %d)" %
(fileinfo.ast_enter[id], fileinfo.ast_leave[id], delta)) )
# These only need to be 'reach'ed
for id, (lineno, col_offset) in fileinfo.reach_linenos.items():
if id not in fileinfo.ast_reach:
report.append( (lineno, col_offset, "not reached") )
else:
report.append( (lineno, col_offset, "reach %d" % (fileinfo.ast_reach[id],)) )
# sort by line number, breaking ties by column offset
report.sort()
print "Coverage results for file", fileinfo.filename
for lineno, col_offset, msg in report:
print "%d:%d %s" % (lineno, col_offset+1, msg)
print linecache.getline(fileinfo.filename, lineno).rstrip()
# Dump the coverage results when Python exist.
atexit.register(report_coverage)
(While I used an atexit hook here, I did that because it was the
fastest way to get to a proof-of-concept solution. Really I think this
should be more like how coverage.py works, with a command-line script
which sets up the run environment and reports the results at the end.)
Try it out!
This coverage code will only work on modules which were imported, where the .pyc file is used instead of the .py file. (But perhaps an import hook would be useful or at least interesting here?) What I do is import the module via the command-line
% cd spam/
% python -c 'import testing'
Could not interpolate: %d format: a number is required, not str
File "<string>", line 1, in <module>
File "spam/testing.py", line 21, in <module>
main()
File "spam/testing.py", line 785, in main
File "ast_report.py", line 30, in check_string
traceback.print_stack()
Coverage results for file spam/testing.py
1:1 enter 1 leave 1 (diff 0)
def main():
3:3 enter 1 leave 1 (diff 0)
def f(x):
4:5 enter 5 but never left
if x > 0:
5:7 reach 5
return x*x
6:5 not entered
1/0
8:3 enter 1 leave 1 (diff 0)
for i in range(4, 9):
9:5 enter 5 leave 5 (diff 0)
if f(i) < 0: x=9
9:18 not entered
if f(i) < 0: x=9
10:5 enter 5 leave 4 (diff -1)
if i == 8:
11:8 reach 1
continue
12:8 not entered
print "Here"
13:5 enter 4 leave 4 (diff 0)
if i == 10:
14:8 not reached
continue
16:3 enter 1 leave 1 (diff 0)
try:
17:7 reach 1
raise TypeError("Hi! %d" % "sdfa")
19:7 enter 1 leave 1 (diff 0)
pass
21:1 enter 1 leave 1 (diff 0)
main()
You can see that it reports the string interpolation without a problem, and if you look closely you'll see that it catches that the "if" on line 9 is executed while the "x=9" also on line 9 is never executed.
There's also some problems. Line 4 reports that the code was entered 5 times and never left, but that's a bit of a false positive since it left through a return statement. I think now, after additional thought, that the better solution is to put the "leave" test on the first line of each possible branch.
Pluses and minuses
There are some great advantages to this approach.
- I don't need to look at the stack frame to figure out where I am, or even use the sys.settrace() hook.
- I get coverage testing of every statement on a line.
- I can instrument a specific and limited set of Python files
- Full branch coverage is possible.
- I can add tests which are almost impossible to add otherwise (like "%d" % "asdf"; or what about checking if the RHS of an assert will actually work?)
- What about instrumenting all "d.keys()" calls in Python 2.x code to check and report if a dict keys() result is ever used as something other than the iterator, like it would be in Python 3.x?
There are some difficult problems as well. Consider:
x = arg or default_arg or die(_("missing arg"))
Branch reporting should say that 'arg' tested both True and False, that default_arg tested True and False and ... that the result of die() tested both True and False?
And just how should someone visualize all this extra data?
"See also" and ruminations
I talked with Ned some after my presentation. He pointed out that the complex part of coverage.py, which he's worked on a lot during the last year, is to make the system configurable so it can be told which coverage to ignore. I know what he means. In the late 1990s I added the "#pragma: no cover" option to the early form of coverage.py, which exists (although not my actual code) to this day.
If coverage works on a more fine-grained level, how do you suppress the false warnings so the true issues aren't hidden in the noise?
Ned also pointed out Matthew J. Desmarais' work with Canopy
instrument python code to generate robust coverage information. the goal is to provide modified condition/decision coverage metrics.I'm not the only one who has thought about instrumenting the AST, even in Python. (The Lisp community likely thought of this before I was born.) What I've hoped to do here is explain it well enough so that you can figure out how this approach works and come up with ways to extend it for the future.... or figure out why it fails.
If you are doing that, do bear in mind my python4ply package. It contains a full grammar definition for Python using PLY, with support for the decrepit AST from the compiler module. Potentially you could use it to have Python 3 generate an AST for Python 2, or even vice versa, with a lot more work.
Or, if you have both money and interest, perhaps you'll fund me? I am a consultant, after all. I mostly work in computational chemistry and my clients aren't interested in this sort of deep language analysis, so I only work on this during rare intervals. It's not only money, but access to people who want these sorts of capabilities and can give me feedback on what they want and how effective a solution is.
Or, if you want to work on it yourself - feel free! I hereby release all of this code to the public domain, and disavow any copyright interest in the code expressed in this article. You don't even have to mention my name. Just develop good testing tools.
I know there are a number of tools in the greater world of computing which can work on ASTs. I have no experience with them. Perhaps it's best to convert the Python AST to some other tree grammar where there is a tree manipulation language? When I'm feeling crazy I think "just convert the AST to XML then use XSLT to add the instrumentation, and convert the resulting XML back to an AST." How sane is that? And it would mean I would have to learn a lot more about XSLT. Or what about ANTLR's tree grammars? But then there's Manual Tree Walking Is Better Than Tree Grammars. It's a Brave New World.
Thanks!
I thank Armin Rigo, Brett Cannon, Grant Edwards, John Ehresman, Jeremy Hylton, Kurt Kaiser, Neal Norwitz, Neil Schemenauer, Nick Coghlan, Tim Peters, Martin von Löwis and everyone else who worked on the ast module. Without them this would be a much harder problem.
Any comments?
Leave them here.
New Cheminformatics Projects #
I've started two new open projects for cheminformatics and I'm looking for help in both of them.
Chemistry Toolkit Rosetta
The Chemistry Toolkit Rosetta (CTR) is a set of common cheminformatics tasks implemented using a variety of different toolkits and approaches. It is meant primarily as a way for people to understand and compare how the different APIs work.
Currently there are 16 tasks, 14 of which are well-defined and have at least one solution (in OpenEye/Python since that's what I know best). Several also have solutions in Pybel, and there are a couple RDKit and CDK solution as well.
Some of the CTR tasks are:
- Heavy atom counds from an SD file
- Working with SD tag data
- Find the 10 nearest neighbors in a data set
- Calculate TPSA
It needs your help. The project started in part because I don't know RDKit, CDK, or Indigo that well - to say nothing of the commercial tools available from Symyx, Accelrys, Schrodinger, and others. I know them a bit better now, but not enough.
Feel free to contribute a solution in your toolkit of choice! Or provide commentary, feedback, or improve an existing solution. You can even contribute a new task, if it's characteristic of a frequently encountered cheminformatics-related problem which several toolkits can handle.
By the way, I give a big thanks to Noel O'Boyle for his feedback on the project direction and for his Pybel and Cinfony contributions to help flesh out CTR before this public annoucement.
Chem Fingerprints
The other project I started is called "chem-fingerprints" or "chemfp" for short. Its goal is to develop a couple of file formats for cheminformatics fingerprints as well as tools and libraries which work with those formats.
The main problem it addresses is that there is no widely used fingerprint format, so each research group or even individual researcher ends up making a new one, as well as the tools to work with it. See the use cases for some more detailed examples.
So far I've written a proposal for a line-oriented text format called "FPS" meant to be easy to generate and parse, and have sketched out a inary format called FPB meant for fast loading, at the expense of some preprocessing.
The FPS format is simple enough that you can likely figure out most of it from this example, taken from the specification:
#FPS1 #num_bits=256 #software=RDKit/2009Q3_1 #params=RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=256 nBitsPerHash=4 useHs=True #source=/Users/dalke/databases/Compound_00000001_00025000.sdf.gz #date=2010-01-27T02:22:26 fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 1 fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 2 ffffbfdfffffffffbfeffffffffffffffffffffffff77efffffffebfffffffef 3 00c02010002610000080800041100002084000440d100000c055048801224400 4
I've developed a set of tools to generate FPS fingerprints from OpenEye, OEChem, and RDKit, as well as to extract fingerprints from SD tags; specifically the CACTVS substructure keys in PubChem. These are available from the Mercurial repository.
These tools are in development status, and are primarily meant at this time as a way to get concrete feedback for the specification.g
Other tools I would like to develop, perhaps with your help, are command-line programs for similarity search and substructure filters.
I'm also looking for input and feedback on the format definitions, and for people who want to add support for these formats in their tools.
If you are interested in chemfp, then sign up on the chemfp mailing list.
Project hosting options? #
I started a new project for cheminformatics fingerprints and want to make it available for general use. It contains software under the MIT license and specifications under a license as lenient as I can make it. (Likely CC-BY.)
I looked around for project hosting. My requirements are:
- Mailing list with only Mailman-style double opt-in. Specifically, I expect most people who subscribe will not want to have to set up an account on the project hosting provider before signing up for the mailing list.
- Mercurial support
- Simple web hosting, preferable a wiki where I can put up specifications and a few related documents and where others can do some edits
- Bug and issue tracking would be nice, but not essential since this is a small project and a TODO in version control should be fine.
The options
I know there's a bunch of resources these days, and in my searches I found Wikipedia's Comparison of open source software hosting facilities. As you can see, there are quite a few. Sort on version control systems and it's Alioth, Assembla, BerliOS, Bitbucket, CodePlex, GNU Savannah, Google Code, JavaForge, KnowledgeForge, Project Kenai, and SourceForge.
Must have mailing list and web page or wiki hosting
Next, filter out those which don't have mailing lists, which removes Assembla, Bitbucket, and JavaForge. It's a shame about losing BitBucket since that's what I would have liked. With reluctance I also dropped Google Code since its mailing lists require a Google account. I think that's too high of a barrier of entry. I also dropped GNU Savannah since it doesn't have web or wiki hosting.
What's left are: Alioth, BerliOS, CodePlex, KnowledgeForge, Project Kenai, and SourceForge.
I want to try something other than SourceForge or a clone
Of those I have only used SourceForge, and done that for over 10 years. It feels very clunky and cluttered compared to Google Code and downloading packages is a nuisance for people like me who would rather curl the files than use a browser. Perhaps it's time to try something different? That puts BerliOS out, since it's derived from the SourceForge code base, as is GNU Savannah, and so is Alioth through GForge.
What's left? CodePlex, KnowledgeForge, and Project Kenai.
Must support non-member access to a mailing list
I looked at CodePlex. I think you have to be a CodePlex member in order to leave dicussions, and it uses web-based forum software instead of email. That is, I selected some of the project which have been downloaded the most often but never could find a "subscribe to the mailing list" option. Perhaps most people in the Microsoft Windows and .Net space don't do email?
In any case, it doesn't seem to fit my requirements.
Remaining options: KnowledgeForget and Project Kenai
I looked at KnowledgeForge and while it seems to fit my requirements, there aren't many people using it, although others may be using the underlying KForge application to host their own system. My concern is that the rough edges wouldn't have been worn down by other users.
That left me with Project Kenai, which also seemed to do what I wanted, and it has more and larger development projects, including JRuby. Okay, I'll try it out.
Project Kenai
(Update based on feedback. As of 27 Jan 2010 (or about two days after I registered on Kenai, and two days before I posted this essay), Oracle, who owns Sun, said they would be "phasing out of the public-facing domain used for the Project Kenai Beta site." Therefore, you shouldn't use it.)
I requested a new project hosting and got it. I set up the project, working on code, and updated the wiki. Seems to be nice enough, with really no problems to speak about. I was happy enough.
I liked some of the tweaks, like how it uses AJAX to update the displayed content rather than doing a full page submission like when editing Wikipedia. Though now that I think of it, I adore how StackOverflow shows the formatted content while you type.
Show stopper - non-member access to the mailing list
Until I got to the email part. Turns out Project Kenai does allow non-members to join a list, but they have to email subscribe request to the Sympa email system. Very much like the old majordomo list manager, and with no web-based front-end to help out.
I found that be searching the help files. There's no clue that that's even possible from the normal "mailing lists" page for a project. But perhaps I could remedy that with instructions on how to sign up without being a member.
The only way I could do that was on the wiki home page. I did that then asked a friend of mine to try it out. He followed the main mailing-list link from Kenai and never saw my note. Once on that page he couldn't figure out how to join without being a member, and he doesn't want to do all that just to join a mailing list.
Once I pointed out the manual instructions, he tried that out. I got an email which said I need to manualy confirm him as a member. On his side he only saw that he was now a member, and didn't like the lack of the Mailman-style double opt-in. As far as he could tell, anyone could register anyone else through a forged email.
That's a serious down-check, since while technically it meets my requirements, it doesn't meet the spirit.
No response to a feature requst
I posted this request to the features list a couple of days ago and got no response.
I do realize this is a free project, so I can make no demands nor should I expect fast response. That's why i waited a couple of days before writing this posting. But a reason for trying Project Kenai was because its size should mean it has more of these kinks worked out, and its support by Sun should imply there's someone to answer mail.
Just choose SourceForge?
As for the project, my conculsion is to just go ahead and use SourceForge. It's clumsy but I know it handles my needs.
Unless you have a better suggestion? Perhaps you think I should try BerliOS?
Cheminformatics, bioinformatics, and system biology positions available #
A couple of people I know have computational chemsitry and biology positions available, which might interest some of my readers.
University College Cork, Ireland
Noel O'Boyle (author of TwirlyMol, and Cinfony, contributor to OpenBabel and BlueObelisk, researcher in algorithms in cheminformatics and scoring functions for protein-ligand docking, and all-around good guy), has funding for a PhD student at the School of Pharmacy, University College Cork.
You would be developing open source tools for cheminformatics. I'm actually tempted by this one, except I think consulting pays better, I might have to move from Sweden, and I've got a different thesis I would like to work on.
Details about this PhD position.
Technical University of Denmark, on the north side of Copenhagen
Thomas Sicheritz-Ponten (whom I last saw at ISMB/Copenhagen, where he treated our table at the Scottish bar to a round of whiskys and I danced hustle with his girlfriend while the American musician played covers which everyone in the bar followed along to), has four open positions in his Metagenomics group at the Center for Biological Sequence Analysis. That's 3 PhD/postdoc positions in next gen sequencing related metagenomics and 1 postdoc position in archaeal genetics.
He also pointed out that there are 13 open position (PhD, postdoc, and programmer) for the entire center, including: postdoc within predictive bioinformatics for molecular epidemiology, scientific programmer for molecular epidemiology, PhD within epitope prediction, scientific programmer within disease systems biology, and postdoc or research assistant, bioinformatics generalist.
Details about the CBSA positions.
Get out there and start researching!
Copyright © 2001-2010 Dalke Scientific Software, LLC.


