Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2010/04/12/compiling_openchord

How to compile openCHORD on a Mac

In my upcoming training course I will teach how to develop web applications for cheminformatics. A few people have signed up already but there are still slots open for those interested. It will be in Leipzig in the latter part of May.

One topic will be connecting to a remote relational database to do chemistry queries. I decided to base it on TJ O'Donnell's openCHORD, which is a chemistry extension on top of PostgreSQL. TJ quite literally wrote the book on the Design and Use of Relational Databases in Chemistry, which uses CHORD for its examples.

I am not a database person. I know enough to use one and to design schemas. Most databases are designed with a keeper in mind (called a DBA, for "DataBase Administrator") who knows and cares a lot about how the database works. I'm not one of those people. I dread each time I have to work with a database on my own.

(One exception is SQLite. It's designed to have zero administration, and because it's an in-process database it doesn't have the hassles of connecting to a database process that MySQL, PostgreSQL, Oracle and others have. But it doesn't make sense for pharmas to keep their corporate data in SQLite.)
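That zero-administration point is easy to see from Python's built-in sqlite3 module; a minimal sketch (the table and values are just for illustration):

```python
import sqlite3

# No server process, no accounts, no configuration file: the database is just
# a file (or, as here, an in-memory database) opened directly in-process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (id INTEGER PRIMARY KEY, smiles TEXT)")
conn.execute("INSERT INTO compound (smiles) VALUES (?)", ("c1ccccc1O",))
rows = list(conn.execute("SELECT id, smiles FROM compound"))
print(rows)
```

Compare that to the initdb/createdb/pg_ctl steps needed for PostgreSQL later in this writeup.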

TJ is a database person and has been using PostgreSQL for many years. He made his openCHORD code available. It's designed for someone like him who knows how to make the pieces go together. I don't have that experience so I struggled for a while. Here's my step-by-step process. I hope it helps me next time I need to do this, and you, if you want to try it out.

Prerequisites

openCHORD uses C and Python code to extend PostgreSQL functionality. It uses OpenBabel for the chemistry. This means you'll need to have Python installed (which is always the case on a Mac) as well as gcc (available in the developer tools). You'll also need to install OpenBabel, including the Python interface.

Install PostgreSQL with server- and client-side Python support

First I needed to install PostgreSQL, along with support for server-extensions written in Python ("plpythonu") and the Python client code.

I'm using a Mac. I have MacPorts installed. That's a package management system for a lot of the freely available Unix packages. My first step was to ask for the Python client library for PostgreSQL and see what happens:

% sudo port install py-pygresql
Password:
--->  Computing dependencies for py-pygresql
--->  Fetching postgresql83
--->  Attempting to fetch postgresql-8.3.8.tar.bz2 from http://arn.se.distfiles.macports.org/postgresql83
--->  Attempting to fetch postgresql-8.3.8.tar.bz2 from http://ftp7.de.postgresql.org/ftp.postgresql.org/source/v8.3.8/
--->  Verifying checksum(s) for postgresql83
--->  Extracting postgresql83
  ...

You can see that it installed version 8.3.8. The most recent release is 8.4.3, and the most recent in the 8.3 series is 8.3.10, but 8.3.8 should be plenty good enough for my needs.

plpython

However! After working through TJ's instructions I got a failure trying to execute:

sudo -u $owner createdb openchord
where "$owner" is the database superuser account; in my case "dalke", but it's often "postgres".

It gave me an error which included the message

could not access file "$libdir/plpython"
That's because PostgreSQL must be compiled with support for Python extensions in the server, and the default ports installation does not do that. Instead, I need to request the "+python" variant:
% sudo port install -d postgresql83-server +python
--->  Computing dependencies for postgresql83-server
--->  Fetching postgresql83-server
--->  Verifying checksum(s) for postgresql83-server
--->  Extracting postgresql83-server
--->  Configuring postgresql83-server
--->  Building postgresql83-server
--->  Staging postgresql83-server into destroot
--->  Creating launchd control script
###########################################################
# A startup item has been generated that will aid in
# starting postgresql83-server with launchd. It is disabled
# by default. Execute the following command to start it,
# and to cause it to launch at startup:
#
# sudo launchctl load -w /Library/LaunchDaemons/org.macports.postgresql83-server.plist
###########################################################
--->  Installing postgresql83-server @8.3.8_0

To create a database instance, after install do
 sudo mkdir -p /opt/local/var/db/postgresql83/defaultdb
 sudo chown postgres:postgres /opt/local/var/db/postgresql83/defaultdb
 sudo su postgres -c '/opt/local/lib/postgresql83/bin/initdb -D /opt/local/var/db/postgresql83/defaultdb'

To tweak your DBMS, consider increasing kern.sysv.shmmax by adding an increased kern.sysv.shmmax .. to /etc/sysctl.conf
--->  Activating postgresql83-server @8.3.8_0
--->  Cleaning postgresql83-server

PostgreSQL is now installed. Make sure its installation directory is on your shell path. If you used MacPorts and installed version 8.3 like me (and use the same csh-style shell I do) then that's:

    set path= ( $path /opt/local/lib/postgresql83/bin )

Initialize database storage with initdb

The next step is to initialize storage for the database. I'm going to put the files under ~/teaching/pg .

% initdb -D ~/teaching/pg
The files belonging to this database system will be owned by user "dalke".
This user must also own the server process.

The database cluster will be initialized with locale en_US.UTF-8.
The default database encoding has accordingly been set to UTF8.
The default text search configuration will be set to "english".

creating directory /Users/dalke/teaching/pg ... ok
creating subdirectories ... ok
selecting default max_connections ... 20
selecting default shared_buffers/max_fsm_pages ... 1600kB/20000
creating configuration files ... ok
creating template1 database in /Users/dalke/teaching/pg/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.

Success. You can now start the database server using:

    postgres -D /Users/dalke/teaching/pg
or
    pg_ctl -D /Users/dalke/teaching/pg -l logfile start
All this did was initialize the storage space; it did not start the database server. I'll do that after I install the openCHORD extension.

Installing openCHORD

Now that the base PostgreSQL installation is in place, the next step is to install the openCHORD extension. The source file is named openchord.tgz and it unpacks into the current directory, so I did:

% curl -O http://www.gnova.com/demos/openchord.tgz
% mkdir openchord
% cd openchord
% tar -xvzf ../openchord.tgz
x amw.sql
x bits.sql
x frowns-core.sql
x makedb
x Makefile
x makenci
x openbabel-core.sql
x openchord.py
x perlmol-core.sql
x public166keys.sql
x sdfloader
x similarity.sql
x splitter.sql
x tpsa.sql
x varbit.c
x varbit.so
x varbit.sql
x vla4.sdf

There are three installation phases: build and install the shared library, run the SQL setup scripts, and install the openchord.py support module.

Build and install the shared library

The shared library "varbit.so" defines a few functions to help work with bitsets, which are a built-in datatype in PostgreSQL. openCHORD comes with a Makefile, but it only works on Linux and is hard-coded for PostgreSQL 8.2 installed under /usr. I have a Mac, a different database version, and a different installation location. The easiest solution was to make a platform-specific file named "Makefile.mac" containing the following:

# Makefile.mac
# Compile openCHORD for a Mac

INCS = -I/opt/local/include/postgresql83/server
LIBS = -L/opt/local/lib/postgresql83/ -lpq

varbit.so: varbit.c
	$(CC) -c varbit.c $(INCS)
	$(CC) -bundle -flat_namespace -undefined suppress -o varbit.so varbit.o

Part of openCHORD fails to compile because of a change between PostgreSQL 8.2 and 8.3. If you try to compile as-is then you'll see:

% make -f Makefile.mac
cc -c varbit.c -I/opt/local/include/postgresql83/server
varbit.c: In function 'bit_set':
varbit.c:90: error: lvalue required as left operand of assignment
make: *** [varbit.so] Error 1
This is due to an API change in PostgreSQL 8.3, which replaced the old VARATT_SIZEP() lvalue macro with the SET_VARSIZE() macro. Here's the basic diff:
% diff varbit.c varbit.c.orig
90c90
< 	SET_VARSIZE(result, rlen);
---
> 	VARATT_SIZEP(result) = rlen;
and here's the unified diff version of the same thing:
% diff -u varbit.c varbit.c.orig
--- varbit.c	2010-04-11 01:56:37.000000000 +0200
+++ varbit.c.orig	2010-04-11 01:55:50.000000000 +0200
@@ -87,7 +87,7 @@
 /* create result bitstring and copy input bitstring to result */
         int rlen = VARBITTOTALLEN(alen);
 	result = (VarBit *) palloc0(rlen);
-	SET_VARSIZE(result, rlen);
+   VARATT_SIZEP(result) = rlen;
    VARBITLEN(result) = alen;
 	unsigned char *ap = VARBITS(a);
 	unsigned char *rp = VARBITS(result);

With that change in place, the build succeeds:

% make -f Makefile.mac
cc -c varbit.c -I/opt/local/include/postgresql83/server
cc -bundle -flat_namespace -undefined suppress -o varbit.so varbit.o

This shared library must be installed into the PostgreSQL package library directory. Use the helper utility "pg_config" to find that location, then copy the varbit.so file there:

% pg_config --pkglibdir
/opt/local/lib/postgresql83
% sudo cp varbit.so `pg_config --pkglibdir`

Start the database server and verify the installation

Now you can start the database server. I'll start it in interactive mode so I can see the log messages sent to the terminal, then switch to another terminal window to keep working:

% postgres -D /Users/dalke/teaching/pg
There's also a way to run it in the background using "pg_ctl", and a way to have the server start automatically when the computer starts up, via launchctl. I won't go into those here, though.

How can you verify that it installed correctly? If you look in varbit.c you'll see it defines these functions:

Datum   nbits_set(PG_FUNCTION_ARGS);
Datum   isbit_set(PG_FUNCTION_ARGS);
Datum   bit_set(PG_FUNCTION_ARGS);
Datum   bit_contains(PG_FUNCTION_ARGS);
I should be able to call them.
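Before calling them from SQL, here's a pure-Python sketch of what I expect the first two to compute, based on the names and a read of varbit.c (the indexing convention for isbit_set is my guess):

```python
def nbits_set(bits):
    """Count the 1 bits in a bit string written like PostgreSQL's b'1001'."""
    return bits.count("1")

def isbit_set(bits, n):
    """Report whether bit n (counting from the left, starting at 0) is set."""
    return bits[n] == "1"

print(nbits_set("1001"))     # 2
print(isbit_set("1001", 0))  # True
```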

I'll start the interactive PostgreSQL client. The default database name is my username but that database doesn't exist, so I'll tell it to use the "postgres" database. (You can get a list of all databases with "psql -l".)

% psql postgres
Welcome to psql 8.3.8, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

postgres=#
I have to tell PostgreSQL to create a new function based on the extension module. The actual SQL text is:
CREATE or REPLACE FUNCTION nbits_set(bit)
 RETURNS integer AS 'varbit', 'nbits_set'
 LANGUAGE c IMMUTABLE STRICT;
which I copied from 'varbit.sql'. I'll do that at the client prompt:
postgres=# CREATE or REPLACE FUNCTION nbits_set(bit)
postgres-#  RETURNS integer AS 'varbit', 'nbits_set'
postgres-#  LANGUAGE c IMMUTABLE STRICT;
CREATE FUNCTION
postgres=#
then test it out. The b'1001' is bit-string constant syntax for a four-bit value.
postgres=# select nbits_set(b'1001');
 nbits_set 
-----------
         2
(1 row)
postgres=# select nbits_set(b'001');
 nbits_set 
-----------
         1
(1 row)
Go ahead and try other inputs to check that it works.

Now that I know it works, I'll remove the function since I don't actually need it in the "postgres" database. I'll drop it then prove that it's no longer present:

postgres=# DROP FUNCTION nbits_set(bit);
DROP FUNCTION
postgres=# select nbits_set(b'1001');
ERROR:  function nbits_set(bit) does not exist
LINE 1: select nbits_set(b'1001');
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Run the SQL setup scripts

The next step is to run the various SQL scripts which install the extension functions and set up some auxiliary tables. These are driven by the script "makedb", but there were some problems with it. For one, it assumed the "postgres" username exists, which isn't true for my install. I've tweaked it a bit to make it work for me:

# makedb (modified from TJ's original version)
owner=$1
if [ "$owner" == "" ] ; then
  echo "No owner (user name) specified"
  echo "usage: makedb owner"
  exit
fi

sudo -u $owner createdb openchord
sudo -u $owner createlang plpythonu openchord
sudo -u $owner psql openchord <varbit.sql 
sudo -u $owner psql openchord <openbabel-core.sql 
sudo -u $owner psql openchord <bits.sql 
sudo -u $owner psql openchord <public166keys.sql 
sudo -u $owner psql openchord <similarity.sql 
sudo -u $owner psql openchord <amw.sql 
sudo -u $owner psql openchord <tpsa.sql 

The file names give a reasonable idea of what the individual SQL scripts do: varbit.sql creates the bit functions from the shared library, bits.sql adds more bit-string support, openbabel-core.sql creates the functions which call out to OpenBabel through plpythonu, public166keys.sql sets up the public 166-key fingerprint definitions, similarity.sql defines similarity measures over fingerprints, and amw.sql and tpsa.sql add the molecular weight and topological polar surface area functions.

Here I'll run the script, which needs an account name with sufficient privileges:

% bash makedb dalke
CREATE FUNCTION
COMMENT
CREATE FUNCTION
COMMENT
 ...
CREATE FUNCTION
COMMENT

Did it work? I'll test to see if the nbits_set function was added to the newly created "openchord" database:

% psql openchord
Welcome to psql 8.3.8, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

openchord=# select nbits_set(b'1000');
 nbits_set 
-----------
         1
(1 row)
Success!

What doesn't yet work is the code which calls out to OpenBabel via Python.

openchord=# select amw('c1ccccc1O');
ERROR:  plpython: function "count_matches" failed
DETAIL:  <type 'exceptions.ImportError'>: No module named openchord
CONTEXT:  SQL function "amw" statement 1
To fix that requires the final step.

Install openchord.py

The file "openchord.py" contains a set of support functions used by the functions created by openbabel-core.sql, most notably parsing SMILES and SMARTS strings and caching the resulting molecule and pattern objects. It must be installed on the PYTHONPATH for the PostgreSQL extensions to be able to use it.
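I haven't copied TJ's code here, but the caching idea can be sketched in a few lines: keep a dict mapping each input string to its parsed object, and once the cache is full, evict an arbitrary entry to bound memory use. The parse argument below is a stand-in for a real SMILES or SMARTS parser:

```python
def make_cached_parser(parse, maxsize=10000):
    """Wrap 'parse' with a dict-based cache, evicting arbitrarily when full."""
    cache = {}
    def cached_parse(text):
        if text in cache:
            return cache[text]
        if len(cache) >= maxsize:
            # popitem() removes some (key, value) pair; which one is
            # unspecified, but it keeps the cache from growing without bound.
            cache.popitem()
        result = parse(text)
        cache[text] = result
        return result
    return cached_parse

parse_smi = make_cached_parser(lambda s: ["molecule for", s])
mol1 = parse_smi("c1ccccc1O")
mol2 = parse_smi("c1ccccc1O")
print(mol1 is mol2)   # True: the second call came from the cache
```

A dict's popitem() is crude compared to a true LRU cache, but it keeps the server process's memory use bounded.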

It is possible to install the file by hand, but I would rather let Python do it for me. I created the following "setup.py"

# setup.py
from distutils.core import setup

setup(name="openchord",
      version='1.0',
      py_modules=['openchord'],
      )
then ran the standard install command:
% sudo python setup.py install

The "import openchord" failure from the previous section suggests that the Python runtime embedded in the database server doesn't know the newly installed module is available, so I killed the database server and restarted it.
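A quick way to check whether a given Python can now see the module is to try importing it. This helper is my own, not part of openCHORD:

```python
def can_import(name):
    """Report whether the named module is importable on the current sys.path."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False

print(can_import("openchord"))  # True once "setup.py install" has run
```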

Does it work?

% psql openchord
 ...
openchord=# select amw('c1ccccc1O');
  amw  
-------
 94.12
(1 row)

Yes, it does! I now have openCHORD installed.

Importing the NCI data set

openCHORD comes with a script named "makenci" which imports the NCI data set (NCI-Open_09-03.sdf.gz) into the openchord database.

The script doesn't work on Macs because it uses 'zcat', which on OS X only understands .Z files and not .gz. I also didn't like how it gets the database username from the variable USERNAME instead of from the command line. Here are the tweaks:

% diff  makenci.orig makenci
2c2
< owner=$USERNAME
---
> #owner=$USERNAME
11c11
< time zcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord
---
> time gzcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord
or in unified format:
--- makenci.orig	2010-04-11 03:16:26.000000000 +0200
+++ makenci	2010-04-11 03:16:44.000000000 +0200
@@ -1,5 +1,5 @@
 owner=$1
-owner=$USERNAME
+#owner=$USERNAME
 if [ "$owner" == "" ] ; then
  echo "no owner (user name) specified"
  echo "usage: makenci owner"
@@ -8,7 +8,7 @@
 
 # iconv -c removes any chars that cannot be represented as utf-8, which postgres uses
 # see iconv --help to convert chars in another encoding
-time zcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord
+time gzcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord
 
 #  you may wish to use some of the following SQL commmands on the new schema and tables
 #Grant Usage On Schema nci To Public;

If you look at the script it's mostly comments, containing suggestions of how to work with the data. The core code is this single line:

time gzcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord
which takes the NCI data set, drops any characters which can't be represented as UTF-8 (with iconv -c), then calls "sdfloader" to parse the SD file and generate SQL commands for bulk data loading, which it pipes to the psql client for transfer to the database server.
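The iconv -c step can be mimicked in Python: decoding with the "ignore" error handler silently drops any bytes which aren't valid UTF-8. A sketch:

```python
def clean_utf8(raw):
    """Drop bytes which cannot be decoded as UTF-8, like 'iconv -c' does."""
    return raw.decode("utf-8", "ignore")

# 0xE9 is a Latin-1 e-acute, which is not valid UTF-8 on its own, so it
# is dropped:
print(clean_utf8(b"caf\xe9 data"))  # prints "caf data"
```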

The documentation warns that the data load takes an hour. It took my laptop a bit longer than that. Here's what it looked like in operation:

% bash ./makenci dalke
CREATE SCHEMA
GRANT
CREATE SEQUENCE
CREATE TABLE
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "structure_pkey" for table "structure"
CREATE TABLE
CREATE TABLE
GRANT
GRANT
GRANT
  {several minutes with no output}
SET
  {much longer period with no output (about 50 minutes)}
INSERT 0 260071
  {more waiting}

server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

real	68m43.143s
user	2m0.958s
sys	0m10.789s

Tracking down an OpenBabel problem. Not yet successful.

That was unexpected. The database server crashed. The database log says:

LOG:  server process (PID 58767) was terminated by signal 11: Segmentation fault
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2010-04-11 16:19:14 CEST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/4A00E6B8
LOG:  unexpected pageaddr 0/45BCA000 in log file 0, segment 76, offset 12361728
LOG:  redo done at 0/4CBC9868
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Based on the outputs, this happened either during canonical SMILES generation or during fingerprint generation. Databases are designed to keep the data even in the face of crashes, so figuring out where it crashed shouldn't be too hard. The relevant code from sdfloader is:

Insert Into $schema.structure (id, name, isosmiles, coords, atoms)
   Select id, (molfile_mol(molfile)).* from $schema.sdf;

Update $schema.structure Set cansmiles=cansmiles(isosmiles) Where valid(isosmiles);
Update $schema.structure Set fp=fp(cansmiles) Where valid(cansmiles);
I'll ask how many records there are, how many have isosmiles set, how many have cansmiles set, and how many have their fingerprints set:
openchord=# select count(*) from nci.structure;
 count  
--------
 260071
(1 row)

openchord=# select count(*) from nci.structure where nci.structure.isosmiles is NULL;
 count 
-------
     0
(1 row)

openchord=# select count(*) from nci.structure where nci.structure.cansmiles is NULL;
 count  
--------
 260071
(1 row)

openchord=# select count(*) from nci.structure where nci.structure.fp is NULL;
 count  
--------
 260071
(1 row)

Huh. It looks like the problem was with canonicalization. Okay, I'll walk through that manually:

openchord=# select count(*) from nci.structure where openbabel.valid(isosmiles);
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Wow! Some sort of bug in openbabel.valid()? Investigating ....

I've been working with TJ to try to track this down. It only happens on my Mac, which is running OS X 10.6 and OpenBabel 2.2.3. I have a reproducible test case which uses only OpenBabel:

import sys
import gzip
import openbabel

f = gzip.open("NCI-Open_09-03.sdf.gz")
f = iter(enumerate(f))
GD = {}
GD['mol'] = dict()
GD['obc'] = openbabel.OBConversion()
GD['obc'].SetInFormat("smi")
GD['nmol'] = 0
GD['maxsmi'] = 10000

n = 0
for lineno, line in f:
   if lineno % 10000 == 0:
       sys.stdout.write("\r %d / %d" % (n, lineno))
       sys.stdout.flush()
   if line.startswith("> <E_SMILES>"):
       lineno, line = next(f)
       smiles = line.strip()
       #mol = openchord.parse_smi(GD, smiles)

       if GD['nmol'] < GD['maxsmi']:
           mol = openbabel.OBMol()
           GD['nmol'] += 1
           #plpy.notice('new mol for %s' % smiles)
       else:
           key,mol = GD['mol'].popitem()
           #plpy.notice('mol reuse %s for %s' % (key,smiles))

       if GD['obc'].ReadString(mol, smiles):
           GD['mol'][smiles] = mol
           # return copy is slower, but safer?
           # return openbabel.OBMol(mol)

       n += 1
The errors I get are:
% python tj.py
226288 / 46080000Segmentation fault
meaning it died after reading 226,288 SMILES, more than 46,000,000 lines into the input file. This takes a long time to fail, which has made it very irritating to track down.

The stack trace one time (through Apple's crash reporter) looked like:

Thread 0 Crashed:  Dispatch queue: com.apple.main-thread
0   libstdc++.6.dylib             	0x00007fff861e47cd __dynamic_cast + 89
1   smilesformat.so               	0x0000000101a0fd8e OpenBabel::SMIBaseFormat::ReadMolecule(OpenBabel::OBBase*, OpenBabel::OBConversion*) + 78 (base.h:254)
2   libopenbabel.3.dylib          	0x00000001013dfe2c OpenBabel::OBConversion::Read(OpenBabel::OBBase*, std::istream*) + 220 (obconversion.cpp:745)
3   libopenbabel.3.dylib          	0x00000001013e3f4c OpenBabel::OBConversion::ReadString(OpenBabel::OBBase*, std::string) + 508 (obconversion.cpp:893)
4   _openbabel.so                 	0x0000000101135d6c _wrap_OBConversion_ReadString + 881
and another time (through gdb) was
#0  OpenBabel::OBMol::DestroyBond (this=<value temporarily unavailable, due to optimizations>, bond=0x10541efa0) at mol.cpp:1471
#1  0x0000000101a0fd9f in OpenBabel::OBConversion::GetInStream () at /Users/dalke/ftps/openbabel-2.2.3/include/openbabel/obconversion.h:256
#2  0x0000000101a0fd9f in OpenBabel::SMIBaseFormat::ReadMolecule (this=0x10541efa0, pOb=0x102c83990, pConv=0x10027cc10) at base.h:854
#3  0x00000001013dfe2c in OpenBabel::OBConversion::Read (this=0x10027cc10, pOb=0x102c81140, pin=<value temporarily unavailable, due to optimizations>) at obconversion.cpp:745
#4  0x00000001013e3f4c in OpenBabel::OBConversion::ReadString (this=0x10027cc10, pOb=0x102c81140, input=<value temporarily unavailable, due to optimizations>) at obconversion.cpp:893
warning: .o file "/Users/dalke/ftps/openbabel-2.2.3/scripts/python/build/temp.macosx-10.6-universal-2.6/openbabel_python.o" more recent than executable timestamp in "/Library/Python/2.6/site-packages/_openbabel.so"
warning: Couldn't open object file '/Users/dalke/ftps/openbabel-2.2.3/scripts/python/build/temp.macosx-10.6-universal-2.6/openbabel_python.o'
#5  0x0000000101135d6c in _wrap_OBConversion_ReadString ()
(Both times I truncated the stack trace once it got into the Python run-time.)

I'm now running it with memory checks enabled (valgrind does not work on Mac OS X 10.6). That's very slow, so I'll let it run overnight. I'll also see if building OpenBabel from version control fixes the problem, and email the OpenBabel list to see if anyone there knows what's going on.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB