/home/writings/diary/archive/2005/04/18/wrapping_command_line_programs_IV

Wrapping command-line programs, part IV

The first essay of this series showed how to use Python's subprocess module to do a one-shot call of an external program. The second turned that interface into a coprocess for better performance, at the cost of greater complexity. To make my code work I needed to recompile the code and flush the output after every name. That option isn't always available, so the third essay showed how to use pseudo-ttys to force the coprocess into line-buffered mode.

I have another trick up my sleeve. Nearly all programs, including the OpenEye ones, use shared libraries. When a program starts, the run-time loader looks at which libraries it needs and loads them as well. Here's a list of the ones used by mol2nam. (I've switched to a Linux machine for this essay, which is why the prompts are different.)

[~/src]$ ldd $OE_DIR/bin/mol2nam
        linux-gate.so.1 =>  (0xffffe000)
        libz.so.1 => /lib/libz.so.1 (0x4002c000)
        libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0x4003d000)
        libm.so.6 => /lib/tls/libm.so.6 (0x400fe000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x40121000)
        libc.so.6 => /lib/tls/libc.so.6 (0x4012a000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
[~/src]$ 
The ones I recognize are the compression library ("libz"), the standard C++ library ("libstdc++"), the math library ("libm"), the gcc library ("libgcc_s") and the standard C library ("libc").
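
If you want that library list inside a Python wrapper, the ldd output is easy to pick apart. What follows is only a minimal sketch: parse_ldd is a hypothetical helper name, and it assumes the "name => path (address)" line layout shown above, which other ldd versions may not follow exactly.

```python
def parse_ldd(text):
    """Map each library name in ldd output to its resolved path (or None)."""
    libraries = {}
    for line in text.splitlines():
        line = line.strip()
        # Only lines ending in a load address such as "(0x4002c000)" describe
        # a loaded object; shell prompts and other lines are skipped.
        if not line.endswith(")") or "(" not in line:
            continue
        entry = line[:line.rindex("(")].strip()
        name, _arrow, path = entry.partition("=>")
        # linux-gate.so.1 has no file on disk, so its path comes out as None.
        libraries[name.strip()] = path.strip() or None
    return libraries


# A few lines copied from the listing above.
SAMPLE = """\
        linux-gate.so.1 =>  (0xffffe000)
        libz.so.1 => /lib/libz.so.1 (0x4002c000)
        libc.so.6 => /lib/tls/libc.so.6 (0x4012a000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
"""

for name, path in parse_ldd(SAMPLE).items():
    print(name, "->", path)
```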

Here's a list of some of the external functions mol2nam needs in order to run:

[~/src]$ objdump -T $OE_DIR/bin/mol2nam 

/usr/local/openeye/bin/mol2nam:     file format elf32-i386

DYNAMIC SYMBOL TABLE:
00000000      DF *UND*  00000065  GLIBCPP_3.2 _ZNKSs17find_first_not_ofEPKcjj
00000000      DF *UND*  00000023  GLIBC_2.0   __umoddi3
00000000      DF *UND*  000000dc  GLIBC_2.0   __divdi3
00000000      DF *UND*  000002fa              deflate
00000000      DF *UND*  00002c59  GLIBC_2.0   __strtod_internal
00000000      DF *UND*  0000005b  GLIBCPP_3.2 _ZNKSs7compareERKSs
00000000      DF *UND*  000000c6  GLIBC_2.0   vsprintf
00000000  w   D  *UND*  00000000              pthread_create
00000000      DF *UND*  000000a8  GLIBCPP_3.2 _ZNSs7replaceEN9__gnu_cxx17__normal_iteratorIPcSsEES2_jc
00000000      DF *UND*  0000006d  GLIBC_2.0   feof
082334c0  w   DO .bss   00000014  GLIBCPP_3.2 _ZTVSt9bad_alloc
082334d8  w   DO .bss   00000010  GLIBCPP_3.2 _ZTIPb
00000000      DF *UND*  000000f1  GLIBC_2.0   ungetc
  ...
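The complicated names come from C++ name mangling, which encodes each function's class and argument types into a single identifier so that it fits a library format designed for C-style names. For example, _ZNKSs7compareERKSs is std::string::compare(std::string const&) const.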
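
The entries marked *UND* are the ones the loader has to resolve from some other library, which makes them the candidates for a shim. Here is a rough sketch for pulling them out of the objdump -T text. The helper name undefined_symbols is made up for this example, and the column layout is assumed to match the listing above.

```python
def undefined_symbols(objdump_text):
    """Return (name, version) pairs for the *UND* entries of `objdump -T`."""
    symbols = []
    for line in objdump_text.splitlines():
        fields = line.split()
        if "*UND*" not in fields:
            continue
        # After the section come the size, an optional version, and the name.
        rest = fields[fields.index("*UND*") + 1:]
        if len(rest) < 2:
            continue
        version = rest[1] if len(rest) >= 3 else None
        symbols.append((rest[-1], version))
    return symbols


# A few lines copied from the listing above.
SAMPLE = """\
00000000      DF *UND*  000002fa              deflate
00000000      DF *UND*  0000005b  GLIBCPP_3.2 _ZNKSs7compareERKSs
00000000  w   D  *UND*  00000000              pthread_create
00000000      DF *UND*  0000006d  GLIBC_2.0   feof
082334c0  w   DO .bss   00000014  GLIBCPP_3.2 _ZTVSt9bad_alloc
"""

for name, version in undefined_symbols(SAMPLE):
    print(name, version or "(no version)")
```

The .bss entry is skipped because the program defines that symbol itself.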

With care and with some restrictions it's possible to tell the run-time loader to use functions from another library. The mechanism is different on different operating systems but the general idea is to create what's often called a shim. The shim implements one or more of the functions needed by the program so the loader imports those definitions. When the program calls that function the shim is then free to do whatever it wants. In some cases it may define completely new functionality, like a new memory allocator or random number generator. In others it may extend existing functionality, like zlibc, which transparently opens compressed files as if they were uncompressed. If the file is not compressed it forwards the call to the original function.

In this case I want to set stdout to be in line-buffered mode. I need to hook into a function mol2nam needs. I tried fdopen() and printf() but those didn't seem to work and I don't know why. Instead I remembered that the OpenEye tools use getenv() to get the value of the OE_DIR environment variable. That's an easy hook so I wrote the following as linebuffered_stdout.c:

/* LD_PRELOAD shim to set stdout to line buffered mode when
   getenv is called  */

/*  "The  symbols  RTLD_DEFAULT  and RTLD_NEXT are defined by <dlfcn.h>
    only when _GNU_SOURCE was defined before including it. " */
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>

char *getenv(const char *s) {
  char *(*real_getenv)(const char *s);
  setlinebuf(stdout);
  real_getenv = dlsym(RTLD_NEXT, "getenv");
  return real_getenv(s);
}
The function has the same call signature as the real getenv(). When it's called it sets stdout to line-buffered mode. It then uses dlsym() with RTLD_NEXT to look up the next definition of "getenv" in the library search order, which is the real one in libc. I don't do any error checking since I know it exists. I then call it and return whatever it returns.

I compiled the code with the following:

cc -o linebuffered_stdout.so -fpic -shared linebuffered_stdout.c -ldl -lc
The result is a shared library named linebuffered_stdout.so. To see it in action I send mol2nam two lines of input. The first is "C" and the second is "CC", with a 4-second delay between them. Note the -u option to Python, which makes sure Python doesn't buffer its output. I pipe the output of mol2nam to cat -u. Because its stdout is then a pipe instead of a terminal, mol2nam switches to block-buffered mode, and the -u option tells cat not to add any buffering of its own.
[~/src]$ python -u -c 'import time;print "C";time.sleep(4);print "CC"' | \
?   $OE_DIR/bin/mol2nam - | cat -u
mol2nam v1.0  Structure to Name Conversion
OpenEye Scientific Software, November 2003

methane
ethane
[~/src]$
There is an obvious delay, then both compound names are printed with no delay between them. This is the expected behaviour for block-buffered mode.

Here's the version of the same test case using LD_PRELOAD to insert this shim that forces stdout to be line buffered.

[~/src]$ python -u -c 'import time;print "C";time.sleep(4);print "CC"' | \
?   env LD_PRELOAD=./linebuffered_stdout.so $OE_DIR/bin/mol2nam - | cat -u
mol2nam v1.0  Structure to Name Conversion
OpenEye Scientific Software, November 2003

methane
ethane
[~/src]$
If you watch it happen you'll see that "methane" is printed out then there's a pause of a few seconds before "ethane" is printed out. It works!
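
The same difference can be measured without mol2nam. The sketch below is only an illustration: it uses a small Python child process as a stand-in for the real program, and python -u as a stand-in for the shim, since both have the effect of getting each line out of the child as soon as it is written. The function name gap_between_lines and the one-second sleep are choices made for this example.

```python
import os
import subprocess
import sys
import time

# Stand-in for mol2nam: write one name, pause, write another.
CHILD_CODE = (
    "import sys, time\n"
    "sys.stdout.write('methane\\n')\n"
    "time.sleep(1.0)\n"
    "sys.stdout.write('ethane\\n')\n"
)


def gap_between_lines(unbuffered):
    """Seconds between the arrival of the child's first and second line."""
    args = [sys.executable]
    if unbuffered:
        args.append("-u")
    args += ["-c", CHILD_CODE]
    # Make sure an inherited PYTHONUNBUFFERED doesn't hide the effect.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    with subprocess.Popen(args, stdout=subprocess.PIPE, env=env) as proc:
        proc.stdout.readline()
        first_seen = time.monotonic()
        proc.stdout.readline()
        second_seen = time.monotonic()
    return second_seen - first_seen


if __name__ == "__main__":
    print("block buffered: %.2f s" % gap_between_lines(unbuffered=False))
    print("unbuffered:     %.2f s" % gap_between_lines(unbuffered=True))
```

With a pipe on stdout the child holds both lines until it exits, so the gap is close to zero. With -u the gap is roughly the length of the sleep.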

Different operating systems use different ways to affect the run-time loader. My primary machine is a Mac, and after some 6 hours of searching, testing, and reading documentation and source I still couldn't figure out the equivalent to dlsym(RTLD_NEXT,...). Please let me know how to implement this sort of shim under that OS. (If I really needed something to work I would hard-code my shim getenv() to return the expected values and not worry about calling the original function.)

To make this work with the smi2name wrapper code I wrote a small shell script to set LD_PRELOAD correctly before starting mol2nam.

#!/bin/sh

LD_PRELOAD=./linebuffered_stdout.so $OE_DIR/bin/mol2nam -
I then changed the default MOL2NAM setting in the Python file to point to this wrapper shell script and ran the self-test. After a few tens of seconds I was worried. Was the code stuck somewhere? I quit the program and added some print statements.

The problem was the segmentation fault test with the very large SMILES string. It didn't cause a segmentation fault under Linux. I made the string some 16 times larger and it still didn't crash. So I commented out that test and reran the test suite. Everything passed.
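
An alternative to the wrapper shell script is to set LD_PRELOAD from Python when starting the coprocess. This is a sketch and not the code from the earlier essays: preload_env and start_mol2nam are hypothetical names, and it assumes a Linux loader, which accepts a colon-separated LD_PRELOAD list. Converting the shim path to an absolute one avoids the script's dependence on the current directory.

```python
import os
import subprocess


def preload_env(shim_path, base_env=None):
    """Copy the environment and put shim_path at the front of LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    shim_path = os.path.abspath(shim_path)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = shim_path + (":" + existing if existing else "")
    return env


def start_mol2nam(mol2nam_path, shim_path):
    """Start mol2nam as a coprocess with the line-buffering shim preloaded."""
    return subprocess.Popen(
        [mol2nam_path, "-"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        env=preload_env(shim_path),
    )


if __name__ == "__main__":
    # Show the value the child would see, without starting anything.
    print(preload_env("./linebuffered_stdout.so", {})["LD_PRELOAD"])
```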


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2020 Andrew Dalke Scientific AB