Wrapping command-line programs, part V
I'm nearly out of tricks for dealing with cantankerous programs, but the ones I've outlined should be enough. I've only once had a real need to disassemble the binary and tweak the assembly code (which isn't allowed under my license with OpenEye). I've come across, but never tried, techniques like dynamically overriding a program while it's running.
So let's get to more practical matters. For the following I'll use the subprocess version of smi2name developed in the second essay of this current series.
OpenEye's naming code is very fast, but give it a large enough structure and it still takes some time. Here's a little program to time how long it takes to name structures of the form "C" * length, where length is 1, 2, 5, 10, ....
import time
import smi2name_subprocess as smi2name

# Make sure the coprocess has started
smi2name.smi2name("C")

for i in (1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000,
          5000, 10000, 20000):
    t1 = time.time()
    name = smi2name.smi2name("C" * i)
    t2 = time.time()
    print i, "%.5f" % (t2-t1)
As the timings show, the naming time scales roughly quadratically with the length of the chain.
| length | time in seconds |
|---|---|
| 1 | 0.00045 |
| 2 | 0.00093 |
| 5 | 0.00113 |
| 10 | 0.00084 |
| 20 | 0.0014 |
| 50 | 0.0050 |
| 100 | 0.014 |
| 200 | 0.066 |
| 500 | 0.275 |
| 1000 | 1.023 |
| 2000 | 4.11 |
| 5000 | 28.0 |
| 10000 | 120. |
| 20000 | 522. |
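One way to check the "roughly quadratic" claim is to fit a straight line to log(time) against log(length); the slope estimates the exponent. The sketch below does that with a plain least-squares fit over the rows for length 100 and up, where the fixed per-call overhead no longer dominates. The helper name `loglog_slope` and the choice of cutoff are mine, and the code is written for Python 3 rather than the Python 2 used elsewhere in this essay.

```python
import math

# (length, seconds) pairs copied from the timing table, length >= 100.
timings = [(100, 0.014), (200, 0.066), (500, 0.275), (1000, 1.023),
           (2000, 4.11), (5000, 28.0), (10000, 120.0), (20000, 522.0)]

def loglog_slope(points):
    """Least-squares slope of log(seconds) versus log(length)."""
    xs = [math.log(length) for length, _ in points]
    ys = [math.log(seconds) for _, seconds in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# A slope near 2 means doubling the length roughly quadruples the time.
print("estimated exponent: %.2f" % loglog_slope(timings))
```

The fitted exponent comes out close to 2, which matches what the last few rows suggest: each doubling of the length multiplies the time by about four.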
Suppose I hit control-C part way through a long computation.
>>> import smi2name_subprocess as smi2name
>>> smi2name.smi2name("C" * 10000)
^CTraceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "smi2name_subprocess.py", line 91, in smi2name
    return _smi2name(smiles)
  File "smi2name_subprocess.py", line 63, in smi2name
    rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr], [], [])
KeyboardInterrupt
>>> smi2name.smi2name("C")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "smi2name_subprocess.py", line 91, in smi2name
    return _smi2name(smiles)
  File "smi2name_subprocess.py", line 61, in smi2name
    mol2nam.stdin.write(smiles + "\n")
IOError: [Errno 32] Broken pipe
>>>
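The control-C killed the coprocess. Why? The terminal sends the interrupt signal (SIGINT) to every process in the foreground process group, and the coprocess is in the same group as Python, so both received it. It's part of the great mystery that is Unix terminal process control. But it's easy to fix. If the write fails, reset the connection to the coprocess and try again. If it fails again then let the exception propagate because there's no way to even attempt a recovery.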
def smi2name(self, smiles):
    """convert a SMILES string into an IUPAC name"""
    if smiles == "":
        return "vacuum"
    m = _unexpected_char_pat.search(smiles)
    if m:
        raise NamingError("Unexpected character at position %d" %
                          (m.start(0)+1,))

    mol2nam = self._get_mol2nam()
    try:
        mol2nam.stdin.write(smiles + "\n")
    except IOError:
        # coprocess died since the last call? Restart the connection
        self._mol2nam = None
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
    mol2nam.stdin.flush()
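The retry-once idea can be tried without mol2nam. Here is a minimal Python 3 sketch that uses `cat` as a stand-in coprocess, since it echoes each line back. The class name `LineEcho` and its methods are hypothetical, made up for this illustration. The pipes are opened unbuffered (`bufsize=0`) so that a dead coprocess shows up as an error in `write()` itself; with Python 3's default buffering the error would only appear at `flush()`.

```python
import subprocess

class LineEcho:
    """Hypothetical stand-in for the wrapper class; `cat` is the coprocess."""

    def __init__(self):
        self._proc = None

    def _get_proc(self):
        # Start the coprocess on first use, or again after a reset.
        if self._proc is None:
            self._proc = subprocess.Popen(
                ["cat"], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                bufsize=0)  # unbuffered: a broken pipe fails in write()
        return self._proc

    def ask(self, line):
        data = (line + "\n").encode("ascii")
        proc = self._get_proc()
        try:
            proc.stdin.write(data)
        except OSError:  # BrokenPipeError is a subclass of OSError
            # Coprocess died since the last call? Clean up, restart,
            # and try exactly once more. A second failure propagates.
            proc.stdin.close()
            proc.stdout.close()
            proc.poll()  # reap the dead child if it has exited
            self._proc = None
            proc = self._get_proc()
            proc.stdin.write(data)
        return proc.stdout.readline().decode("ascii").rstrip("\n")
```

To see the recovery, call `ask()` once, kill the coprocess by hand (for example with `echo._proc.kill()` followed by `echo._proc.wait()`), and call `ask()` again. The second call should restart `cat` and still return the echoed line.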
Suppose that mol2nam occasionally takes too long to run. It might be because of an extremely long SMILES input or some sort of hypothetical bug that puts it in an infinite loop. I want the wrapper code to identify that case and recover. Detection is easy. If there's nothing from stdout or stderr for more than some timeout value then it has taken too long. We're already using a select() call to figure out which pipe has activity. The select() function has an optional 4th argument, which is the timeout in seconds. If the timeout is reached it returns three empty lists.
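The timeout behavior of select() is easy to see in isolation. This Python 3 sketch uses a plain os.pipe() instead of a coprocess; the 0.2 second timeout and the variable names are arbitrary choices for the illustration.

```python
import os
import select

read_fd, write_fd = os.pipe()

# Nothing has been written yet, so select() waits for the timeout
# and then returns three empty lists.
quiet = select.select([read_fd], [], [], 0.2)

# Once there is data in the pipe, select() returns right away and
# the readable descriptor shows up in the first list.
os.write(write_fd, b"methane\n")
ready = select.select([read_fd], [], [], 0.2)

print(quiet)
print(ready[0] == [read_fd])

os.close(read_fd)
os.close(write_fd)
```

So the wrapper only needs one extra check: if neither stdout nor stderr is in the returned read list, the timeout was reached.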
What remains is killing the child. We can't simply close the subprocess handle to the coprocess because if the coprocess is really in an infinite loop it will not notice that its stdin was closed. The Unix way is to send it a signal via os.kill(). You've probably used kill -9 from the Unix command-line to kill a process. Same idea. Except that -9 (SIGKILL) is usually too harsh. It doesn't give the program any way to quit gracefully. Because mol2nam doesn't ignore SIGTERM, the polite request to exit, I'll use that one instead.
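Here is a small Python 3 sketch of the polite kill, with `sleep 60` standing in for a stuck coprocess. The helper name `stop_child` and its `grace` parameter are hypothetical. The fallback to SIGKILL is my addition for programs that do ignore SIGTERM; the wrapper in this essay doesn't need it because mol2nam exits on SIGTERM.

```python
import os
import signal
import subprocess

def stop_child(proc, grace=2.0):
    """Send SIGTERM; if the child is still alive after `grace` seconds,
    fall back to SIGKILL. Returns the child's exit status."""
    try:
        os.kill(proc.pid, signal.SIGTERM)
    except OSError:
        pass  # already died?
    try:
        # wait() also reaps the child so it doesn't linger as a zombie.
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.kill(proc.pid, signal.SIGKILL)
        return proc.wait()

child = subprocess.Popen(["sleep", "60"])  # stand-in for a stuck coprocess
status = stop_child(child)

# subprocess reports death-by-signal as the negative signal number.
print(status == -signal.SIGTERM)
```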
import re, select
import subprocess
import os, signal

MOL2NAM = "ogham/mol2nam"

class NamingError(Exception):
    pass

# Used to find the character position that caused the problem
_error_pos_pat = re.compile(r"^Warning: ( *)\^", re.MULTILINE)

# Check for characters other than printable ASCII (octal 040 through 176)
_unexpected_char_pat = re.compile(r"[^\040-\176]")

def _find_error(text):
    ...omitted; unchanged from the last time...
class Smi2Name:
    def __init__(self, executable = None, timeout = None):
        # a subprocess.Popen connected to mol2nam
        self._mol2nam = None
        if executable is None:
            executable = MOL2NAM
        self.executable = executable
        self.timeout = timeout

    def _get_mol2nam(self):
        ... omitted; unchanged from the last time...

    def smi2name(self, smiles):
        """convert a SMILES string into an IUPAC name"""
        if smiles == "":
            return "vacuum"
        m = _unexpected_char_pat.search(smiles)
        if m:
            raise NamingError("Unexpected character at position %d" %
                              (m.start(0)+1,))

        mol2nam = self._get_mol2nam()
        try:
            mol2nam.stdin.write(smiles + "\n")
        except IOError:
            # coprocess died since the last call? Restart the connection
            self._mol2nam = None
            mol2nam = self._get_mol2nam()
            mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()

        rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr],
                                    [], [], self.timeout)
        if mol2nam.stderr in rlist:
            # Tells mol2nam to quit
            mol2nam.stdin.close()
            stderr_text = mol2nam.stderr.read()
            # Doing this will restart the subprocess the next time through
            self._mol2nam = None
            raise NamingError(_find_error(stderr_text))

        if mol2nam.stdout in rlist:
            name = mol2nam.stdout.readline().rstrip()
            if "BLAH" in name:
                raise NamingError("Unsupported structure")
            return name

        # Timeout reached. Kill the child and restart.
        try:
            os.kill(mol2nam.pid, signal.SIGTERM)
        except OSError:
            # Already died?
            pass
        self._mol2nam = None
        raise NamingError("timeout reached")

... the rest of the code is unchanged...
Testing it out ...
>>> import smi2name_subprocess as smi2name
>>> namer = smi2name.Smi2Name(timeout = 2)
>>> namer.smi2name("C" * 1000)
'kiliane'
>>> namer.smi2name("C" * 5000)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/Users/dalke/tmp/smi2name_subprocess.py", line 96, in smi2name
    raise NamingError("timeout reached")
smi2name_subprocess.NamingError: timeout reached
>>> namer.smi2name("C" * 750)
'pentacontaheptactane'
>>>
The mol2nam program is easy to kill. Occasionally there are harder ones to kill. For example, the newly created program may in turn start its own processes. A signal sent to the intermediate parent is not passed on to those grandchild processes, so they need to be killed as well.
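There are a few ways to identify them. On some operating systems you can get a list of all the child processes of a given process. On Linux the information is in the /proc/PID/status file of every process, in the line starting "PPid:\t", which gives the process id of its parent. The children of a process are the ones whose PPid matches its pid.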
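As a sketch of the Linux approach, the following Python 3 code scans the numeric directories under /proc and collects the processes whose PPid line names a given parent. The function names `parse_ppid` and `find_children` and the `proc_root` parameter are hypothetical, and the code assumes a Linux-style /proc layout. It finds direct children only; to get grandchildren, call it again on each result.

```python
import os

def parse_ppid(status_text):
    """Return the parent pid from the text of a /proc/PID/status file."""
    for line in status_text.splitlines():
        if line.startswith("PPid:\t"):
            return int(line.split("\t", 1)[1])
    return None

def find_children(parent_pid, proc_root="/proc"):
    """Return the sorted pids whose PPid line equals parent_pid."""
    children = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only the numeric directories are processes
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path, errors="replace") as status_file:
                ppid = parse_ppid(status_file.read())
        except OSError:
            continue  # the process exited while we were scanning
        if ppid == parent_pid:
            children.append(int(entry))
    return sorted(children)
```

For example, `find_children(os.getpid())` lists the direct children of the current Python process. The scan is a snapshot, so a process can appear or disappear between the scan and the kill.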
In one case I found the best solution was to create a new process group, because then I could use os.killpg() to kill the leader of the group and all of its children, assuming none of the children decide to start a new process group. Making the new process group is easy. It needs to be done by subprocess.Popen() after the fork but before the exec, which is why there's a preexec_fn option to Popen(). For mol2nam it would look like this:
mol2nam = subprocess.Popen( (self.executable, "-"),
                            stdin = subprocess.PIPE,
                            stdout = subprocess.PIPE,
                            stderr = subprocess.PIPE,
                            close_fds = True,
                            preexec_fn = os.setpgrp )
To kill the process group, instead of using os.kill() use
os.killpg(mol2nam.pid, signal.SIGTERM)
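To see the process group at work without mol2nam, here is a Python 3 sketch where a shell is the child and a background `sleep` is the grandchild. The shell command and the variable names are stand-ins invented for the illustration. Because os.setpgrp() makes the child the leader of a new group, the group id is the same number as the child's pid, which is why the child's pid can be passed to os.killpg().

```python
import os
import signal
import subprocess

# Hypothetical stand-in: the shell starts a background `sleep` (the
# grandchild), prints the grandchild's pid, then waits for it.
proc = subprocess.Popen(
    ["sh", "-c", "sleep 60 & echo $!; wait"],
    stdout=subprocess.PIPE,
    preexec_fn=os.setpgrp)  # the child becomes leader of a new group

grandchild_pid = int(proc.stdout.readline().decode("ascii").strip())

# The grandchild inherits the group, so both report the same group id.
child_pgid = os.getpgid(proc.pid)
grandchild_pgid = os.getpgid(grandchild_pid)

# One call signals every process in the group.
os.killpg(child_pgid, signal.SIGTERM)
status = proc.wait()
proc.stdout.close()

print(child_pgid == proc.pid, grandchild_pgid == proc.pid)
```

In Python 3, Popen() also accepts `start_new_session=True`, which calls setsid() in the child. That creates a new process group as well, and avoids preexec_fn, which the subprocess documentation warns is not safe when the parent has threads. A side effect of either approach is that control-C in the terminal no longer reaches the child, since it has left the foreground process group.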
Whew! With this set of essays you've now got a good idea of the different ways to communicate with a command-line program and of some of the tricky things to watch out for. Now, start wrapping! :).
Copyright © 2001-2008 Dalke Scientific Software, LLC.


