Wrapping command-line programs, part V
I'm nearly out of tricks for dealing with cantankerous programs, but the ones I've outlined should be enough. I've only once had a real need to disassemble the binary and tweak the assembly code (which isn't allowed under my license with OpenEye). I've come across, but never tried, techniques like dynamically overriding a program while it's running.
So let's get to more practical matters. For the following I'll use the subprocess version of smi2name developed in the second essay of this current series.
OpenEye's naming code is very fast, but give it a large enough structure and it still takes some time. Here's a little program to time how long it takes to name structures of the form "C" * length where length is 1, 2, 5, 10, ....
    import time
    import smi2name_subprocess as smi2name

    # Make sure the coprocess has started
    smi2name.smi2name("C")

    for i in (1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000,
              5000, 10000, 20000):
        t1 = time.time()
        name = smi2name.smi2name("C" * i)
        t2 = time.time()
        print i, "%.5f" % (t2-t1)

As you can see, it scales roughly quadratically.
|length||time in seconds|
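One way to check a "roughly quadratic" claim is to fit a straight line to log(time) against log(length); the slope estimates the exponent. The sketch below is written for Python 3. The helper name `estimate_exponent` is hypothetical, and the timings are synthetic values generated from a formula purely for illustration. They are not measurements of mol2nam.

```python
import math

def estimate_exponent(pairs):
    """Least-squares slope of log(time) against log(length)."""
    xs = [math.log(n) for n, t in pairs]
    ys = [math.log(t) for n, t in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# SYNTHETIC data generated from t = 2e-8 * n**2, for illustration only.
# Replace it with the (length, seconds) pairs printed by the timing loop.
synthetic = [(n, 2e-8 * n ** 2) for n in (1000, 2000, 5000, 10000, 20000)]
exponent = estimate_exponent(synthetic)
print("estimated exponent: %.2f" % exponent)
```

A slope near 2 on real timings would support the quadratic reading. The smallest inputs are best left out of the fit because their times are dominated by the round trip to the coprocess.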
Suppose I hit control-C part way through a long computation.
    >>> import smi2name_subprocess as smi2name
    >>> smi2name.smi2name("C" * 10000)
    ^CTraceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "smi2name_subprocess.py", line 91, in smi2name
        return _smi2name(smiles)
      File "smi2name_subprocess.py", line 63, in smi2name
        rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr], [], [])
    KeyboardInterrupt
    >>> smi2name.smi2name("C")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "smi2name_subprocess.py", line 91, in smi2name
        return _smi2name(smiles)
      File "smi2name_subprocess.py", line 61, in smi2name
        mol2nam.stdin.write(smiles + "\n")
    IOError: [Errno 32] Broken pipe
    >>>

The control-C killed the coprocess. Why? I don't know. It's part of the great mystery that is Unix terminal process control. But it's easy to fix. If the write fails, reset the connection to the coprocess and try again. If it fails again then let the exception occur because there's no way to even attempt a recovery.
    def smi2name(self, smiles):
        """convert a SMILES string into an IUPAC name"""
        if smiles == "":
            return "vacuum"
        m = _unexpected_char_pat.search(smiles)
        if m:
            raise NamingError("Unexpected character at position %d" %
                              (m.start(0)+1,))

        mol2nam = self._get_mol2nam()
        try:
            mol2nam.stdin.write(smiles + "\n")
        except IOError:
            # coprocess died since the last call?  Restart the connection
            self._mol2nam = None
            mol2nam = self._get_mol2nam()
            mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()
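The same retry-once idea can be tried without mol2nam installed. This is a minimal sketch for Python 3, with assumptions: the `Coprocess` class and its `ask` method are hypothetical names, and the child is a stand-in that upper-cases each line. One difference from the code above is worth noting: in Python 3 the pipe to the child is buffered by default, so the broken pipe may only surface at `flush()`, which is why the sketch puts both the write and the flush inside the `try`.

```python
import subprocess
import sys

# Hypothetical stand-in for mol2nam: echoes each input line in upper case.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)

class Coprocess:
    def __init__(self):
        self._proc = None
        self.restarts = 0

    def _get_proc(self):
        # Start the child lazily, the same way _get_mol2nam() does
        if self._proc is None:
            self._proc = subprocess.Popen(
                [sys.executable, "-c", CHILD_CODE],
                stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                universal_newlines=True)
        return self._proc

    def _send(self, proc, text):
        proc.stdin.write(text + "\n")
        proc.stdin.flush()

    def ask(self, text):
        proc = self._get_proc()
        try:
            self._send(proc, text)
        except OSError:
            # Child died since the last call? Restart once and retry.
            # A second failure is allowed to propagate.
            self._proc = None
            self.restarts += 1
            proc = self._get_proc()
            self._send(proc, text)
        return proc.stdout.readline().rstrip("\n")

    def close(self):
        if self._proc is not None:
            self._proc.stdin.close()
            self._proc.wait()
            self._proc = None
```

Killing the child between two calls to `ask` should show one restart and a normal answer from the second call. The sketch does not handle a child that dies after the write but before it replies.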
Suppose that mol2nam occasionally takes too long to run. It might be because of an extremely long SMILES input or some sort of hypothetical bug that puts it in an infinite loop. I want the wrapper code to identify that case and recover. Detection is easy. If there's nothing from stdout or stderr for more than some timeout value then it's taken too long. We're already using a select() call to figure out which pipe has activity. select() takes an optional 4th argument, which is the timeout in seconds. If the timeout is reached it returns three empty lists.
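The timeout behavior of select() is easy to see in isolation. This small Python 3 sketch uses a plain os.pipe() instead of a coprocess; the 0.2 second timeout and the variable names are arbitrary choices for the illustration.

```python
import os
import select

read_fd, write_fd = os.pipe()

# Nothing has been written yet, so select() gives up after 0.2 seconds
# and returns three empty lists.
rlist, wlist, xlist = select.select([read_fd], [], [], 0.2)
timed_out = (rlist == [] and wlist == [] and xlist == [])

# Once data is waiting, the same call reports the descriptor as readable.
os.write(write_fd, b"methane\n")
rlist2, _, _ = select.select([read_fd], [], [], 0.2)
ready = read_fd in rlist2

os.close(read_fd)
os.close(write_fd)
print(timed_out, ready)
```

Passing None as the timeout, or leaving it out, makes select() wait indefinitely, which is why a default of timeout = None keeps the old behavior.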
What remains is killing the child. We can't simply close the subprocess handle to the coprocess, because if the coprocess really is in an infinite loop it will never notice that its stdin was closed. The Unix way is to send it a signal via os.kill(). You've probably used kill -9 from the Unix command-line to kill a process. Same idea, except that -9 (SIGKILL) is usually too harsh: it doesn't give the program any way to quit gracefully. Because mol2nam doesn't ignore SIGTERM, the polite request to end, I'll use that one instead.
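Here is a minimal Python 3 sketch of the polite-then-harsh approach, using a sleeping Python child as a stand-in for a stuck program. The `stop_child` helper and its `grace` parameter are hypothetical names, and the fallback to SIGKILL is a common pattern I've added for illustration; the wrapper in this essay only sends SIGTERM.

```python
import os
import signal
import subprocess
import sys

def stop_child(proc, grace=2.0):
    """Send SIGTERM; if the child is still alive after `grace` seconds,
    send SIGKILL. Returns the child's return code."""
    if proc.poll() is not None:
        return proc.returncode          # already exited
    try:
        os.kill(proc.pid, signal.SIGTERM)
    except OSError:
        pass                            # already died?
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.kill(proc.pid, signal.SIGKILL)
        return proc.wait()

# Stand-in for a program that takes far too long
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"])
returncode = stop_child(child)
print("return code:", returncode)
```

On Unix, subprocess reports a child killed by a signal with a negative return code, so a child stopped by SIGTERM shows up as -signal.SIGTERM.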
    import re, select
    import subprocess
    import os, signal

    MOL2NAM = "ogham/mol2nam"

    class NamingError(Exception):
        pass

    # Used to find the character position that caused the problem
    _error_pos_pat = re.compile(r"^Warning: ( *)\^", re.MULTILINE)

    # Check for characters other than printable ASCII (octal 040 through 176)
    _unexpected_char_pat = re.compile(r"[^\040-\176]")

    def _find_error(text):
        ...omitted; unchanged from the last time...

    class Smi2Name:
        def __init__(self, executable = None, timeout = None):
            # a subprocess.Popen connected to mol2nam
            self._mol2nam = None
            if executable is None:
                executable = MOL2NAM
            self.executable = executable
            self.timeout = timeout

        def _get_mol2nam(self):
            ... omitted; unchanged from the last time...

        def smi2name(self, smiles):
            """convert a SMILES string into an IUPAC name"""
            if smiles == "":
                return "vacuum"
            m = _unexpected_char_pat.search(smiles)
            if m:
                raise NamingError("Unexpected character at position %d" %
                                  (m.start(0)+1,))

            mol2nam = self._get_mol2nam()
            try:
                mol2nam.stdin.write(smiles + "\n")
            except IOError:
                # coprocess died since the last call?  Restart the connection
                self._mol2nam = None
                mol2nam = self._get_mol2nam()
                mol2nam.stdin.write(smiles + "\n")
            mol2nam.stdin.flush()

            rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr],
                                        [], [], self.timeout)
            if mol2nam.stderr in rlist:
                # Tells mol2nam to quit
                mol2nam.stdin.close()
                stderr_text = mol2nam.stderr.read()
                # Doing this will restart the subprocess the next time through
                self._mol2nam = None
                raise NamingError(_find_error(stderr_text))

            if mol2nam.stdout in rlist:
                name = mol2nam.stdout.readline().rstrip()
                if "BLAH" in name:
                    raise NamingError("Unsupported structure")
                return name

            # Timeout reached.  Kill the child and restart.
            try:
                os.kill(mol2nam.pid, signal.SIGTERM)
            except OSError:
                # Already died?
                pass
            self._mol2nam = None
            raise NamingError("timeout reached")

    ... the rest of the code is unchanged...

Testing it out ...
    >>> import smi2name_subprocess as smi2name
    >>> namer = smi2name.Smi2Name(timeout = 2)
    >>> namer.smi2name("C" * 1000)
    'kiliane'
    >>> namer.smi2name("C" * 5000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/Users/dalke/tmp/smi2name_subprocess.py", line 96, in smi2name
        raise NamingError("timeout reached")
    smi2name_subprocess.NamingError: timeout reached
    >>> namer.smi2name("C" * 750)
    'pentacontaheptactane'
    >>>
The mol2nam program is easy to kill. Occasionally there are harder ones. For example, the newly created program may in turn start its own processes. Those grandchild processes may not be affected by the signal sent to the intermediate parent, so they need to be killed as well.
There are a few ways to identify them. On some operating systems you can get a list of all the child processes of a given process. For Linux, each process's parent is recorded in /proc/pid/status on the line starting "PPid:\t", so scanning those files finds the children.
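A minimal Python 3 sketch of the /proc approach follows. It is Linux-only, the function names `parse_ppid` and `find_children` are hypothetical, and it only finds direct children; finding grandchildren means calling it again on each result.

```python
import os

def parse_ppid(status_text):
    """Return the parent pid from the text of a /proc/pid/status file,
    or None if there is no PPid line."""
    for line in status_text.splitlines():
        if line.startswith("PPid:\t"):
            return int(line.split("\t", 1)[1])
    return None

def find_children(parent_pid, proc_dir="/proc"):
    """Return the pids of the direct children of parent_pid (Linux only)."""
    children = []
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue                    # not a process directory
        path = os.path.join(proc_dir, entry, "status")
        try:
            with open(path, errors="replace") as f:
                ppid = parse_ppid(f.read())
        except OSError:
            continue                    # process exited while scanning
        if ppid == parent_pid:
            children.append(int(entry))
    return children
```

The scan is inherently racy: a process can start or exit between reading the directory listing and reading its status file, so the result is a snapshot at best.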
In one case I found the best solution was to create a new process group, because then I could use os.killpg() to kill the parent of the group and all of its children, assuming none of the children decide to start a new process group. Making the new process group is easy. It needs to be done by subprocess.Popen() after the fork but before the exec, which is why there's a preexec_fn option to Popen(). For mol2nam it would look like this:
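    mol2nam = subprocess.Popen(
        (self.executable, "-"),
        stdin = subprocess.PIPE,
        stdout = subprocess.PIPE,
        stderr = subprocess.PIPE,
        close_fds = True,
        preexec_fn = os.setpgrp
        )

To kill the process group, instead of using os.kill() use

    os.killpg(mol2nam.pid, signal.SIGTERM)

The child's pid doubles as the process group id because the child called os.setpgrp().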
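The process-group technique can be tried on any Unix system with a shell. In this Python 3 sketch the `sh` command line is a stand-in for a program that starts its own helper process; it is not part of the mol2nam wrapper.

```python
import os
import signal
import subprocess

# Stand-in for a program that starts its own helper: the shell runs one
# sleep in the background (a grandchild of this script) and another in
# the foreground.
proc = subprocess.Popen(
    ["sh", "-c", "sleep 60 & sleep 60"],
    preexec_fn=os.setpgrp,   # child becomes leader of a new process group
)

# Because the child called setpgrp(), its process group id equals its pid.
pgid = os.getpgid(proc.pid)
same_as_pid = (pgid == proc.pid)

# Signal every process in the group, the background sleep included.
os.killpg(pgid, signal.SIGTERM)
returncode = proc.wait()
print(same_as_pid, returncode)
```

On Python 3, passing start_new_session=True to Popen() is an alternative to preexec_fn: it calls setsid() in the child, which also puts the child in a new process group.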
Whew! With this set of essays you've now got a good idea of the different ways to communicate with a command-line program and of some of the tricky things to watch out for. Now, start wrapping! :).
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.
Copyright © 2001-2013 Andrew Dalke Scientific AB