/home/writings/diary/archive/2005/04/17/wrapping_command_line_programs_III

Wrapping command-line programs, part III

In the second article in this series I showed how to use OpenEye's mol2nam program as a coprocess from Python. To make it work I had to edit the original source code to add an fflush after writing the name. Otherwise the output was buffered and inaccessible to the Python wrapper. Sadly, programs in this field rarely come with recompilable source code. What could be done if I couldn't add the fflush?

When the C stdio library initializes stdout it checks if the output is a terminal. If so it sets the output mode to line buffered. Otherwise it is put into block buffered mode. A console window is a terminal but files and pipes are not. Not many people use actual terminals these days ("tty" is short for "teletypewriter"). Instead they are emulated using what are called pseudo-ttys. We can create and use our own pty to communicate with the original OpenEye mol2nam in line buffered mode.
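
The "is this a terminal?" check can be imitated from Python. Here's a minimal sketch, assuming a Unix system and Python 3 (not the Python 2 used in the listings in this essay). The function name describe_fds is made up for the illustration. It asks the question for the write end of a pipe and for the slave end of a freshly created pseudo-tty.

```python
import os
import pty

def describe_fds():
    """Return the isatty() answer for a pipe and for the slave end of a pty."""
    read_end, write_end = os.pipe()
    master, slave = pty.openpty()
    try:
        return {"pipe": os.isatty(write_end), "pty": os.isatty(slave)}
    finally:
        for fd in (read_end, write_end, master, slave):
            os.close(fd)

if __name__ == "__main__":
    print(describe_fds())
```

A program whose stdout is the pipe would get block buffering from stdio, while one whose stdout is the pty slave would get line buffering.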

This gets into an aspect of Unix that I don't know well. There's a 30+ year history of terminal control that I've never had to worry about and was never interested in learning. I have only vague ideas of what ioctl, fcntl and tcgetattr/tcsetattr do. What I'm about to describe works, but there may be ways to make it work better. Please let me know if there's a better way.

Instead what I do is let someone else provide a higher-level interface to the terminal control functions. Pexpect is a Python library influenced by Don Libes' venerable Expect package. It opens a pty connected to a process, sets the terminal modes correctly, provides a Python file-like interface, and a few bits of extra functionality. For more details you should read the documentation and scan the source code.

Using pexpect is quite simple, except when aspects of the archaic, baroque pty interface appear. The main interface is the spawn class, which takes the command to run and an optional timeout. The newly created instance implements file-like methods, so you can still read, readline and write to the interface. Here's the code to connect to mol2nam and skip the three lines of the header:

    def _get_mol2nam(self):
        if self._mol2nam is None:
            mol2nam = pexpect.spawn(self.executable, ["-"])

            # skip the three header lines
            mol2nam.readline()
            mol2nam.readline()
            mol2nam.readline()
            self._mol2nam = mol2nam
            
        return self._mol2nam

One difference between this and the subprocess interface is that everything is communicated over a single bidirectional connection. There is no difference between stdin, stdout and stderr. That's why the previous code snippet used only readline() and didn't specify which input to read from.
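
The single shared connection can be seen without pexpect. This is a rough sketch using only the standard pty and subprocess modules in Python 3, so it is a stand-in for what pexpect does rather than pexpect's own code. The child program and the name run_on_pty are invented for the example. The child writes one line to stdout and one to stderr, and both arrive on the same master file descriptor.

```python
import os
import pty
import select
import subprocess
import sys

# Hypothetical child: one line to stdout, one line to stderr.
CHILD = ('import sys; sys.stdout.write("to stdout\\n"); sys.stdout.flush(); '
         'sys.stderr.write("to stderr\\n")')

def run_on_pty():
    """Run the child with stdin, stdout and stderr on one pty; return its output."""
    master, slave = pty.openpty()
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdin=slave, stdout=slave, stderr=slave)
    os.close(slave)  # the child keeps its own copy of the slave end
    data = b""
    try:
        while data.count(b"\n") < 2:
            ready = select.select([master], [], [], 5.0)[0]
            if not ready:
                break  # nothing arrived in time
            try:
                chunk = os.read(master, 1024)
            except OSError:
                break  # Linux reports an I/O error once the child side is closed
            if not chunk:
                break
            data += chunk
    finally:
        proc.wait()
        os.close(master)
    return data

if __name__ == "__main__":
    print(repr(run_on_pty()))
```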

By default terminals echo the input so something written to the spawn instance will also be read. When I write the SMILES line to the process I need to skip the echoed response, like this:

        mol2nam = self._get_mol2nam()
        mol2nam.write(smiles + "\n")
        
        mol2nam.readline()  # skip echoed line
        line = mol2nam.readline()
The spawn class has a setecho method which I hoped would prevent the echoing. When I toggled it I no longer got output from mol2nam. I don't know why. I wrote a simple program that implements enough of the mol2nam protocol to pass the self test but was not able to make it reproduce the problem. Did I mention I don't like working with ptys?
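
For reference, here is what echo looks like on a bare pty, and what changes when the ECHO flag is cleared with termios. This is only a sketch with the standard library under Python 3. It does not use pexpect, the names read_line and echo_demo are made up, and it makes no claim about why setecho misbehaved with mol2nam.

```python
import os
import pty
import select
import termios

def read_line(fd, timeout=1.0):
    """Collect bytes from fd until a newline arrives or nothing shows up in time."""
    data = b""
    while not data.endswith(b"\n"):
        ready = select.select([fd], [], [], timeout)[0]
        if not ready:
            break
        chunk = os.read(fd, 1024)
        if not chunk:
            break
        data += chunk
    return data

def echo_demo(echo):
    """Send one line into a pty; return (bytes echoed back, bytes the child side gets)."""
    master, slave = pty.openpty()
    try:
        attrs = termios.tcgetattr(slave)
        if echo:
            attrs[3] |= termios.ECHO    # attrs[3] holds the local mode flags
        else:
            attrs[3] &= ~termios.ECHO
        termios.tcsetattr(slave, termios.TCSANOW, attrs)
        os.write(master, b"C#N\n")
        received = read_line(slave)     # what the wrapped program would read
        echoed = read_line(master)      # what the wrapper would read back
        return echoed, received
    finally:
        os.close(master)
        os.close(slave)

if __name__ == "__main__":
    print(echo_demo(True))
    print(echo_demo(False))
```

With echo on, the wrapper side sees its own input come back as a line of its own. With echo off there is nothing to skip, so code that still discards one line after each write would throw away the real answer.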

The readline method gets both the systematic name sent to stdout and the error messages sent to stderr. Luckily they are easy to distinguish because "Warning:" is not in any systematic name. As before, if there is an error I want to restart the connection. I can't simply close the input stream then read from the output stream because they are the same connection. If I close one then I close the other. Instead I need to use a pexpect method called sendeof which tells mol2nam that there is no more input. The relevant code is:

        if line.startswith("Warning:"):
            # Tell mol2nam to exit, process the output to get the
            # error message, then reset for the next SMILES.
            mol2nam.sendeof()
            text = line + mol2nam.read()
            self._mol2nam =  None
            raise NamingError(_find_error(text))
        elif "BLAH" in line:
            raise NamingError("Unsupported structure")
        else:
            return line.rstrip()
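
On a terminal, "no more input" is signalled by sending the end-of-file character (usually control-D) at the start of a line, since the connection itself has to stay open. The following Python 3 sketch shows that mechanism on a bare pty from the standard library. It illustrates the idea behind sendeof and is not pexpect's implementation. The name eof_demo is invented.

```python
import os
import pty
import select
import termios

def eof_demo():
    """Write a line followed by the EOF character; return what the child side reads."""
    master, slave = pty.openpty()
    try:
        # Look up the terminal's EOF character, normally b"\x04" (control-D).
        eof_char = termios.tcgetattr(slave)[6][termios.VEOF]
        os.write(master, b"CC1\n" + eof_char)
        reads = []
        for attempt in range(2):
            ready = select.select([slave], [], [], 2.0)[0]
            reads.append(os.read(slave, 1024) if ready else None)
        return reads
    finally:
        os.close(master)
        os.close(slave)

if __name__ == "__main__":
    print(eof_demo())
```

The first read returns the line and the second returns an empty result, which is how a program reading stdin learns the input has ended.
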
Bear in mind that lines coming from a pty end with "\r\n" and not just the "\n" character. This is an aspect of using an API designed to support typewriter printer carriages. If the last line had been return line[:-1] then it would have the extra "\r" character. The rstrip() method removes all whitespace on the right so works just fine.
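
To see where the "\r" comes from, this small Python 3 sketch writes a line ending in "\n" to the slave side of a standard library pty and reads it from the master side. The name pty_line is made up, and the default terminal settings are assumed.

```python
import os
import pty
import select

def pty_line():
    """Write b"methane\\n" as the child would; return the text the wrapper side reads."""
    master, slave = pty.openpty()
    try:
        os.write(slave, b"methane\n")
        data = b""
        while not data.endswith(b"\n"):
            ready = select.select([master], [], [], 2.0)[0]
            if not ready:
                break
            data += os.read(master, 1024)
        return data.decode("ascii")
    finally:
        os.close(master)
        os.close(slave)

if __name__ == "__main__":
    line = pty_line()
    print(repr(line), repr(line[:-1]), repr(line.rstrip()))
```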

With these in place I ran the regression test code. It didn't pass. The problem was with the very long SMILES string meant to force mol2nam to give a segmentation fault. I couldn't figure out what was going on so I finally wrote a new program that implements the mol2nam API and should be able to pass the regression test. This let me watch what was going on with that side of the connection. Because stdout and stderr both went back to the wrapper code, I opened /dev/ttyp4 and wrote the debugging output there, as you can see in the code. I had another terminal window open and from the tty command I knew it was using that pseudo-tty handle. (In unix nearly all I/O is a file.) By writing to that file the output goes to that terminal's display. Another advantage is the try/except block around the call to main(). It let me see the exceptions during testing, which otherwise would have been put somewhere in the pexpect interface.
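
The "write to another terminal's device file" trick can be tried without a second window. In this Python 3 sketch a pty from the standard library stands in for the other terminal, and os.ttyname plays the role of the tty command. The name debug_to_terminal is invented for the example.

```python
import os
import pty
import select

def debug_to_terminal():
    """Open a terminal by its device path, write to it, return (path, bytes displayed)."""
    master, slave = pty.openpty()
    try:
        path = os.ttyname(slave)  # the device name, as the tty command would report it
        debug_fd = os.open(path, os.O_WRONLY | os.O_NOCTTY)
        debugf = os.fdopen(debug_fd, "wb", 0)  # unbuffered, like the debug file here
        debugf.write(b"Starting\n")
        debugf.close()
        shown = b""
        while not shown.endswith(b"\n"):
            ready = select.select([master], [], [], 2.0)[0]
            if not ready:
                break
            shown += os.read(master, 1024)
        return path, shown
    finally:
        os.close(master)
        os.close(slave)

if __name__ == "__main__":
    print(debug_to_terminal())
```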

Here's my test code. I open the debug file in unbuffered mode and used the "-u" option on the #! line to tell Python to use unbuffered stdin and stdout. The function _W constructs an appropriate "Warning:" message that will pass the smi2name test code.

#!/usr/bin/python -u
import sys

debugf = open("/dev/ttyp4", "w", 0)
debugf.write("Starting\n")

def _W(s, i):
    lines = ["Warning:",
             "Warning: " + s]
    if i:
        lines.append("Warning:" + (" "*i) + "^\n")
    return "\n".join(lines) + "\n"

answers = {
    "C": "methane",
    "U": _W("", 1),
    "CC1": _W("Unclosed ring.", 3),
    "LONG": _W("Unclosed branch.", None),
    "CCCC(C": _W("Unclosed branch.", 6),
    "CCCCCC)C": _W("", 7),
    "[U]": "uranium",
    "C1CC23CC4CC3C1C(C2)CC4": "BLAH",
    "C#N": "hydrogen cyanide",
}

def main():
    print "header line 1"
    print "header line 2"
    print "header line 3"


    while 1:
        line = sys.stdin.readline()
        print >>debugf, "Got line", repr(line)
        if not line:
            break
        if len(line) > 80:
            line = "LONG"
        print answers[line.rstrip()]


try:
    main()
except:
    import traceback
    traceback.print_exc(file=debugf)

When I used this I found that the long SMILES was never getting to the coprocess, though the short SMILES strings were. Through experimentation I found that if the SMILES was 1000 characters or smaller then it would be sent to the coprocess but anything longer caused problems. When I tried to write a long SMILES I found that I got several chr(7) bytes if I read from the spawn interface. ASCII character 7 is for BEL, which should ring the terminal bell. This strongly suggests the terminal is buffering.

Terminals have two major modes: cooked and raw. When you type something on the command-line you expect to be able two^H^Ho edit the line before pressing enter. Various characters get treated as editing characters, like backspace (which is often either ASCII 8/^H or ASCII 127/^?) and "kill line" (ASCII 21/^U). You also expect control-C (ASCII 3) to kill a process and control-Z (ASCII 26) to suspend it. When the terminal supports these conversions it is in cooked mode because it is processing the input. Otherwise it is in raw mode. Programs like vi and emacs use raw mode to capture each character as it's pressed and to change the meaning of things like control-C.

To test if this was the case I used the test string "CC"+chr(21)+"S". In cooked mode the special character in the middle kills the line; it erases everything before it on the input line. If the terminal is in cooked mode then the result should be "hydrogen sulfide". And indeed it is. I also tried using chr(8) for backspace but had to switch to chr(127) which is what the terminal actually uses. The "stty -a" command lists all of the special characters.

cchars: discard = ^O; dsusp = ^Y; eof = ^D; eol = <undef>;
        eol2 = <undef>; erase = ^?; intr = ^C; kill = ^U; lnext = ^V;
        min = 1; quit = ^\; reprint = ^R; start = ^Q; status = <undef>;
        stop = ^S; susp = ^Z; time = 0; werase = ^W;
The backspace worked as did chr(3) for control-C and chr(26) for control-Z.
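
The same experiment can be run against a bare pty. This Python 3 sketch uses only the standard library, and the names read_line and cooked_demo are made up. It looks up the kill and erase characters from the terminal settings (the values "stty -a" prints), embeds them in the input, and reads what the program on the other side would actually receive.

```python
import os
import pty
import select
import termios

def read_line(fd, timeout=2.0):
    """Collect bytes from fd until a newline arrives or nothing shows up in time."""
    data = b""
    while not data.endswith(b"\n"):
        ready = select.select([fd], [], [], timeout)[0]
        if not ready:
            break
        chunk = os.read(fd, 1024)
        if not chunk:
            break
        data += chunk
    return data

def cooked_demo():
    """Return the lines a child would read after kill-line and erase editing."""
    master, slave = pty.openpty()
    try:
        cc = termios.tcgetattr(slave)[6]          # the special characters
        kill, erase = cc[termios.VKILL], cc[termios.VERASE]
        os.write(master, b"CC" + kill + b"S\n")   # kill-line wipes the "CC"
        after_kill = read_line(slave)
        os.write(master, b"CCX" + erase + b"O\n") # erase removes the "X"
        after_erase = read_line(slave)
        return after_kill, after_erase
    finally:
        os.close(master)
        os.close(slave)

if __name__ == "__main__":
    print(cooked_demo())
```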

The problem is we're in cooked mode so the terminal saves a 1000 byte buffer to allow for editing. I want to switch into raw mode but I don't know how. When I try "import tty" then "tty.setraw(mol2nam.child_fd)" then the interface just hangs. Like I said, I don't know the details of ptys well enough.
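
For comparison, here is raw mode on a bare pty from the standard library, in Python 3. This is a sketch only. It is applied to the slave end of a pty created with pty.openpty, not to pexpect's child_fd, and it is not offered as a fix for the hang described above. The name raw_demo is invented.

```python
import os
import pty
import select
import tty

def raw_demo():
    """Put the pty in raw mode and return the bytes the child side receives."""
    master, slave = pty.openpty()
    try:
        tty.setraw(slave)       # turns off line editing, echo and signal characters
        sent = b"CC\x15S\n"     # contains control-U, the usual kill-line character
        os.write(master, sent)
        data = b""
        while len(data) < len(sent):
            ready = select.select([slave], [], [], 2.0)[0]
            if not ready:
                break
            data += os.read(slave, 1024)
        return data
    finally:
        os.close(master)
        os.close(slave)

if __name__ == "__main__":
    print(repr(raw_demo()))
```

The control-U arrives untouched instead of erasing the line. Note that tty.setraw also switches off output processing, so lines written by the child would end in "\n" rather than "\r\n", which may matter to code that waits for "\r\n".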

Luckily for me I don't need to know them. For this interface it's okay to limit the SMILES string to no more than 1,000 characters and to prohibit anything other than the printable ASCII characters. I mentioned in the first essay of this current series that I don't like checking for incorrect data at this level of the API. The exception is for cases like this where bad input can cascade and have big or unexpected problems.

Here's the code for checking for these cases.

# Check for characters other than printable ASCII:
# space is octal 040 and "~" is octal 176
_unexpected_char_pat = re.compile(r"[^\040-\176]")
...
    def smi2name(self, smiles):
        """convert a SMILES string into an IUPAC name"""
        if smiles == "":
            return "vacuum"
        elif "\n" in smiles:
            raise NamingError("Newline not allowed in SMILES")
        elif len(smiles) > 1000:
            raise NamingError("SMILES too long")
        m = _unexpected_char_pat.search(smiles)
        if m:
            raise NamingError("Unexpected character at position %d" %
                              (m.start(0)+1,))
        ...
...
and the test cases for them. Note that the error message for the long SMILES string has changed.
def test():
    for smi, name, errmsg in (
        ("C", "methane", None),
        ("C"+chr(127)+"S", None, "Unexpected character at position 2"),
        ("CC"+chr(3), None, "Unexpected character at position 3"),
        ("S", "hydrogen sulfide", None),
            ...
        ("C"*32764 + "(C)", None, "SMILES too long"),
        ("C"*1000 , "kiliane", None),
        ("C"*1001 , None, "SMILES too long"),

Hmm, "\n" is also a special character so I can remove the special test for "Newline not allowed in SMILES".
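
A quick check of that claim, as a Python 3 sketch. The pattern here is written with hex escapes for the same printable range (space through "~"), and the name first_bad_position is made up for the example.

```python
import re

# Anything outside space (0x20) through "~" (0x7e) is rejected.
printable_only = re.compile(r"[^\x20-\x7e]")

def first_bad_position(smiles):
    """Return the 1-based position of the first non-printable character, or None."""
    m = printable_only.search(smiles)
    if m is None:
        return None
    return m.start(0) + 1

if __name__ == "__main__":
    print(first_bad_position("C\nC"))
    print(first_bad_position("C#N"))
```

Because the newline is caught at position 2 by the general pattern, a separate newline test adds nothing except a different error message.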

Here's the version of the wrapper code that uses pseudo-ttys through the pexpect library:


import os, re
import pexpect

MOL2NAM = os.path.join(os.environ["OE_DIR"], "bin", "mol2nam")

class NamingError(Exception):
    pass

# Used to find the character position that caused the problem
_error_pos_pat = re.compile(r"^Warning: ( *)\^", re.MULTILINE)

# Check for characters other than printable ASCII:
# space is octal 040 and "~" is octal 176
_unexpected_char_pat = re.compile(r"[^\040-\176]")

def _find_error(text):
    errmsg = "Cannot parse SMILES"
    if "\nWarning: Unclosed branch." in text:
        errmsg = "Unclosed branch"
    elif "\nWarning: Unclosed ring." in text:
        errmsg = "Unclosed ring"

    m = _error_pos_pat.search(text)
    if m:
        errpos = len(m.group(1)) + 1
        errmsg = errmsg + " at position %d" % errpos
        
    return errmsg

class Smi2Name:
    def __init__(self, executable = None):
        # a pexpect.spawn connected to mol2nam
        self._mol2nam = None
        if executable is None:
            executable = MOL2NAM
        self.executable = executable

    def _get_mol2nam(self):
        if self._mol2nam is None:
            mol2nam = pexpect.spawn(self.executable, ["-"])

            # skip the three header lines
            mol2nam.readline()
            mol2nam.readline()
            mol2nam.readline()
            self._mol2nam = mol2nam
            
        return self._mol2nam
    
    def smi2name(self, smiles):
        """convert a SMILES string into an IUPAC name"""
        if smiles == "":
            return "vacuum"
        elif len(smiles) > 1000:
            raise NamingError("SMILES too long")
        m = _unexpected_char_pat.search(smiles)
        if m:
            raise NamingError("Unexpected character at position %d" %
                              (m.start(0)+1,))

        mol2nam = self._get_mol2nam()
        mol2nam.write(smiles + "\n")
        
        mol2nam.readline()  # skip echoed line
        line = mol2nam.readline()
        
        if line.startswith("Warning:"):
            # Tell mol2nam to exit, process the output to get the
            # error message, then reset for the next SMILES.
            mol2nam.sendeof()
            text = line + mol2nam.read()
            self._mol2nam =  None
            raise NamingError(_find_error(text))
        elif "BLAH" in line:
            raise NamingError("Unsupported structure")
        else:
            return line.rstrip()

# Defer instantiation of the wrapper until it's needed.
# This lets other code change MOL2NAM if needed, but changes
# will only work if done before calling this function.
_smi2name = None
def smi2name(smiles):
    """convert a SMILES string into an IUPAC name"""
    global _smi2name
    if _smi2name is None:
        _smi2name = Smi2Name().smi2name
    return _smi2name(smiles)
        

def test():
    for smi, name, errmsg in (
        ("C", "methane", None),
        ("C"+chr(127)+"S", None, "Unexpected character at position 2"),
        ("CC"+chr(3), None, "Unexpected character at position 3"),
        ("S", "hydrogen sulfide", None),
        ("U", None, "Cannot parse SMILES at position 1"),
        ("CC1", None, "Unclosed ring at position 3"),
        ("C", "methane", None),
        ("C"*32764 + "(C)", None, "SMILES too long"),
        ("C"*1000 , "kiliane", None),
        ("C"*1001 , None, "SMILES too long"),
        ("C\nC", None, "Unexpected character at position 2"),
        ("CCCC(C", None, "Unclosed branch at position 6"),
        ("CCCCCC)C", None, "Cannot parse SMILES at position 7"),
        ("[U]", "uranium", None),
        ("", "vacuum", None),
        ("C1CC23CC4CC3C1C(C2)CC4", None, "Unsupported structure"),
        ("C#N", "hydrogen cyanide", None)):

        computed_name = computed_errmsg = None
        try:
            computed_name = smi2name(smi)
        except NamingError, err:
            computed_errmsg = str(err)

        if (name != computed_name or
            errmsg != computed_errmsg):
            raise AssertionError("SMILES: %r expected (%r %r) got (%r %r)"
                                 % (smi, name, errmsg,
                                    computed_name, computed_errmsg))
    print "All tests passed."


if __name__ == "__main__":
    test()
and here are the timing numbers I got
Total time: 79.91
Time per compound: 0.01
59.450u 12.400s 1:20.11 89.6%   0+0k 0+1io 0pf+0w
At 80 seconds there is definitely a performance hit using a pty, probably from terminal cooking though I didn't try to track it down. Compare that to the subprocess pipe code which runs in 25 seconds and the command-line interface at 15 seconds. Still it's better than the first version which restarted mol2nam for every call and took nearly 370 seconds. Unlike the subprocess version it uses the mol2nam provided by OEChem (no recompile needed) and unlike mol2nam by itself it reports exactly which structures could not be parsed.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2013 Andrew Dalke Scientific AB