Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2005/04/15/wrapping_command_line_programs_II

Wrapping command-line programs, part II

My previous essay showed how to create Python function called smi2name which in turn calls OEChem's mol2nam to compute the systematic name from the SMILE string.

While it works there are ways to improve the code for better performance and flexibility. Let's start with the performance. I pulled out the first 10,000 compounds from the NCI data set and converted them into SMILES. Using the command-line program directly it takes

% time mol2nam first_10000_nci.smi > /dev/null
mol2nam v1.0  Structure to Name Conversion
OpenEye Scientific Software, November 2003

15.010u 0.110s 0:16.67 90.7%    0+0k 0+0io 0pf+0w
% 
about 15 seconds to convert all the records into names.

By comparison, here's code to test the performance of the Python interface. There is some extra code to report progress and list which compounds had problems but their performance impact should be minimal.


from smi2name import smi2name

def timings():
    import time
    t1 = time.time()
    for i, line in enumerate(open("first_10000_nci.smi")):
        if i % 1000 == 999:
            print "%.4f s/cmpd" % ((time.time()-t1) / (i+1))
        smiles = line[:-1]
        try:
            name = smi2name(smiles)
        except NamingError, err:
            print "problem", repr(smiles), repr(str(err))
    t2 = time.time()
    print "Total time: %.2f" % (t2-t1)
    print "Time per compound: %.4f" % ((t2-t1)/(i+1))

if __name__ == "__main__":
    timings()

The results?
Total time: 368.47
Time per compound: 0.04
137.820u 168.610s 6:08.74 83.1% 0+0k 0+0io 0pf+0w
or roughly 1/10th the performance.

Where is the time spent? I had the timing code use a dummy smi2name function

def smi2name(smiles):
    return smiles
and the result was.
Total time: 0.07
Time per compound: 0.00
0.170u 0.080s 0:00.28 89.2%     0+0k 0+0io 0pf+0w
Nearly all the time is spent in smi2mol. From experience I know most of the time is spent in starting a new external program every time. The Python code needs to do a fork/exec. While mol2nam has its own startup costs this says 83.1% of the CPU time was spent in the Python code so that cost is not a large contribution.

If the performance hit is a problem it can be solved by using mol2nam as a coprocess. (I'm using the term from Stevens' Advanced Programming in the UNIX Environment.) The external program already expects a list of structures as input and gives the names for each one as output. It should be a straight-forward matter to keep its stdin/stdout/stderr pipes open. When a new name is needed, write the SMILES to its stdin and check its stdout and stderr.

This version of the wrapper will store information about the interface so I need to use a class. I'll use the __call__ special method so that class instances can implement function semantics. Note that the subprocess isn't started until it's needed. As you see, I create an instance of it later on in the module, referenced by smi2name. If the subprocess is started in the constructor then importing the module always has the side effect of opening the connection. This is inelegant: if smi2name is never called why open the connection and incur the import performance hit? It's also a bug on some systems because MOL2NAM might not be located in $OE_DIR/bin/mol2nam. Code that uses this module should have the chance to import it and change the setting before the connection starts.

import os
import subprocess

MOL2NAM = os.path.join(os.environ["OE_DIR"], "bin", "mol2nam")

class Smi2Name:
    def __init__(self):
        # a subprocess.Popen connected to mol2nam
        self._mol2nam = None

    def _get_mol2nam(self):
        if self._mol2nam is None:
            self._mol2nam = subprocess.Popen( (MOL2NAM, "-"),
                                              stdin = subprocess.PIPE,
                                              stdout = subprocess.PIPE,
                                              stderr = subprocess.PIPE,
                                              close_fds = True)
        return self._mol2nam
    
    def __call__(self, smiles):
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
        return mol2nam.stdout.readline()

smi2name = Smi2Name()

print smi2name("C")

Run it and ... Python hangs. Why?

Again, from experience the most likely reason is the buffering done by the I/O library. It's faster to read and write a block of text at a time than character at a time but it's often easiest to write a program expecting to process a character at a time. The standard C and C++ libraries include an I/O layer which can help. It buffers multiple read or write requests in user space (the program) and merges them into a single read/write made to the operating system.

There are three buffer modes: unbuffered, line buffered, and block buffered. If not explicitly specified the default is to use line buffered if the input or output is a terminal. This means the buffer is flushed once a newline is seen. If it's not a terminal the default is to be unbuffered. For example, if the buffer is 512 bytes in size then there will need to be 512 bytes written to the buffer before any of it is actually written to the file or pipe.

There are three pipes here, stdin, stdout and stderr. Stderr is always opened in unbuffered mode but we will have to worry about the other two. From the wrapper's side we can force the flush ourselves


    def __call__(self, smiles):
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()
        return mol2nam.stdout.readline()

Test it again and ... python still hangs.

This almost certainly means that the command-line program isn't flushing its output after every line. To test it out I'll close stdin after writing the SMILES string. This should cause mol2nam to exit, which includes flushing the buffer.


    def __call__(self, smiles):
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.close()
        return mol2nam.stdout.readline()

% python smi2name.py
methane

% 
That worked so it looks we need a way to get around the block buffering for a pipe.

As it turns out, OpenEye distributes the source code for mol2nam. It's at $OE_DIR/examples/ogham/mol2nam.cpp . The relevant code is about 6 lines from the end of the program. I'll include the loop itself and the IUPAC call for context

  while (ifs >> mol)
  {
      ...
    std::string name = OEIUPAC::OECreateIUPACName(mol);
    if (outname)
       ...
    else printf("%s\n",name.c_str());
  }
  fflush(stdout);
  return 0;
}
The "outname" is the output filename and it not used in this case. To make the code flush change the "else printf" block to include the fflush:
    else
    {
      printf("%s\n",name.c_str());
      fflush(stdout);
    }
  }
  fflush(stdout);
  return 0;
}

Depending on where the code is located you may need to edit the Makefile to have it point to the OEChem directory. I changed the line at the top that read OEDIR=../.. into OEDIR=$(OE_DIR). Then did "make mol2nam".

The wrapper code will need to point to the new mol2nam. For now just change MOL2NAM to the correct filename. Here's the code as it currently stands:

import os
import subprocess

#MOL2NAM = os.path.join(os.environ["OE_DIR"], "bin", "mol2nam")
## The modified mol2nam which flushes stdout after writing the name
MOL2NAM = "ogham/mol2nam"

class Smi2Name:
    def __init__(self):
        # a subprocess.Popen connected to mol2nam
        self._mol2nam = None

    def _get_mol2nam(self):
        if self._mol2nam is None:
            self._mol2nam = subprocess.Popen( (MOL2NAM, "-"),
                                              stdin = subprocess.PIPE,
                                              stdout = subprocess.PIPE,
                                              stderr = subprocess.PIPE,
                                              close_fds = True)
        return self._mol2nam
    
    def __call__(self, smiles):
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()
        return mol2nam.stdout.readline()

smi2name = Smi2Name()

print smi2name("C")
print smi2name("c1ccccc1O")
and when I run it I get
% python smi2name.py
methane

phenol

% 
We're on the right track! We'll need to remove the newline and add error handling but it's enough to test the new performance.

I tweaked the timings function so it would run with this new code and found that the pipeline version takes about 21 seconds to run through 10,000 SMILES strings.

Total time: 21.01
Time per compound: 0.00
1.320u 1.630s 0:21.24 13.8%     0+0k 0+0io 0pf+0w
The command-line version took only 16 seconds to run so there may be a performance hit. The earlier version sent its data to /dev/null which doesn't need to do anything so should be faster than writing to a file or to a pipe. To see if the performance hit was a side effect of the fflush statement, I timed the two versions, telling both to save the results to a file. The best of two results are here with the fflush version is listed second.
% time mol2nam first_10000_nci.smi > tmp.tmp
mol2nam v1.0  Structure to Name Conversion
OpenEye Scientific Software, November 2003

14.850u 0.160s 0:16.40 91.5%    0+0k 0+3io 0pf+0w
% time ./ogham/mol2nam first_10000_nci.smi > tmp.tmp
mol2nam v1.0  Structure to Name Conversion
OpenEye Scientific Software, November 2003

15.130u 0.380s 0:16.81 92.2%    0+0k 0+0io 0pf+0w
% 
As you can see, writing to a file about as fast as writing to /dev/null and the extra time for fflush() is about small. It's 2% here but that could be due to my machine's variability because the numbers I get are +/- 0.2 seconds.

Error detection and handling for a coprocess is tricky, especially for programs like mol2nam which aren't designed to be used that way. When writing a wrapper for them you need to be very careful and be aware of the different possibilities that might happen.

The first thing mol2nam does when it starts is print 3 header lines to stderr. It then waits on stdin for input. If the input is valid it writes a name to stdout. If the input is invalid it writes some text to stderr; the actual text depends on the error. All output ends with a newline.

This will require determining if data is ready on the program's stdout or stderr. Normally a file read blocks until there is data so if I read from stdout and the actual text is sent to stderr then my wrapper code will never exit, and vice versa. Threads are one solution but rather complex for this task. In Unix the select command is used to figure out which file has input. You'll need to read the Python and Unix documentation for details. The short version is that select takes a list of possible file descriptors that might have read, write, or error actions and an optional timeout. It returns a list of the read, write, and error handles that had activity. If select returns because of the timeout these three lists are empty.

Here's the first attempt to figure out if there was an error. The major changes are to read the three header lines from stderr and test if output is ready from stdout or stderr. I also get rid of the trailing newline in the name.

import select
import subprocess

MOL2NAM = "ogham/mol2nam"

class Smi2Name:
    def __init__(self):
        self._mol2nam = None

    def _get_mol2nam(self):
        if self._mol2nam is None:
            mol2nam = subprocess.Popen( (MOL2NAM, "-"),
                                        stdin = subprocess.PIPE,
                                        stdout = subprocess.PIPE,
                                        stderr = subprocess.PIPE,
                                        close_fds = True)
            # skip the three header lines
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            self._mol2nam = mol2nam
            
        return self._mol2nam
    
    def __call__(self, smiles):
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()
        rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr], [], [])
        if mol2nam.stderr in rlist:
            return "ERROR"
        if mol2nam.stdout in rlist:
            return mol2nam.stdout.readline().rstrip()
        raise AssertionError("Should not get here")

smi2name = Smi2Name()

print repr(smi2name("C"))
print repr(smi2name("OO"))
print repr(smi2name("Q"))
print repr(smi2name("C1CC"))
print repr(smi2name("c1ccccc1O"))
Here's the output
% python smi2name.py
'methane'
'hydrogen peroxide'
'ERROR'
'ERROR'
'ERROR'
% 
It detected that "Q" was an error but didn't report phenol for c1ccccc1O. Why? Because there's still text waiting to be read from stderr. The wrapper needs to read all the text before going on. I can't say stderr.read() because that won't return until stderr is closed by mol2nam. I can't easily say stderr.readline() because I don't know how many lines to read. It depends on which error occured. I could hard-code that information but that's fragile.

I do have access to the source code for mol2nam so you would think I could change the output to make life easier for the wrapper. I would like to print "ERROR" to stdout if a line could not be read and print "Warning done." to stderr to provide an unambiguous signal of the end of the stderr text. I can't do either of these. If you instrument the code you'll see that the warnings come during while (ifs >> mol). It's inside the OEChem code and there's no way for me to change the behaviour.

Another possibility is to read lines until select says the input is empty.

        # This code does not work
        text = ""
        while mol2nam.stderr in select.select([mol2nam.stderr], [], [])[0]:
          text = text + mol2nam.stderr.readline()
There is a subtle timing bug in this code. OEChem's output can occur in parts. It might act similar to the following
  cerr << "Warning: Error parsing SMILES:" << endl;
  cerr << "Warning: Unclosed ring." << endl;
  cerr << "Warning: " << smiles << endl;
  cerr << "Warning: " << nspaces(i) << "^" << endl;
Depending on the system settings and the size of the SMILES string this text could arrive in the pipe in parts. Suppose there are two parts and the first one happens to end with a newline. Unix is a multitasking system so at this point it could pass control to the wrapper. The wrapper reads all the lines in the block, which empties the buffer. At this point select says the input is empty so the read loop ends.

But when control returns to mol2nam it prints additional text to stderr. The wrapper code is then desynchronized and will get confused in subsequent calls. With some protocols it's possible to become resyncronized but I don't think that's the case here.

There's one other thing to watch out for. What happens if mol2nam crashes? In the previous essay I showed how a large enough SMILES string causes a segmentation fault, but not until printing something to stderr. The wrapper needs to be able to recover from that case.

I've found that when errors are rare and tricky to recover from then the simplest solution is to shut down and restart the coprocess. Doing that is easy. mol2nam quits when there is no more input so just close its stdin. During the close it will flush all of its buffers so the error text will be available in stderr and can be retreived with a read(). I'll ignore the contents for now and just return the string "ERROR". After all that I'll set _mol2nam to None so the next time through the code will restart the subprocess connection.

import select
import subprocess

MOL2NAM = "ogham/mol2nam"

class Smi2Name:
    def __init__(self):
        # a subprocess.Popen connected to mol2nam
        self._mol2nam = None

    def _get_mol2nam(self):
        if self._mol2nam is None:
            mol2nam = subprocess.Popen( (MOL2NAM, "-"),
                                        stdin = subprocess.PIPE,
                                        stdout = subprocess.PIPE,
                                        stderr = subprocess.PIPE,
                                        close_fds = True)
            # skip the three header lines
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            self._mol2nam = mol2nam
            
        return self._mol2nam
    
    def __call__(self, smiles):
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()
        rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr], [], [])
        if mol2nam.stderr in rlist:
            # Tells mol2nam to quit
            mol2nam.stdin.close()
            stderr_text = mol2nam.stderr.read()

            # Doing this will restart the subprocess the next time through
            self._mol2nam = None
            return "ERROR"

        if mol2nam.stdout in rlist:
            return mol2nam.stdout.readline().rstrip()
        raise AssertionError("Should not get here")

smi2name = Smi2Name()

print repr(smi2name("C"))
print repr(smi2name("OO"))
print repr(smi2name("Q"))
print repr(smi2name("C1CC"))
print repr(smi2name("c1ccccc1O"))
and the output is
% python smi2name.py
'methane'
'hydrogen peroxide'
'ERROR'
'ERROR'
'phenol'
% 
Getting closer!

The next step is to parse the error text and raise an exception. I wrote code for most of that in the previous essay so I'll just reuse it here. With it in place the API is identical to the simple function from the last time so I can also reuse the self-test code. It's a good thing I had the test code because I had forgotten the special-case check for the empty SMILES string. If the SMILES string does contain a newline then it will desynchronize the protocol. That will cause significant problems that are hard to track down so I added a check for that case and test code. To make sure everything was synchronized at the end I added one more test at the end with a valid SMILES.

Here then is the current version of the code.

import re, select
import subprocess

MOL2NAM = "ogham/mol2nam"

class NamingError(Exception):
    pass

# Used to find the character position that cause the problem
_error_pos_pat = re.compile(r"^Warning: ( *)\^", re.MULTILINE)

def _find_error(text):
    errmsg = "Cannot parse SMILES"
    if "\nWarning: Unclosed branch." in text:
        errmsg = "Unclosed branch"
    elif "\nWarning: Unclosed ring." in text:
        errmsg = "Unclosed ring"

    m = _error_pos_pat.search(text)
    if m:
        errpos = len(m.group(1)) + 1
        errmsg = errmsg + " at position %d" % errpos
        
    return errmsg


class Smi2Name:
    def __init__(self):
        # a subprocess.Popen connected to mol2nam
        self._mol2nam = None

    def _get_mol2nam(self):
        if self._mol2nam is None:
            mol2nam = subprocess.Popen( (MOL2NAM, "-"),
                                        stdin = subprocess.PIPE,
                                        stdout = subprocess.PIPE,
                                        stderr = subprocess.PIPE,
                                        close_fds = True)
            # skip the three header lines
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            self._mol2nam = mol2nam
            
        return self._mol2nam
    
    def __call__(self, smiles):
        if smiles == "":
            return "vacuum"
        elif "\n" in smiles:
            raise NamingError("Newline not allowed in SMILES")
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()
        rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr], [], [])
        if mol2nam.stderr in rlist:
            # Tells mol2nam to quit
            mol2nam.stdin.close()
            stderr_text = mol2nam.stderr.read()

            # Doing this will restart the subprocess the next time through
            self._mol2nam = None
            
            raise NamingError(_find_error(stderr_text))
            
        if mol2nam.stdout in rlist:
            name = mol2nam.stdout.readline().rstrip()
            if "BLAH" in name:
                raise NamingError("Unsupported structure")
            return name
        
        raise AssertionError("Should not get here")

smi2name = Smi2Name()

def test():
    for smi, name, errmsg in (
        ("C", "methane", None),
        ("U", None, "Cannot parse SMILES at position 1"),
        ("CC1", None, "Unclosed ring at position 3"),
        ("C"*32764 + "(C)", None, "Unclosed branch"),
        ("C\nC", None, "Newline not allowed in SMILES"),
        ("CCCC(C", None, "Unclosed branch at position 6"),
        ("CCCCCC)C", None, "Cannot parse SMILES at position 7"),
        ("[U]", "uranium", None),
        ("", "vacuum", None),
        ("C1CC23CC4CC3C1C(C2)CC4", None, "Unsupported structure"),
        ("C#N", "hydrogen cyanide", None)):

        computed_name = computed_errmsg = None
        try:
            computed_name = smi2name(smi)
        except NamingError, err:
            computed_errmsg = str(err)

        if (name != computed_name or
            errmsg != computed_errmsg):
            raise AssertionError("SMILES: %r expected (%r %r) got (%r %r)"
                                 % (smi, name, errmsg,
                                    computed_name, computed_errmsg))
    print "All tests passed."

if __name__ == "__main__":
    test()

How's the performance? I reran the timing code and found

Total time: 25.11
Time per compound: 0.00
2.280u 2.720s 0:26.34 18.9%     0+0k 0+0io 0pf+0w
The command-line version took only 16 seconds, though there was no way to figure out which structures caused SMILES errors. The raw pipeline version took 21 seconds. This one takes about 26 seconds, probably because of the time spent restarting the server after failures.

Nearly all programming is a balance of tradeoffs. How robust should things be? How fast? How should it deal with different types of errors? Knowing the answers comes from experience because it's different for different situations. For general purpose use I lean towards robustness with informative error messages.

There are a few more tweaks needed to make this a general purpose module. I mentioned earlier that mol2name might be located in some place other than the default. After all, I used a version I compiled myself to have an extra fflush in it. The location needs to be a configurable value. I've found it best to include the executable name as part of the constructor and if it isn't needed to get it from the module's default settings.


class Smi2Name:
    def __init__(self, executable = None):
        # a subprocess.Popen connected to mol2nam
        self._mol2nam = None
        if executable is None:
            executable = MOL2NAM
        self.executable = executable

    def _get_mol2nam(self):
        if self._mol2nam is None:
            mol2nam = subprocess.Popen( (self.executable, "-"),
                                        stdin = subprocess.PIPE,
                                        stdout = subprocess.PIPE,
                                        stderr = subprocess.PIPE,
                                        close_fds = True)

To be even more flexible it could take an optional dictionary of environment settings in case the new program needs a different OE_DIR environment variable. But that's well into the land of being flexible for its own sake. One guideline in software development is YAGNI for "You Aren't Gonna Need It". I can conceive of wanting to compare the names generated by two different programs but unless the second program conforms very much to the same ad hoc protocol this wrapper won't work. Even if the new one does conform, there's nothing preventing using an executable which is a shell script that sets the correct environment variables, adjusts any command-line options, and calls the actual program.

My Smi2Name class implements the __call__ special method so I can use an instance of it as a function-like object. That's the code which looks like this.

smi2name = Smi2Name()
I'm actually slightly against the practice because I think it's more cute than useful. I prefer using normal function names. That's easy enough to fix: change "def __call__" into "def smi2name" and change the previous code snippet to use a bound method:
smi2name = Smi2Name().smi2name

One problem cropped up when I introduced the change to let different instances use different executables. In the new code the name is decided when the object is created rather than when the subprocess is started. Consider this sequence of events:

import smi2name
smi2name.MOL2NAM = "/does/not/exist"
print smi2name.smi2name("C")
The object smi2name.smi2name is an instance of Smi2Name created when the module is imported. The old version used the module's MOL2NAM to start the subprocess, which happens only once there was a compound to process. The new version gets the name when the object is created, and never again checks the module's MOL2NAM. So the old code would raise an OSError saying "No such file or directory" with the above sequence while the new code works, because it uses the original setting.

As a library developer you need to decide how to handle this. Should there be the callable object "smi2name.smi2name" at all? I think so because most people just want the name and don't care about the mechanics of creating the transformation object. A future implementation might use the OECreateIUPACName function from OEChem instead of going through an external program for it. By using the function (okay, "callable") interface, that change can be made without affect code that uses the library.

Perhaps the module should specify that "if smi2name.MOL2NAM is changed then smi2name.smi2name must also be recreated" and leave it up to the programmer to follow those instructions. There are a few problems with that. People are forgetful and will likely change one but not the other, it's somewhat tricky to test, and it's hard to make sure that everything gets updated to use the new object.

That last requires a bit of explanation. Consider the following code.

import smi2name

def test(namer = smi2name.smi2name):
  for smi, name in ( ("C", "methane"), ("O=O", "molecular oxygen") ):
    new_name = namer(smi)
    if name != new_name:
      raise AssertionError( (smi, name, new_name) )

smi2name.MOL2NAM = "/usr/local/bin/mol2nam"
smi2name.smi2name = smi2name.Smi2Name("/usr/local/bin/mol2nam")
test()
The test function tests that a given function generates the correct names for a given SMILES string. By default it uses smi2name.smi2name. It gets a reference to that object when the function is created. Later on another object is given the same name but the test function keeps a reference to the original object. This code will not work as expected because the test will use the original executable and not the one in /usr/local/bin.

The pretty strong rule here is that code should not reach into a module and change things. We'll need to add a layer to isolate the public interface from the parts that can change. That means the trick of using a callable object or bound method won't work. Instead, I'll use a function. The function needs to store the actual Smi2Name object in use so I'll keep it in a module variable. At the start it is None and only when the function is called will it instantiate the object. This lets users change the MOL2NAM setting before using the module.

# Defer instantiation of the wrapper until it's needed.
# This lets other code change MOL2NAM if needed.
_smi2name = None
def smi2name(smiles):
    global _smi2name
    if _smi2name is None:
        _smi2name = Smi2Name().smi2name
    return _smi2name(smiles)

I said that "code should not reach into a module and change things." That should include changing the value of MOL2NAM yet here I am making provision for it. If I follow that rule I should have another function used to change the default setting, something like

def set_default_mol2nam(s):
    global _smi2name, MOL2NAM
    MOL2NAM = s
    _smi2name = None
It also resets the Smi2Name instance used by the smi2name function so the next time it's called it will use the updated settings. I could write this function but I won't because I feel that this also is too ornate. It shouldn't ever be called. If a program needs to use a different executable the programmer should use the Smi2Name class directly.

The tension here is the different concepts of "ease of use." I think most people expect a SMILES to systematic name converter to be a function and don't much care how the function is implemented. I want to make this the simplest case. But the implementation may require some configuration information so there needs to be a way to tweak it. A much smaller number of people (perhaps only two in this case) will want to support several different configurations. For them, creating an instance of a class with the right properties is not hard.

Here's the final version of the code. The testing code is unchanged so if you want it then copy it from the earlier listing

import re, select
import subprocess

MOL2NAM = "ogham/mol2nam"

class NamingError(Exception):
    pass

# Used to find the character position that cause the problem
_error_pos_pat = re.compile(r"^Warning: ( *)\^", re.MULTILINE)

def _find_error(text):
    errmsg = "Cannot parse SMILES"
    if "\nWarning: Unclosed branch." in text:
        errmsg = "Unclosed branch"
    elif "\nWarning: Unclosed ring." in text:
        errmsg = "Unclosed ring"

    m = _error_pos_pat.search(text)
    if m:
        errpos = len(m.group(1)) + 1
        errmsg = errmsg + " at position %d" % errpos
        
    return errmsg

class Smi2Name:
    def __init__(self, executable = None):
        # a subprocess.Popen connected to mol2nam
        self._mol2nam = None
        if executable is None:
            executable = MOL2NAM
        self.executable = executable

    def _get_mol2nam(self):
        if self._mol2nam is None:
            mol2nam = subprocess.Popen( (self.executable, "-"),
                                        stdin = subprocess.PIPE,
                                        stdout = subprocess.PIPE,
                                        stderr = subprocess.PIPE,
                                        close_fds = True)
            # skip the three header lines
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            self._mol2nam = mol2nam
            
        return self._mol2nam
    
    def smi2name(self, smiles):
        """convert a SMILES string into an IUPAC name"""
        if smiles == "":
            return "vacuum"
        elif "\n" in smiles:
            raise NamingError("Newline not allowed in SMILES")
        mol2nam = self._get_mol2nam()
        mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()
        rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr], [], [])
        if mol2nam.stderr in rlist:
            # Tells mol2nam to quit
            mol2nam.stdin.close()
            stderr_text = mol2nam.stderr.read()

            # Doing this will restart the subprocess the next time through
            self._mol2nam = None
            
            raise NamingError(_find_error(stderr_text))
            
        if mol2nam.stdout in rlist:
            name = mol2nam.stdout.readline().rstrip()
            if "BLAH" in name:
                raise NamingError("Unsupported structure")
            return name
        
        raise AssertionError("Should not get here")


# Defer instantiation of the wrapper until it's needed.
# This lets other code change MOL2NAM if needed, but changes
# will only work if done before calling this function.
_smi2name = None
def smi2name(smiles):
    """convert a SMILES string into an IUPAC name"""
    global _smi2name
    if _smi2name is None:
        _smi2name = Smi2Name().smi2name
    return _smi2name(smiles)


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB