
Using XML-RPC

Suppose you want to make the systematic naming function available to other programs over the network. There are a huge number of ways to do it. You can program directly to the socket layer or use one of the many communications packages. A short list of the language-independent ones includes CORBA, SOAP, PVM, MPI and XML-RPC. The Python-specific ones include Pyro and Twisted's Perspective Broker, or you can roll your own with Python's pickle or marshal modules.

For most things I suggest using XML-RPC. It's a straightforward spec, and it's been around long enough that the various bugs have been worked out, making it stable and relatively language neutral. As a big plus, Python ships with client and server XML-RPC libraries, making it very simple to use.
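To give a feel for how small the spec is, here is a sketch that builds an XML-RPC request by hand and parses it back, using the marshalling helpers in Python's client library. It assumes Python 3, where xmlrpclib was renamed xmlrpc.client. The method name smi2name matches the service built later in this essay.

```python
import xmlrpc.client

# Encode a call to a method named "smi2name" with one string parameter.
request_xml = xmlrpc.client.dumps(("C",), methodname="smi2name")
print(request_xml)

# Decode it again; loads() returns the parameter tuple and the method name.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

The printed request is a short XML document with a methodName element and one param per argument, which is about all there is to the call format.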

For reference, here's the smi2name.py module I'll use for this essay. It's my working copy of the subprocess-based version I developed in an earlier essay. The major difference is I decided to add the check for known-to-be-illegal characters as part of the code. You may recall that I go back and forth on where to put that test. It depends on where and how the code is going to be used. I've decided it's going to be close enough to untrusted input that the extra test is appropriate. I've also added code to detect a few new error messages that might arise from bad SMILES strings.

import re, select
import subprocess
import os, signal

MOL2NAM = "/Users/dalke/tmp/ogham/mol2nam"

class NamingError(Exception):
    pass

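# Used to find the character position that caused the problem
_error_pos_pat = re.compile(r"^Warning: ( *)\^", re.MULTILINE)

# Check for characters other than printable ASCII (space through "~").
# Three-digit octal escapes are used because newer versions of re
# read at most three octal digits after the backslash.
_unexpected_char_pat = re.compile(r"[^\040-\176]")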

def _find_error(text):
    errmsg = "Cannot parse SMILES"
    if "\nWarning: Unclosed branch." in text:
        errmsg = "Unclosed branch"
    elif "\nWarning: Unclosed ring." in text:
        errmsg = "Unclosed ring"
    elif text.startswith("Warning: Unable to Kekulize SMILES"):
        # Strange: it's the first line of the error message ...
        errmsg = "Unable to Kekulize SMILES"
    elif "\nWarning: Incorrect reaction role" in text:
        errmsg = "Incorrect reaction role"

    m = _error_pos_pat.search(text)
    if m:
        errpos = len(m.group(1)) + 1
        errmsg = errmsg + " at position %d" % errpos
        
    return errmsg

class Smi2Name:
    def __init__(self, executable = None, timeout = None):
        # a subprocess.Popen connected to mol2nam
        self._mol2nam = None
        if executable is None:
            executable = MOL2NAM
        self.executable = executable
        self.timeout = timeout

    def _get_mol2nam(self):
        if self._mol2nam is None:
            mol2nam = subprocess.Popen( (self.executable, "-"),
                                        stdin = subprocess.PIPE,
                                        stdout = subprocess.PIPE,
                                        stderr = subprocess.PIPE,
                                        close_fds = True)
            # skip the three header lines
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            mol2nam.stderr.readline()
            self._mol2nam = mol2nam
            
        return self._mol2nam
    
    def smi2name(self, smiles):
        """convert a SMILES string into an IUPAC name"""
        if smiles == "":
            return "vacuum"
        m = _unexpected_char_pat.search(smiles)
        if m:
            raise NamingError("Unexpected character at position %d" %
                              (m.start(0)+1,))
        mol2nam = self._get_mol2nam()
        try:
            mol2nam.stdin.write(smiles + "\n")
        except IOError:
            # coprocess died since the last call?  Restart the connection
            self._mol2nam = None
            mol2nam = self._get_mol2nam()
            mol2nam.stdin.write(smiles + "\n")
        mol2nam.stdin.flush()
        rlist, _, _ = select.select([mol2nam.stdout, mol2nam.stderr],
                                    [], [], self.timeout)
        if mol2nam.stderr in rlist:
            # Tells mol2nam to quit
            mol2nam.stdin.close()
            stderr_text = mol2nam.stderr.read()

            # Doing this will restart the subprocess the next time through
            self._mol2nam = None
            
            raise NamingError(_find_error(stderr_text))
            
        if mol2nam.stdout in rlist:
            name = mol2nam.stdout.readline().rstrip()
            if "BLAH" in name:
                raise NamingError("Unsupported structure")
            return name

        # Timeout reached.  Kill the child and restart.
        try:
            os.kill(mol2nam.pid, signal.SIGTERM)
        except OSError:
            # Already died?
            pass
        self._mol2nam = None
        raise NamingError("timeout reached")

# Defer instantiation of the wrapper until it's needed.
# This lets other code change MOL2NAM if needed, but changes
# will only work if done before calling this function.
_smi2name = None
def smi2name(smiles):
    """convert a SMILES string into an IUPAC name"""
    global _smi2name
    if _smi2name is None:
        _smi2name = Smi2Name().smi2name
    return _smi2name(smiles)


def test():
    for smi, name, errmsg in (
        ("C", "methane", None),
        ("C"+chr(127)+"S", None, "Unexpected character at position 2"),
        ("CC"+chr(3), None, "Unexpected character at position 3"),
        ("S", "hydrogen sulfide", None),
        ("U", None, "Cannot parse SMILES at position 1"),
        ("CC1", None, "Unclosed ring at position 3"),
        ("C", "methane", None),
        ("C"*1000 , "kiliane", None),
        ("C"*32764 + "(C)", None, "Unclosed branch"),
        ("C\nC", None, "Unexpected character at position 2"),
        ("CCCC(C", None, "Unclosed branch at position 6"),
        ("CCCCCC)C", None, "Cannot parse SMILES at position 7"),
        ("[U]", "uranium", None),
        ("", "vacuum", None),
        ("c1ccccc1", "benzene", None),
        ("c1cccccc1", None, "Unable to Kekulize SMILES"),
        ("O>C>N", "oxidane; carbane; azane", None),
        ("OC>C.C", None, "Incorrect reaction role at position 6"),
        ("O>CC>N>U", None, "Cannot parse SMILES at position 7"),
        ("C1CC23CC4CC3C1C(C2)CC4", None, "Unsupported structure"),
        ("C#N", "hydrogen cyanide", None)):

        computed_name = computed_errmsg = None
        try:
            computed_name = smi2name(smi)
        except NamingError, err:
            computed_errmsg = str(err)

        if (name != computed_name or
            errmsg != computed_errmsg):
            raise AssertionError("SMILES: %r expected (%r %r) got (%r %r)"
                                 % (smi, name, errmsg,
                                    computed_name, computed_errmsg))
    print "All tests passed."

if __name__ == "__main__":
    test()
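
The character check can be tried on its own, without mol2nam. This is a small Python 3 sketch of the same idea. The helper name first_bad_position is made up for the illustration, and the pattern is written with hex escapes instead of octal. The expected positions are the ones from the test table above.

```python
import re

# Anything outside printable ASCII (space through "~") is rejected.
unexpected_char_pat = re.compile(r"[^\x20-\x7e]")

def first_bad_position(smiles):
    """Return the 1-based position of the first unexpected character, or None."""
    m = unexpected_char_pat.search(smiles)
    if m is None:
        return None
    return m.start() + 1

print(first_bad_position("c1ccccc1"))            # None
print(first_bad_position("C\nC"))                # 2
print(first_bad_position("CC" + chr(3)))         # 3
print(first_bad_position("C" + chr(127) + "S"))  # 2
```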

And here's the first version of an XML-RPC server for it, using the standard SimpleXMLRPCServer module. Note that it's listening on port 8000 of the local machine.

import SimpleXMLRPCServer
import smi2name

server = SimpleXMLRPCServer.SimpleXMLRPCServer(("localhost", 8000))
server.register_function(smi2name.smi2name, "smi2name")
server.serve_forever()

I called it smi2name_server.py because I'm creative that way. Run it from the command line like this:

% python smi2name_server.py

It'll just sit there waiting for requests.

In another shell window start Python and import the XML-RPC client library. I'll make a Server instance, which makes a wrapper to the XML-RPC server on the given URL.

>>> import xmlrpclib
>>> server = xmlrpclib.Server("http://localhost:8000/")
>>> server.smi2name("C")
'methane'
>>> 

If it worked for you then the server window will print a statement like this:

localhost - - [21/Apr/2005 10:25:03] "POST / HTTP/1.0" 200 -

Because I don't find this message all that useful, later on I'll show how to disable it.
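
If you want to try the round trip without mol2nam installed, something like the following sketch should work. It assumes Python 3, where the modules are named xmlrpc.server and xmlrpc.client. The fake_smi2name function and its lookup table are stand-ins invented for the example, and the server runs in a background thread so one script can play both roles.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for smi2name.smi2name so the sketch runs without mol2nam.
_NAMES = {"C": "methane", "CC": "ethane"}

def fake_smi2name(smiles):
    return _NAMES[smiles]

# Port 0 asks the operating system for any free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(fake_smi2name, "smi2name")
port = server.server_address[1]

# Serve in a background thread so the client can run in the same process.
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

try:
    client = xmlrpc.client.ServerProxy("http://127.0.0.1:%d/" % port)
    result = client.smi2name("C")
    print(result)
finally:
    server.shutdown()
    server.server_close()
```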

If you didn't set MOL2NAM to the right location then you probably got an exception on the client side like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/xmlrpclib.py", line 1029, in __call__
    return self.__send(self.__name, args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/xmlrpclib.py", line 1316, in __request
    verbose=self.__verbose
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/xmlrpclib.py", line 1080, in request
    return self._parse_response(h.getfile(), sock)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/xmlrpclib.py", line 1219, in _parse_response
    return u.close()
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/xmlrpclib.py", line 742, in close
    raise Fault(**self._stack[0])
xmlrpclib.Fault: <Fault 1: 'exceptions.OSError:[Errno 2] No such file or directory'>

By default, exceptions on the XML-RPC server get sent back to the client and converted into a local exception.
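
A client that wants to handle the failure instead of crashing can catch the Fault. Here is a sketch, assuming Python 3 module names. The always_fails function is a made-up stand-in for a misconfigured naming function, and the exact wording of the fault string is up to the server library.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def always_fails(smiles):
    # Hypothetical stand-in for a naming function whose executable is missing.
    raise OSError(2, "No such file or directory")

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(always_fails, "smi2name")
threading.Thread(target=server.serve_forever, daemon=True).start()

client = xmlrpc.client.ServerProxy("http://127.0.0.1:%d/" % server.server_address[1])
fault = None
try:
    client.smi2name("C")
except xmlrpc.client.Fault as err:
    # The server-side exception arrives as a Fault with a code and a string.
    fault = err
    print("fault code:", err.faultCode)
    print("fault string:", err.faultString)
finally:
    server.shutdown()
    server.server_close()
```

Only the text of the exception crosses the wire, so the client cannot catch the original exception class, only Fault.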

To kill the server hit control-C in its window. You may need to hit it twice; I don't know why. You do not need to exit the client because xmlrpclib uses a new HTTP connection for every request. It can't tell the difference if the server shuts down and then restarts, though code that uses transferred data may be able to tell.

The problem I showed is actually two problems. The first is the misconfiguration of MOL2NAM but the second is that the error isn't discovered until someone uses the service. It's best to fail early, but not too early. As a library it's best to fail at the first use, because the library might not be used. But in this server where everything is meant to be used it's best to fail when the server starts, to indicate that it's not functional.

I considered just checking if the executable file existed but decided it was best to just call the function and see if it returns a correct value. Here's the new version of the server code.

import SimpleXMLRPCServer
import smi2name

# Test that the library works
name = smi2name.smi2name("C")
if name != "methane":
    raise AssertionError("'C' returns %r" % (name,))

server = SimpleXMLRPCServer.SimpleXMLRPCServer(("localhost", 8000))
server.register_function(smi2name.smi2name, "smi2name")
server.serve_forever()

When it's run with a misconfigured MOL2NAM setting it now fails at startup:

% python smi2name_server.py
Traceback (most recent call last):
  File "smi2name_server.py", line 5, in ?
    name = smi2name.smi2name("C")
  File "/Users/dalke/novartis/smi2name.py", line 107, in smi2name
    return _smi2name(smiles)
  File "/Users/dalke/novartis/smi2name.py", line 62, in smi2name
    mol2nam = self._get_mol2nam()
  File "/Users/dalke/novartis/smi2name.py", line 45, in _get_mol2nam
    close_fds = True)
  File "/Users/dalke/novartis/subprocess.py", line 600, in __init__
    errread, errwrite)
  File "/Users/dalke/novartis/subprocess.py", line 1053, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

I could also have used smi2name.test() but decided that that would be overkill. Also, test code like that is usually not meant to be part of the public API to a module. Perhaps I should have named it _test().
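
For comparison, the cheaper check (does the executable exist?) might look like this Python 3 sketch. The function name executable_exists and the missing path are invented for the illustration.

```python
import os
import sys

def executable_exists(path):
    """Return True if path names a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)

# The Python interpreter itself is a convenient file that should pass.
print(executable_exists(sys.executable))
# A path that does not exist fails the check.
print(executable_exists("/no/such/dir/mol2nam"))
```

A check like this passes for any executable file, including the wrong program, which is one reason calling the function and comparing the answer is the stronger test.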

After fixing the MOL2NAM setting and starting the server I went back to the Python interactive window with the xmlrpclib client already running:

>>> server.smi2name("CC")
'ethane'
>>> server.smi2name("c1ccccc1O")
'phenol'
>>> 

Congratulations, you have a working server.

Sometimes if you quit the server and restart it you'll get a message like the following:

% python smi2name_server.py
Traceback (most recent call last):
  File "smi2name_server.py", line 9, in ?
    server = SimpleXMLRPCServer.SimpleXMLRPCServer(("localhost", 8000))
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/SimpleXMLRPCServer.py", line 450, in __init__
    SocketServer.TCPServer.__init__(self, addr, requestHandler)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/SocketServer.py", line 330, in __init__
    self.server_bind()
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/SocketServer.py", line 341, in server_bind
    self.socket.bind(self.server_address)
  File "<string>", line 1, in bind
socket.error: (48, 'Address already in use')
% 

This happens because of certain guarantees made by the TCP specification. Even after the connection is closed the operating system keeps the socket around for a bit longer (the TIME_WAIT state) in case, for instance, the client asks the server to resend the close message. The operating system will release the socket after a short time, from about 30 seconds to 4 minutes depending on various settings.

I've tried to figure out just why things didn't close nicely but haven't managed to track it down. I think it's a timing problem when the server closes the connection before the client.

If you need the ability to restart you should do a few things. First, always shut down the server. This won't fix the problem but it's a good practice. I'll make the call in a try/finally block to ensure that it's always called.

try:
    server.serve_forever()
finally:
    server.server_close()

Second, there's a socket option called SO_REUSEADDR which tells the operating system to let a new server bind to the address even if the old socket is still waiting for other potential packets. The SimpleXMLRPCServer class has a class variable named allow_reuse_address which, when True, tells the instance to set that option. Because it's used during the constructor and there's no constructor argument for it, the options are to implement a new class whose constructor sets that value first and then calls the base class constructor, or a new class which sets that class variable. I chose the second of these. Note also that I disable the request logging, with logRequests = False, because I didn't find the information useful.

import SimpleXMLRPCServer
import smi2name

class Server(SimpleXMLRPCServer.SimpleXMLRPCServer):
    allow_reuse_address = True

# Test that the library works
name = smi2name.smi2name("C")
if name != "methane":
    raise AssertionError("'C' returns %r" % (name,))

server = Server(("localhost", 8000), logRequests = False)
server.register_function(smi2name.smi2name, "smi2name")
try:
    server.serve_forever()
finally:
    server.server_close()

Using SO_REUSEADDR does have its downsides. It can cause other sorts of errors when trying to reconnect from the same machine and can cause security problems on some operating systems.
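
At the socket level, what allow_reuse_address asks for is roughly the following. This is a Python 3 sketch of the standard socket calls, not a copy of what the server class does internally. It binds to a port picked by the operating system and not to port 8000, so that it can run alongside the server.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # The option must be set before bind() to have any effect.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    sock.listen(5)
    reuse = sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
    bound_port = sock.getsockname()[1]
    print("SO_REUSEADDR is", "on" if reuse else "off", "for port", bound_port)
finally:
    sock.close()
```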

The above code is enough for personal use, where configuration changes require editing code. If it's used by more people and on different machines then it should be configurable from the command line. To parse the command-line options use the optparse module from Python's standard library. Here's a version that lets users pick which host interface, port number, and mol2nam executable to use. To implement that last one I create a new Smi2Name instance, which is preferred over changing smi2name.MOL2NAM.

import SimpleXMLRPCServer
import optparse

import smi2name

class Server(SimpleXMLRPCServer.SimpleXMLRPCServer):
    allow_reuse_address = True


def run_server(addr, executable):
    smi2name_func = smi2name.Smi2Name(executable).smi2name
    # Test that the library works
    name = smi2name_func("C")
    if name != "methane":
        raise AssertionError("'C' returns %r" % (name,))
    
    server = Server(addr, logRequests = False)
    server.register_function(smi2name_func, "smi2name")

    print "Starting smi2name XML-RPC server at",
    print repr("http://%s:%d/" % (addr[0], addr[1]))
    try:
        server.serve_forever()
    finally:
        server.server_close()

def main():
    parser = optparse.OptionParser(conflict_handler="resolve")
    parser.add_option("-h", "--host", dest="host", default="localhost",
                      help="host name of network interface")
    parser.add_option("-p", "--port", dest="port", default=8000, type="int", 
                      help="port number to use")
    parser.add_option("-e", "--executable", dest="executable",
                      default=smi2name.MOL2NAM,
                      help="path to mol2nam executable")
    
    (options, args) = parser.parse_args()
    if args:
        parser.error("unknown option %r" % (args[0],))

    run_server( (options.host, options.port), options.executable )

if __name__ == "__main__":
    main()

Here is the help text from using --help:

% python smi2name_server.py --help
usage: smi2name_server.py [options]

options:
  --help                show this help message and exit
  -hHOST, --host=HOST   host name of network interface
  -pPORT, --port=PORT   port number to use
  -eEXECUTABLE, --executable=EXECUTABLE
                        path to mol2nam executable
% 

The conflict_handler="resolve" is needed because by default optparse already uses "-h" as a short option for help.
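
The option handling can be tried without starting a server by passing an explicit argument list to parse_args(). This sketch repeats only the host and port options from the code above and runs under Python 3, which still ships optparse. The host name example.org is just a placeholder.

```python
import optparse

def build_parser():
    # "resolve" lets the -h defined below replace the built-in -h for help.
    parser = optparse.OptionParser(conflict_handler="resolve")
    parser.add_option("-h", "--host", dest="host", default="localhost",
                      help="host name of network interface")
    parser.add_option("-p", "--port", dest="port", default=8000, type="int",
                      help="port number to use")
    return parser

parser = build_parser()
# Parse an explicit argument list instead of sys.argv.
options, args = parser.parse_args(["-h", "example.org", "--port", "9000"])
print(options.host, options.port, args)

# With no arguments the defaults apply.
defaults, _ = parser.parse_args([])
print(defaults.host, defaults.port)
```

Because of type="int" the port comes back as an integer, ready to go into the address tuple.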

This is the starting point for many features. It's easy to see how to add new services. If the server gets heavily used, though, there will be a problem. It is implemented with a single thread, which means it can only process one request at a time. The operating system will queue up a small number of requests (about three) but at some point that queue will fill up as well.

There are several ways to handle that. You can use multithreading, you can spawn off a new process to handle each request, or you can use a reactor-style framework like Twisted. If it's a multiprocessor box you might want to start several instances of mol2nam all used by one server. Or you can shift the problem upstream and have something like pythondirector. Clients point to the pythondirector instance which forwards the request to the next available server. If that server fails or is busy it tries the next server until one is available or there aren't any servers left to try.
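
As one illustration of the multithreading option, here is a sketch that mixes the standard library's ThreadingMixIn into the XML-RPC server. It assumes Python 3 module names, and the lookup table stands in for the single mol2nam coprocess. The lock is there because a shared coprocess can only carry on one conversation at a time.

```python
import socketserver
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

class ThreadedServer(socketserver.ThreadingMixIn, SimpleXMLRPCServer):
    # Each request is handled in its own thread.
    daemon_threads = True

# Hypothetical stand-in for the single mol2nam coprocess.
_NAMES = {"C": "methane", "CC": "ethane", "S": "hydrogen sulfide"}
_lock = threading.Lock()

def locked_smi2name(smiles):
    # Only one thread at a time may talk to the shared coprocess.
    with _lock:
        return _NAMES[smiles]

server = ThreadedServer(("127.0.0.1", 0), logRequests=False)
server.register_function(locked_smi2name, "smi2name")
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

results = {}

def worker(smiles):
    # Each client thread makes its own ServerProxy rather than sharing one.
    client = xmlrpc.client.ServerProxy(url)
    results[smiles] = client.smi2name(smiles)

threads = [threading.Thread(target=worker, args=(smi,)) for smi in _NAMES]
for t in threads:
    t.start()
for t in threads:
    t.join()

server.shutdown()
server.server_close()
print(results)
```

With the lock in place the naming step is still done one request at a time. The threads only keep slow clients from blocking each other, so real parallel naming would need several mol2nam instances.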

The choice of approach is complicated and depends on many factors. But don't worry about deciding upon the solution until you're sure you'll have a problem.

By the way, even this code will hang in a few strange ways. Suppose the executable points to the original version of mol2nam (without the flush) or to something like /bin/cat which accepts the given input but buffers its output. The wrapper will sit, blocked, waiting in _get_mol2nam() to read the header lines from stderr. That can be fixed with careful use of select, but I don't think it's important enough to worry about.
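
For the curious, the select-based fix could start from something like this Python 3 sketch. The helper readline_with_timeout is a made-up name, an operating system pipe stands in for the child's stderr, and, like the wrapper itself, it assumes a Unix-like system where select works on pipes.

```python
import os
import select

def readline_with_timeout(fileobj, timeout):
    """Return one line from fileobj, or None if nothing arrives in time."""
    rlist, _, _ = select.select([fileobj], [], [], timeout)
    if not rlist:
        return None
    return fileobj.readline()

# An OS pipe stands in for the child's stderr.
read_fd, write_fd = os.pipe()
reader = os.fdopen(read_fd, "rb", 0)   # unbuffered, so select sees the real state

silent = readline_with_timeout(reader, 0.1)     # nothing written yet
os.write(write_fd, b"header line\n")
header = readline_with_timeout(reader, 0.1)

os.close(write_fd)
reader.close()
print(silent, header)
```

This only guards against a child that writes nothing at all. A child that writes part of a line and then stalls would still block in readline(), which is part of why the careful version is more work than it looks.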


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2013 Andrew Dalke Scientific AB