
I/O location and error handling

I've been interested in a thorny question of how to handle errors and location information in file I/O, especially as it applies to chemical structure file formats. I'll organize my ideas around a parser for the /etc/passwd file on my Mac.

Some of the lines from my passwd file are:

# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
Comment lines start with a "#". Each record is on a single line, with colon separated fields.

A simple parser

This is very easy to parse. Here's an example, along with a driver:

from __future__ import print_function

class PasswdEntry(object):
    def __init__(self, name, passwd, uid, gid, gecos, dir, shell):
        self.name = name
        self.passwd = passwd
        self.uid = uid
        self.gid = gid
        self.gecos = gecos
        self.dir = dir
        self.shell = shell
        
def read_passwd_entries(infile):
    for line in infile:
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue
     
        name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

def main():
    import sys
    filename = "/etc/passwd"
    if sys.argv[1:]:
      filename = sys.argv[1]
    with open(filename) as passwd_file:
        for entry in read_passwd_entries(passwd_file):
            print(entry.name, "=>", entry.shell)

if __name__ == "__main__":
    main()

The output when I run this starts with:

nobody => /usr/bin/false
root => /bin/sh
daemon => /usr/bin/false
Great! I have working code which handles the vast majority of things I want to do with a passwd file.

Errors should match the API level

But it doesn't handle everything. For example, what happens if a file contains a poorly formatted line? When I test with the following line:

dalke:*:-2:-2:Andrew Dalke:/var/empty
I get the output:
% python entries.py bad_passwd 
nobody => /usr/bin/false
Traceback (most recent call last):
  File "entries.py", line 33, in <module>
    main()
  File "entries.py", line 29, in main
    for entry in read_passwd_entries(passwd_file):
  File "entries.py", line 20, in read_passwd_entries
    name, passwd, uid, gid, gecos, dir, shell = line.split(":")
ValueError: need more than 6 values to unpack
This isn't very helpful. Which record failed, and which line do I need to fix?

It isn't hard to modify read_passwd_entries() to make the error message match the API level:

def read_passwd_entries(infile):
    for lineno, line in enumerate(infile, 1):
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            name = getattr(infile, "name", None)
            if name is not None:
                where = " of %r" % (name,)
            else:
                where = ""
            raise ValueError("Cannot parse password entry on line %d%s: %r"
                             % (lineno, where, line))
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)
The error behavior is now:
% python entries.py bad_passwd
nobody => /usr/bin/false
Traceback (most recent call last):
  File "entries.py", line 42, in <module>
    main()
  File "entries.py", line 38, in main
    for entry in read_passwd_entries(passwd_file):
  File "entries.py", line 29, in read_passwd_entries
    % (lineno, where, line))
ValueError: Cannot parse password entry on line 12 of 'bad_passwd': 'dalke:*:-2:-2:Andrew Dalke:/var/empty'

Specify error handling policy

That's better, but sometimes I want the parser to ignore a bad record and continue processing, instead of stopping. This is frequently needed in chemistry file formats. While the formats are pretty well defined, different chemistry toolkits have different chemistry models and different levels of strictness. RDKit, for example, does not support hexavalent carbon, while OEChem accepts unrealistic valences.

The failure rates in structure files are low, with failures in perhaps 1 in 10,000 structures, depending on the data source and toolkit combination. The usual policy is to report a warning message and continue.

The solution I like is to have a parameter which describes the error handling policy. By default I'll say it's "strict", which stops processing. The "warn" policy prints a message to stderr, and "ignore" just skips the record. (It's not hard to think of other policies, or to support user-defined error handler objects in addition to strings.)

I modified the parser code to handle these three policies, and updated the driver so you could specify the policies on the command-line. Here's the complete program:

from __future__ import print_function

import sys
import argparse

class PasswdEntry(object):
    def __init__(self, name, passwd, uid, gid, gecos, dir, shell):
        self.name = name
        self.passwd = passwd
        self.uid = uid
        self.gid = gid
        self.gecos = gecos
        self.dir = dir
        self.shell = shell

# Define different error handlers, and a common error formatter.

def _format_msg(lineno, source, line):
    if source is None:
        where = "on line %d" % (lineno,)
    else:
        where = "on line %d of %r" % (lineno, source)
    return "Cannot parse record %s: %r" % (where, line)

def ignore_handler(lineno, source, line):
    pass

def warn_handler(lineno, source, line):
    msg = _format_msg(lineno, source, line)
    sys.stderr.write(msg + "\n")

def strict_handler(lineno, source, line):
    msg = _format_msg(lineno, source, line)
    raise ValueError(msg)

error_handlers = {
    "ignore": ignore_handler,
    "warn": warn_handler,
    "strict": strict_handler,
}
        
def read_passwd_entries(infile, errors="warn"):
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))
    
    for lineno, line in enumerate(infile, 1):
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            # Handle the failure
            source_name = getattr(infile, "name", None)
            error_handler(lineno, source_name, line)
            # If we get here then continue to the next record
            continue
        
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

# Driver code to help test the library

parser = argparse.ArgumentParser(
    description = "List username/shell entries in a password file")
parser.add_argument("--errors", choices = ["ignore", "warn", "strict"],
                    default = "strict",
                    help = "Specify the error handling policy")
parser.add_argument("filename", nargs="?", default="/etc/passwd",
                    help = "Password file to parse")
                  
def main():
    args = parser.parse_args()
    filename = args.filename
    try:
        with open(filename) as passwd_file:
            try:
                for entry in read_passwd_entries(passwd_file, errors=args.errors):
                    print(entry.name, "=>", entry.shell)
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
    except IOError as err:
        raise SystemExit("ERROR with password file: %s" % (err,))
        

if __name__ == "__main__":
    main()

Let's see it in action. First, the default, which raises a ValueError. The main driver catches the ValueError, reports the error message, and exits:

% python entries.py bad_passwd 
nobody => /usr/bin/false
ERROR with password entry: Cannot parse record on line 12 of 'bad_passwd': 'dalke:*:-2:-2:Andrew Dalke:/var/empty'
Next, I'll specify the "warn" handler. (The warning message comes first because stderr is unbuffered, while the piped stdout is block-buffered.)
% python entries.py bad_passwd --errors warn | head -4
Cannot parse record on line 12 of 'bad_passwd': 'dalke:*:-2:-2:Andrew Dalke:/var/empty'
nobody => /usr/bin/false
root => /bin/sh
daemon => /usr/bin/false
_uucp => /usr/sbin/uucico
Finally, ignore all errors and keep on processing:
% python entries.py bad_passwd --errors ignore | head -4
nobody => /usr/bin/false
root => /bin/sh
daemon => /usr/bin/false
_uucp => /usr/sbin/uucico
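
The comment in the code notes that a more sophisticated solution might support user-defined error handlers in addition to strings. A minimal sketch of that idea, using the error_handlers dictionary defined above (the get_error_handler() and CountingHandler names are mine, not part of the program):

def get_error_handler(errors):
    # Accept either a policy name or a callable which takes
    # (lineno, source, line), just like the handlers above.
    if callable(errors):
        return errors
    try:
        return error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))

# A user-defined handler which silently counts the bad records.
class CountingHandler(object):
    def __init__(self):
        self.num_errors = 0
    def __call__(self, lineno, source, line):
        self.num_errors += 1

If read_passwd_entries() used get_error_handler() for its lookup, then a call like read_passwd_entries(infile, errors=CountingHandler()) would skip the bad records while keeping a count of them.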

API access to location information

Unfortunately, there are still a few things the parser API doesn't support. For example, which line contains the "root" record? The parser tracks the line number, but there's no way for the caller to get access to that information.

My solution was to develop a "Location" object. You might have noticed that source filename, current line number, and line content were all passed around together. This is often a hint that those parameters should be part of the same data type, rather than individually. I tried it out, and found the result much easier to understand.

Here's the new Location class:

class Location(object):
    def __init__(self, source=None, lineno=None, line=None):
        self.source = source
        self.lineno = lineno
        self.line = line

    def where(self):
        source = self.source
        if source is None:
            return "line %s" % self.lineno
        else:
            return "line %d of %r" % (self.lineno, source)
It's also nice that I could abstract the "where()" code into a method, so it's the only place which needs to worry about an unknown source name.

The error handlers are now easier, because they only take a single location argument, and because the "where()" method hides some of the complexity:

# Define different error handlers, and a common error formatter.

def _format_msg(location):
    return "Cannot parse record %s: %s" % (location.where(), location.line)

def ignore_handler(location):
    pass

def warn_handler(location):
    msg = _format_msg(location)
    sys.stderr.write(msg + "\n")

def strict_handler(location):
    msg = _format_msg(location)
    raise ValueError(msg)

The parser code has a few changes. I want people to be able to pass in their own location tracker or, if none is specified, to create my own local one. More importantly, I need to set the location's "lineno" and "line" properties before passing control either to the error handler or back to the caller. I also decided that the final value of "location.lineno" should be the number of lines in the file, and if the file is empty then it should be 0.

Here's the updated parser code:

def read_passwd_entries(infile, errors="warn", location=None):        
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))

    if location is None:
        location = Location(getattr(infile, "name", None))
    
    lineno = 0  # define lineno even if the file is empty

    for lineno, line in enumerate(infile, 1):
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue

        # Track where we are
        location.lineno = lineno
        location.line = line

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            # Handle the failure
            error_handler(location)
            # If we get here then continue to the next record
            continue
        
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

    # Save the final line number
    location.lineno = lineno
It's getting more complex, but I think you can still follow the control flow.

I made some changes to the driver code to test the new location API. If you specify "--with-lineno" then each output line will start with the line number for the given record.

Here's the updated driver code:

parser = argparse.ArgumentParser(
    description = "List username/shell entries in a password file")
parser.add_argument("--errors", choices = ["ignore", "warn", "strict"],
                    default = "strict",
                    help = "Specify the error handling policy")
parser.add_argument("--with-lineno", action="store_true",
                    help="include line numbers in the output")
parser.add_argument("filename", nargs="?", default="/etc/passwd",
                    help = "Password file to parse")
                  
def main():
    args = parser.parse_args()
    filename = args.filename
    location = Location(filename)
    if args.with_lineno:
        output_fmt = "{location.lineno}: {entry.name} => {entry.shell}"
    else:
        output_fmt = "{entry.name} => {entry.shell}"
    try:
        with open(filename) as passwd_file:
            try:
                for entry in read_passwd_entries(
                        passwd_file, errors=args.errors, location=location):
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
    except IOError as err:
        raise SystemExit("ERROR with password file: %s" % (err,))

Now to see if the new code works. First, run it with the defaults to get the same output as before:

% python entries.py | head -4
nobody => /usr/bin/false
root => /bin/sh
daemon => /usr/bin/false
_uucp => /usr/sbin/uucico
and then with the option to show the line numbers:
% python entries.py --with-lineno | head -4
11: nobody => /usr/bin/false
12: root => /bin/sh
13: daemon => /usr/bin/false
14: _uucp => /usr/sbin/uucico
Yay!

A PasswdReader iterator, with location

I think this API is still complicated. The caller must create the Location instance and pass it to read_passwd_entries() in order to get the location information. The thing is, even if the location is None, the function ends up creating a Location in order to have good error reporting. What about making that otherwise internal location public, so I can use it as the default location, instead of always creating and passing one in when I need it?

I did this with a bit of wrapping. The read_passwd_entries() function returns an iterator of PasswdEntry instances. The iterator is currently implemented as a generator, but it doesn't have to be a generator. I'll instead have it return a PasswdReader instance, where PasswdReader.location gives the location, and iterating over the PasswdReader gives the PasswdEntry elements from the underlying generator.

The PasswdReader iterator class, which takes the location and the underlying generator, is as follows:

class PasswdReader(object):
    def __init__(self, location, reader):
        self.location = location
        self._reader = reader

    def __iter__(self):
        return self._reader

    # For Python 2
    def next(self):
        return self._reader.next()

    # For Python 3
    def __next__(self):
        return next(self._reader)
I also modified read_passwd_entries() so it returns this new iterator instead of the original generator. I've found that the easiest way is to break the original function into two parts. The first does parameter validation and normalization, and the second is the actual generator:
def read_passwd_entries(infile, errors="warn", location=None):
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))
    
    if location is None:
        location = Location(getattr(infile, "name", None))

    return PasswdReader(location,
                        _read_passwd_entries(infile, error_handler, location))
    
def _read_passwd_entries(infile, error_handler, location):
    lineno = 0  # define lineno even if the file is empty
    for lineno, line in enumerate(infile, 1):
       ...
The main reason to split it up this way is to have eager evaluation for the parameter checking, and lazy evaluation for the actual reader.
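
To see the difference, here's a small sketch using the functions above (the "sloppy" policy name is deliberately invalid): the two-part version rejects the bad policy as soon as read_passwd_entries() is called, while an all-in-one generator wouldn't complain until the first record is requested:

with open("/etc/passwd") as passwd_file:
    # Eager: the wrapper function validates its parameters immediately.
    try:
        reader = read_passwd_entries(passwd_file, errors="sloppy")
    except ValueError as err:
        print("caught at call time:", err)

    # Lazy: a plain generator runs none of its body until iterated.
    def all_in_one(infile, errors="warn"):
        error_handler = error_handlers[errors]  # not run until next() is called
        for line in infile:
            yield line

    entries = all_in_one(passwd_file, errors="sloppy")  # no error yet
    try:
        next(entries)
    except KeyError as err:
        print("not caught until the first iteration:", err)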

PasswdReader iterator as context manager

The end result is simpler driver code, though it's still not as simple as I would like. The old code was something like:

    location = Location(filename)
    ...
        with open(filename) as passwd_file:
            try:
                for entry in read_passwd_entries(
                        passwd_file, errors=args.errors, location=location):
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
while the new driver only needs to get the "location" field. The critical core in the new driver is:
        with open(filename) as passwd_file:
            reader = read_passwd_entries(passwd_file, errors=args.errors)
            location = reader.location
            try:
                for entry in reader:
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))

To make it simpler still, I'll include the open() functionality as part of the read_passwd_entries() API. That is, if the first parameter is a string then I'll assume it's a file name and open the file myself, otherwise I'll assume it's a Python file object. (In CS terms, I'll make the function polymorphic on the first parameter.)

That is, I want the driver code to look like this:

        with read_passwd_entries(filename, errors=args.errors) as reader:
            location = reader.location
            try:
                for entry in reader:
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
That means that the object returned from read_passwd_entries() must also support the context manager API. The simplest (and wrong) version adds dummy __enter__ and __exit__ methods to the PasswdReader, like this:
# WARNING: incomplete code
class PasswdReader(object):
    def __init__(self, location, reader):
        self.location = location
        self._reader = reader

    ...
    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass
This is wrong because if the reader is given a filename and calls open() then the context manager must call the corresponding close() in the __exit__, while if the reader is given a file object then the __exit__ should not do anything to the file.

The more correct class takes a "close" parameter which, if not None, is called during __exit__:

class PasswdReader(object):
    def __init__(self, location, reader, close):
        self.location = location
        self._reader = reader
        self._close = close

    ...

    def __enter__(self):
        return self

    def __exit__(self, *args):
        if self._close is not None:
            self._close()
            self._close = None

I also need to change read_passwd_entries() to do the right thing based on the type of the first parameter. I've renamed the parameter to "source", and added a bit of compatibility code so the string type check works under both Python 2 and Python 3. Here are the core changes:

try:
    _basestring = basestring
except NameError:
    _basestring = str

def read_passwd_entries(source, errors="warn", location=None):
    if isinstance(source, _basestring):
        infile = open(source)
        source_name = source
        close = infile.close
    else:
        infile = source
        source_name = getattr(infile, "name", None)
        close = None
    
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))
    
    if location is None:
        location = Location(source_name)

    return PasswdReader(location,
                        _read_passwd_entries(infile, error_handler, location),
                        close)
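
A quick usage sketch to show the two call styles (not part of the driver below): with a filename the reader opens the file and closes it in __exit__(), while with a file object the caller stays in charge of closing it:

# Pass a filename; the reader opens and closes the file itself.
with read_passwd_entries("/etc/passwd") as reader:
    for entry in reader:
        print(entry.name, "=>", entry.shell)

# Pass a file object; __exit__() leaves the file alone.
with open("/etc/passwd") as passwd_file:
    with read_passwd_entries(passwd_file) as reader:
        for entry in reader:
            print(entry.name, "=>", entry.shell)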

The final version

Enough has changed that I'll show the full code, instead of making you piece the parts together yourself.

from __future__ import print_function

import sys
import argparse

class PasswdReader(object):
    def __init__(self, location, reader, close):
        self.location = location
        self._reader = reader
        self._close = close

    def __iter__(self):
        return self._reader

    # For Python 2
    def next(self):
        return self._reader.next()

    # For Python 3
    def __next__(self):
        return next(self._reader)

    def __enter__(self):
        return self

    def __exit__(self, *args):
        if self._close is not None:
            self._close()
            self._close = None

class PasswdEntry(object):
    def __init__(self, name, passwd, uid, gid, gecos, dir, shell):
        self.name = name
        self.passwd = passwd
        self.uid = uid
        self.gid = gid
        self.gecos = gecos
        self.dir = dir
        self.shell = shell

class Location(object):
    def __init__(self, source=None, lineno=None, line=None):
        self.source = source
        self.lineno = lineno
        self.line = line

    def where(self):
        source = self.source
        if source is None:
            return "line %s" % self.lineno
        else:
            return "line %d of %r" % (self.lineno, source)
            

# Define different error handlers, and a common error formatter.

def _format_msg(location):
    return "Cannot parse record %s: %s" % (location.where(), location.line)

def ignore_handler(location):
    pass

def warn_handler(location):
    msg = _format_msg(location)
    sys.stderr.write(msg + "\n")

def strict_handler(location):
    msg = _format_msg(location)
    raise ValueError(msg)

error_handlers = {
    "ignore": ignore_handler,
    "warn": warn_handler,
    "strict": strict_handler,
}

# Main API call to read from a passwd file

try:
    _basestring = basestring
except NameError:
    _basestring = str

def read_passwd_entries(source, errors="warn", location=None):
    if isinstance(source, _basestring):
        infile = open(source)
        source_name = source
        close = infile.close
    else:
        infile = source
        source_name = getattr(infile, "name", None)
        close = None
    
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))
    
    if location is None:
        location = Location(source_name)

    return PasswdReader(location,
                        _read_passwd_entries(infile, error_handler, location),
                        close)

# The actual passwd file parser, as a generator used by the PasswdReader.

def _read_passwd_entries(infile, error_handler, location):
    lineno = 0  # define lineno even if the file is empty
    for lineno, line in enumerate(infile, 1):
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue

        # Track where we are
        location.lineno = lineno
        location.line = line

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            # Handle the failure
            error_handler(location)
            # If we get here then continue to the next record
            continue
        
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

    # Save the final line number
    location.lineno = lineno

# Driver code to help test the library

parser = argparse.ArgumentParser(
    description = "List username/shell entries in a password file")
parser.add_argument("--errors", choices = ["ignore", "warn", "strict"],
                    default = "strict",
                    help = "Specify the error handling policy")
parser.add_argument("--with-lineno", action="store_true",
                    help="include line numbers in the output")
parser.add_argument("filename", nargs="?", default="/etc/passwd",
                    help = "Password file to parse")
                  
def main():
    args = parser.parse_args()
    filename = args.filename
    if args.with_lineno:
        output_fmt = "{location.lineno}: {entry.name} => {entry.shell}"
    else:
        output_fmt = "{entry.name} => {entry.shell}"
    try:
        with read_passwd_entries(filename, errors=args.errors) as reader:
            location = reader.location
            try:
                for entry in reader:
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
    except IOError as err:
        raise SystemExit("ERROR with password file: %s" % (err,))

if __name__ == "__main__":
    main()
The result is a lot more complex than the original 8 line parser, because it's a lot more configurable and provides more information. You can also see how to expand it further, like tracking the current record as another location field. In my chemistry parsers, I also track byte positions, so users can determine that record XYZ123 is between bytes M and N.
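
As a rough sketch of that sort of extension (this is not the chemfp implementation; the start_offset and end_offset attributes are illustrative additions to the Location class, and the byte arithmetic assumes a plain ASCII file with "\n" line endings):

def _read_passwd_entries(infile, error_handler, location):
    offset = 0   # byte offset of the start of the current line
    lineno = 0   # define lineno even if the file is empty
    for lineno, raw_line in enumerate(infile, 1):
        line = raw_line.rstrip("\n")
        # Ignore comments and blank lines, but keep counting bytes
        if line[:1] == "#" or line.strip() == "":
            offset += len(raw_line)
            continue

        # Track the line number and the byte range for this record
        location.lineno = lineno
        location.line = line
        location.start_offset = offset
        location.end_offset = offset + len(raw_line)
        offset = location.end_offset

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            error_handler(location)
            continue
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

    location.lineno = lineno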

Location through accessor functions

Unfortunately, this more complex API, with location tracking, is also a bit slower than the simple parser, and trickier to write. For example, the following lines:

        # Track where we are
        location.lineno = lineno
        location.line = line
always occur, even when the location information isn't needed, because the parser doesn't know if the API caller will need the information. In addition to the extra processing overhead, I found that a more complex format might have multiple branches for the different processing and error paths, each of which needs to set the location information correctly. This is error-prone.

There are a few ways to minimize or shift that overhead, at the expense of yet more code. In my chemical structure parsers, I expect that most people won't want any of this information, so I made the location code more expensive to get and less expensive to track.

What I did was to change the Location class so I can register a handler for each attribute. To get the value for "lineno", the location calls the associated handler. This function can be defined inside the scope of the actual reader function, so it has access to the internal "lineno" variable. This means the reader function doesn't need to set location.lineno, at the expense of an extra function call to get the value.

I'll sketch what I mean, so you can get an idea. This essay is long enough as it is, so I'll only give a flavor and not show the full implementation.

Here's a variation of the Location class, with the ability to register the "lineno" property:

class Location(object):
    def __init__(self, filename=None):
        self.filename = filename
        self._get_lineno = lambda: None
    
    def register(self, **kwargs):
        if "get_lineno" in kwargs:
            self._get_lineno = kwargs["get_lineno"]
    
    def save(self, **kwargs):
        if "lineno" in kwargs:
            self._get_lineno = lambda value=kwargs["lineno"]: value
    
    @property
    def lineno(self):
        return self._get_lineno()
The updated reader code, which registers the handler and cleans up afterwards, is:
def read_passwd_entries(source, errors="warn", location=None):
    ...
    reader = PasswdReader(location,
                        _read_passwd_entries(infile, error_handler, location),
                        close)
    next(reader) # prime the reader
    return reader

def _read_passwd_entries(infile, error_handler, location):
    lineno = 0
    def get_lineno():
        return lineno
    
    location.register(get_lineno = get_lineno)
    yield "ready" # ready to start reading
    
    try:
        for lineno, line in enumerate(infile, 1):
            ....
    finally:
        location.save(lineno=lineno)

Property registration by forcing the generator to start

The biggest complication with this approach is the timing order of location registration. Registration must occur inside of the generator, in order to get access to the local scope. However, the code in the generator won't be executed until the first item is needed.

Consider for example:

with read_passwd_entries(filename) as reader:
    print(reader.location.lineno)
You should expect this to print either 0 or 1 (I think it should print 0). But if the generator hasn't started then the get_lineno isn't registered, so the default value of None will be returned.

Instead, I prime the generator in read_passwd_entries() by doing:

    next(reader)
This forces the generator to execute up to the first "yield" statement, which is:
    yield "ready" # ready to start reading
(The value of "ready" is thrown away.)
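
Here's a stripped-down, self-contained demonstration of the priming pattern, separate from the passwd reader (the Tracker and counter names are made up for this example):

class Tracker(object):
    def __init__(self):
        self._get_value = lambda: None
    def register(self, get_value):
        self._get_value = get_value
    @property
    def value(self):
        return self._get_value()

def counter(tracker, items):
    count = 0
    def get_count():
        return count
    tracker.register(get_count)
    yield "ready"   # priming point; the caller throws this value away
    for count, item in enumerate(items, 1):
        yield item

tracker = Tracker()
gen = counter(tracker, "abc")
print(tracker.value)   # None - the generator body hasn't started yet
next(gen)              # prime it; runs up to the first yield
print(tracker.value)   # 0 - get_count() is now registered
next(gen)              # returns "a"
print(tracker.value)   # 1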

Removing reference cycles

The other complication is the cyclical memory references. The _read_passwd_entries() generator has a reference to the location instance as a local variable, the location instance has a reference to the get_lineno() function, and the get_lineno() function knows about its enclosing scope, which is inside the _read_passwd_entries() generator. While Python has garbage collection which can handle those cycles, some of the location properties can be complex objects, like file handles and molecule objects, which use resources that aren't fully exposed to Python.

The try/finally block exists to break the cycle. The finally removes the link to get_lineno() and replaces it with the final value for lineno.

Influence from XML processing

My ideas are influenced by the SAX2 API, which you can see by looking at its Locator class, and to a lesser extent the ErrorHandler class. (I once implemented an entire bioinformatics parsing system called Martel, based on SAX events.)

One of the biggest differences is that XML and SAX deal with potentially deep tree structures, while I only work with a linear sequence of records, turned into molecules. My thoughts about an API which is better for this data were originally guided by Fredrik Lundh's iterparse.



PNG checksum

The chemfp FPB fingerprint format is a chunk-based format. An FPB file starts with an 8 byte signature followed by a sequence of chunks. Each chunk starts with a four character chunk code followed by an 8 byte length, and then that many bytes for the data itself.
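
To make that layout concrete, here's a rough sketch of walking the chunk structure in Python. The real FPB specification defines the signature value and the byte order of the length field; the little-endian choice below is only an assumption for the example:

import struct

def iter_chunks(infile):
    """Yield (chunk_code, data) pairs from an IFF/FPB-style chunked file."""
    infile.read(8)                  # skip the 8 byte file signature
    while True:
        header = infile.read(12)    # 4 byte chunk code + 8 byte length
        if len(header) < 12:
            break
        code = header[:4]
        (length,) = struct.unpack("<Q", header[4:])  # assumed little-endian
        yield code, infile.read(length)

# Usage sketch:
# with open("example.fpb", "rb") as f:
#     for code, data in iter_chunks(f):
#         print(code, len(data))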

The general idea for this type of format dates back to at least the IFF format from 1985. I first came across it when I read the PNG specification. I liked it so much that I decided to model FPB along those lines.

My first attempt was in 2008(!). I've iterated a few times since then. The result is more IFF-like than PNG-like. This essay describes the reasons for those changes.

4 bytes vs. 8 bytes

IFF and the PNG specification use a 4 byte chunk length. If the image data doesn't fit into a single chunk then it can be split across several chunks. Actually, quoting from the PNG spec: "Multiple IDAT chunks are allowed so that encoders can work in a fixed amount of memory; typically the chunk size will correspond to the encoder's buffer size." Chunks are also useful because a PNG viewer can update the image display after each chunk, even though the file hasn't been completely downloaded.

4 bytes is only enough for 4 GB. My design goal is 100 million fingerprints; at 1024 bits each, that's about 12 GB of fingerprint data. There's very little I can do with only a chunk of the entire data set, and for search performance and ease of programming I want all of the fingerprints in a single contiguous region.

That's why I used an 8 byte chunk length. (I could have used a 6 byte length instead of 8, which sets an upper limit of 256 TB. That's enough space for 275 billion 1024-bit fingerprints. While that's more than the 166 billion from GDB-17, I figured I would be decadent and use two more bytes.)

chunk CRC

A curious thing about the PNG format, compared to most other IFF-like formats, is the 4-byte CRC, which occurs as the fourth component of the chunk, after the chunk data. Here's how the specification describes it:

A 4-byte CRC (Cyclic Redundancy Check) calculated on the preceding bytes in the chunk, including the chunk type code and chunk data fields, but not including the length field. The CRC is always present, even for chunks containing no data. See CRC algorithm.
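
Checking that in Python is short, since the CRC-32 that zlib computes is the same one PNG uses. This sketch (the read_and_check_chunk() name is mine) assumes the caller has already skipped the 8 byte PNG file signature:

import struct
import zlib

def read_and_check_chunk(infile):
    """Read one PNG chunk and verify its CRC; return (chunk_type, data)."""
    length, chunk_type = struct.unpack(">I4s", infile.read(8))
    data = infile.read(length)
    (stored_crc,) = struct.unpack(">I", infile.read(4))
    # The CRC covers the chunk type and data, but not the length field.
    computed_crc = zlib.crc32(chunk_type + data) & 0xffffffff
    if computed_crc != stored_crc:
        raise ValueError("CRC mismatch in %r chunk" % (chunk_type,))
    return chunk_type, data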

Part of me likes this extra integrity check. The question I had to ask myself was, should I carry that idea over to the FPB format?

Why chunk CRCs instead of a file CRC?

The CRC is designed to detect certain types of transmission errors. ("Transmission" includes data storage, which is data transmission to the future.) It can, for example, detect if a burst of up to 32 zeros or 32 ones corrupted the transmission. As a trivial reduction, it can also tell if there was a single corrupted bit.

But why does each chunk have a CRC? Why not the file itself? For that I consulted the PNG mail archive. The short answer: Info-ZIP experience and Usenet.

First off, why does it have a CRC at all? Based on my reading, Greg "Cave Newt" Roelofs proposed a CRC in the first place. Roelofs became the UnZip maintainer with release 4.0, in December 1990, and became involved with PNG development in January 1995. PNG's reference compression code comes from Zip and UnZip. Zip is an archive format, and the CRC helps verify long-term integrity of the zip file.

Roelofs proposed a per-file CRC, but there was a lot of discussion about where to put it. Ed Hamrick suggested "Why not just put a CRC as the last field of every chunk?", and Lee Daniel Crocker listed various advantages to a per-chunk CRC:

 1. Streamability.  Streaming applications that read/write chunks do
    not have to maintain state information across chunks. ...

 2. Simpler file changes. ...

 3. You can make changes to comments, gamma, and other chunks with a
    binary editor, and programs won't think the whole file is bad. ...

 4. More data integrity. While a CRC-32 is likely to catch changes in
    even a very large file, it is mathematically guranteed to catch 1-
    and 2-bit errors in an 8k chunk (I don't the book with me right now,
    but I think those are the right numbers).

 5. You'll never hear me argue about where to put the CRC again, and
    the nesting problem with future extensions is gone, and Newt gets to
    keep his "chunk concept".  Hell, if a 0.0005 increase in file size
    can just cut the message traffic on this list down, it's worth it.

 6. You can cut the file size issue in half by using 16-bit chunk
    lengths (a separate argument I lost earlier but now have more
    weight behind), and eliminate it entirely by doing that and using
    CRC-16s.  I personally favor the former but not the latter, and
    don't really care that much either way.  That will even end the
    continual "this is just IFF" noise on comp.graphics too.
There was a bit of back-and-forth about using CRC-16 vs CRC-32 vs. something stronger, like MD5. It was Alexander Lehmann who described the appropriate engineering tradeoffs for PNG:
IFF doesn't contain any checksumming, validity checks can be done with length checking (unexpected EOF), but that is not sufficient for network (especially multipart USENET) transmission.

If an uuencoded file contains truncated lines, this goes unnoticed by length checks, unless the missing chars are contained in a header or length chunk. (unless the image data is decompressed, but I would prefer a file checker that doesn't burn excess cpu cycles just to check if the uncompressed data has the correct size)

So there you go: chunk CRCs were added to help detect transmission errors over multi-part Usenet transmissions without needing to wait until the end to detect failures.

Multi-part Usenet transmissions?

If you don't recall the glory days of Usenet, it is a distributed newsgroup system designed around text. Post sizes were limited. If you wanted to send a large image, or application binary, or directory archive, you needed to split it up into multiple parts, encode it (usually uuencoded) and post each part separately.

For example, the first public release of Python was sent to alt.sources in 21 parts. The first was titled "Python 0.9.1 part 01/21", followed by 02, etc.

These could be decoded and reassembled either manually or with various automated tools. Some newsreaders could even figure this out automatically. But sometimes things went missing or were incomplete.

Given this sort of noisy environment, I think you can understand better why a chunk CRC might be useful.

Is CRC good enough?

I wondered: is a CRC good enough? Should I use something stronger than a CRC? How many PNG failures actually occur?

We know that bit errors exist. Artem Dinaburg, in Bitsquatting: DNS Hijacking without exploitation discusses one of the effects of single-bit errors in DNS resolution, with links to previous work including a Verisign paper estimating the number of network errors in DNS queries (a bit error rate of 10⁻⁹ is about what we'd expect for typical data communication circuits), and how to use bit errors to break out of a virtual machine.

But those are typically taken care of by the network layer. On the other hand, Bram Cohen, in his Stanford EE380/2005 talk "Under the hood of BitTorrent" (video no longer available, or at least, I can't find it) mentioned how TCP's checksum isn't good enough once you start swapping terabytes of data over the Internet, so BitTorrent had to add additional validation.

I asked on the PNG mailing list, but no one recalled seeing an error in over 10 years.

Going back to the original design work, Roelofs commented that "logical though a per-chunk CRC might be, it's overkill" – and that's coming from the guy who proposed (demanded) CRCs in the first place. I agree. There's no need to have as many CRC validations as PNG has.

My data files are at most about 10 GB, and more typically around 0.5 GB. I concluded that was nowhere near the point where I had to worry about sophisticated or advanced error detection.

Single bit failures are possible in the PNG format

By the way, there's a subtle framing issue with a chunk CRC. The length field determines where to find the CRC, but is not part of the CRC. If one of the length bits changes then there's a chance - a very small chance - that the bytes at the new position have a valid CRC for the intervening bytes.

There's an even smaller chance that the new chunk will contain valid data for that chunk type, so this isn't something to worry about for real-world data.

On the other hand, if you think about it for a bit you'll realize that it's possible to construct a valid PNG such that a single bit change will still generate a valid PNG. Sketch: add a private ancillary chunk whose payload is the alternate PNG, plus a partial final chunk designed to skip the rest of the original PNG. It will need to be a little clever to include the bytes needed to resynchronize with another CRC, but that isn't very hard.

Sadly, I couldn't figure out a way to use this for evil.

When to validate the CRC? What to do when it fails?

What happens when the CRC fails?

Most PNG readers exist to display the image. They need to read all of the image data, which means reading and decoding nearly all of the chunks in a data file. The overhead for CRC validation is trivial. The main question for the reader is simply, what to do when it fails?

If you use libpng then you call png_set_crc_action() to tell it what to do on validation failure. The actions include "warn, and use the data", "warn and discard the data" (allowed only for non-critical chunks), "use the data without warning", and "exit with an error." All reasonable actions for different circumstances.

My fingerprint data is a bit different from image data. For example, a data file with 1.3 million 1024-bit fingerprints takes 217 MB. Just running wc -l on that file takes 0.2 seconds, which sets a lower bound on the time needed to compute a CRC over it.

Not only is my data set larger than a typical image, but I don't always need to use all of the data. Simsearch uses the sublinear search algorithm described by Swamidass and Baldi, which greatly reduces the search space for high threshold searches. As a result, typical search time is around 1/100th of a second, well less than the time needed to verify any CRC.

This tells me that FPB-based applications will rarely want to validate the fingerprint chunk, much less the entire file. Experience tells me that if validation fields are rarely used, then they are rarely tested, and more likely to be error-prone. In fact, that's likely why there's a "warn, and use the data" option in libpng.

No chunk CRC for FPB

I concluded that it's not worthwhile to follow the PNG lead and store a CRC with each chunk. Generating the chunk CRC is a bit of a hassle, with very little gain.

Instead, I'll wait to see real-world failure cases before deciding if I need to have any sort of validation, or possibly even error correction. Should this happen, I'll define a new chunk type which can store the validation data about the preceding chunks, and place it just before the end of the FPB file.



chemfp's FPB format

Chemfp 1.2 supports a new fingerprint file format, called the "FPB" format. It's designed so the fingerprints can be memory-mapped directly to chemfp's internal data structures. This makes it very fast, but also internally complicated. Unlike the FPS format, which is designed for exchanging fingerprints between diverse programs, the FPB format is a binary application format. Internally it's a chunk-based container file format similar to PNG, Interchange File Format, and similar type-length-value formats. I'll talk more about the details in a future essay.

chemfp business model

Chemfp is a package for cheminformatics fingerprint generation and high-speed Tanimoto search. Version 1.1 is available for free, under the MIT license. Version 1.2 is the first release with my new business model. It's "free software for a fee." It's still under the MIT license, but you need to pay to get a copy of it.

Previously the commercial and no-cost versions were the same version, but who wants to pay for something that's available for nothing? Many free software projects suffer from a resource problem because it's hard to get funding when you don't charge anything for the software. But people will pay to get access to useful features, and that money goes into support and additional development. If all goes well, I'll release older commercial versions as no-cost versions after a few years.

FPS files take a couple seconds to load

Perhaps the most widely useful new feature in chemfp-1.2 is the FPB file format, which complements the FPS file format. The FPS format is a human-readable text format which is easy to generate and parse, but it's not designed to be fast to read and write. I'll show you what I mean using ChEMBL 18 as my target database, and Open Babel's 1021-bit FP2 fingerprints.

I'll create the fingerprints in FPS format:

% ob2fps chembl_18.sdf.gz --id-tag chembl_id -o chembl_18_FP2.fps
% head -7 chembl_18_FP2.fps
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/2.3.90
#source=chembl_18.sdf.gz
#date=2014-07-08T22:07:54
20000120200a0010402006040c00000064000000000220c80080104c03104c01000041
0000488021808002180a000000000020001800200084348082000802010c000c000320
020409000000080000041000017004cb10009340000000010000888012000001004010
20000420020029100f8010000900800008010002000300      CHEMBL153534
then do a similarity search, asking it to report the search times to stderr:
% simsearch --query 'Cn1cnc2c1c(=O)n(c(=O)n2C)C' chembl_18_FP2.fps --time
#Simsearch/1
#num_bits=1021
#type=Tanimoto k=3 threshold=0.7
#software=chemfp/1.2b2
#targets=chembl_18_FP2.fps
#query_sources=chembl_18.sdf.gz
#target_sources=chembl_18.sdf.gz
3	Query1	CHEMBL113	1.00000	CHEMBL1767	0.97222	CHEMBL74063	0.97183
open 0.00 search 2.15 total 2.15
The "--query" command-line parameter is new in chemfp-1.2. It takes a SMILES string by default. Simsearch looks at the fingerprint type line of the header to get the appropriate toolkit and generate the corresponding fingerprint for that structure query record.

Why aren't FPS searches faster?

Similarity search in chemfp is supposed to be fast. Why does it take over 2 seconds to search 1,352,681 records? Answer: nearly all of the time is spent reading the data and parsing the FPS file. Just doing a "wc -l" on the file takes 0.5 seconds, which puts a floor on the total search time, unless I switch to an SSD.

This is why Noel O'Boyle's head-to-head timing comparison against Open Babel's own "fastsearch" finds that they have similar search times for single query searches; both simsearch and fastsearch are mostly I/O and parser bound.

Use 'fpcat' to convert from FPS to FPB format

I'll do the same test, but with the FPB format. I could ask ob2fps to write an FPB file instead of an FPS file, by simply changing the extension of the output file, like this:

% ob2fps chembl_18.sdf.gz --id-tag chembl_id -o chembl_18_FP2.fpb
However, this will re-parse the structures and recompute the fingerprints, which takes a long time.

Since I already have the fingerprints in the FPS file, I'll instead use the new "fpcat" program to convert from FPS format to FPB format.

% fpcat chembl_18_FP2.fps -o chembl_18_FP2.fpb
The conversion took about 12 seconds to run. The FPB format is pre-sorted and indexed by population count, to enable sublinear similarity search directly on the file, and the fingerprints are word aligned for optimal popcount calculations.

You can get a sense of the popcount ordering by using "fpcat" to view the contents of the FPB file as an FPS file:

% fpcat chembl_18_FP2.fpb | head -6
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/2.3.90
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000	CHEMBL17564
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000	CHEMBL1098659
% fpcat chembl_18_FP2.fpb | tail -1
373e2327965282fe3795e7443613480cb59dd5d47164f8cc11ac2bbe91be9f873c09f1
dfaff7b0ffb09eb7fb21993243e3dbea4038a0e011a60be22f229e2634ced97e0a0e7c
9b832ffbb502d6a8139e08bccffaf5bb3ba8f36edf23814fe2953ff77738e10615f32a
e09040f7d42cbf510f15df765b3a6279f802471a86cb04	CHEMBL2368798
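
That popcount ordering is what makes the sublinear search work. The Tanimoto score between fingerprints with popcounts a and b can be at most min(a, b)/max(a, b), so a query with popcount q and threshold T only needs to examine targets whose popcounts fall between T*q and q/T, and the popcount index says where those targets start and end in the file. A small sketch of the bound (this is the idea, not chemfp's API):

import math

def candidate_popcount_range(query_popcount, threshold):
    """Return the (low, high) target popcounts allowed by the threshold."""
    q = query_popcount
    low = int(math.ceil(threshold * q))
    high = int(math.floor(q / threshold))
    return low, high

# A query with 250 bits set and a 0.7 threshold only needs targets
# with popcounts between 175 and 357.
print(candidate_popcount_range(250, 0.7))   # (175, 357)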

FPB searches are much faster

Simsearch accepts FPS and FPB files. By default it figures out the file type by looking at the file extension. I'll pass in the .fpb version:

% time simsearch --query 'Cn1cnc2c1c(=O)n(c(=O)n2C)C' chembl_18_FP2.fpb --time
#Simsearch/1
#num_bits=1021
#type=Tanimoto k=3 threshold=0.7
#software=chemfp/1.2b2
#targets=chembl_18_FP2.fpb
3	Query1	CHEMBL113	1.00000	CHEMBL1767	0.97222	CHEMBL74063	0.97183
open 0.00 search 0.00 total 0.00
0.305u 0.070s 0:00.41 90.2%	0+0k 0+0io 0pf+0w
Yes, FPB search takes less than 1/100th of a second. I wrapped everything in the "time" command to show you that the whole search takes 0.4 seconds. Much of that extra time (about 0.25 seconds) is spent waiting for Open Babel to load its available file formats from my hard disk, but there's also overhead for starting Python and importing chemfp's own files.

The slowest part is loading Python and Open Babel

In fact, I'll break it down so you can get a sense of how long each part takes:

# Python startup
% time python -c "pass"
0.011u 0.007s 0:00.01 100.0%	0+0k 0+8io 0pf+0w

# Open Babel extension overhead
% time python -c "import openbabel"
0.027u 0.013s 0:00.04 75.0%	0+0k 0+3io 0pf+0w

# Overhead for Open Babel to load the available formats
% time python -c "import openbabel; openbabel.OBConversion()"
0.233u 0.021s 0:00.25 100.0%	0+0k 0+0io 0pf+0w

# Chemfp import overhead to use Open Babel
% time python -c "from chemfp import openbabel_toolkit"
0.281u 0.032s 0:00.31 100.0%	0+0k 0+26io 0pf+0w
In other words, about 0.3 seconds of the 0.4 seconds is used to get Python and Open Babel to the point where chemfp can start working.

When is the FPB format useful?

If you already have a fingerprint (so no toolkit overhead), or have an SSD, then the total search time on the command-line is less than 0.1 seconds.

For a command-line user, this is great because you can easily integrate similarity searches into scripts and other command-line tools.

It's also useful for web development. Of course, a web server in production rarely restarts, so the load time isn't critical. But as you develop the server you end up restarting it often. Many web application frameworks, including Django, will auto-reload the application server every time a file changes. It's annoying to wait even two seconds for 1.3 million records; imagine how much more annoying it is with a few fingerprint sets of 5 million records each.

Switch to the FPB format, and reloads become fast again. Even better, because FPB files are memory-mapped, the operating system can share the same memory between multiple processes. This means you can run multiple servers on the same machine without using extra memory.
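
Here's a sketch of what that sharing means at the Python level (chemfp handles this internally; the snippet illustrates memory-mapping in general, not chemfp's API). A read-only mapping is backed by the operating system's page cache, so every process that maps the same file reuses the same physical pages:

import mmap

with open("chembl_18_FP2.fpb", "rb") as f:
    fp_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slices read directly from the shared, page-cached file data;
    # no per-process copy of the fingerprints is made.
    signature = fp_map[:8]
    fp_map.close()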

Combine these together and you'll see that even CGI scripts can now include similarity search functionality with good performance. (Yes, there are still good reasons for using CGI scripts. The last one I developed was only two years ago; it was supposed to be a drop-in replacement for a system developed 10 years earlier.)

More about chemfp

If you're interested in chemfp, see the product page and download version 1.1 to evaluate the performance. If you're interested in paying for access to version 1.2, email me.



The origin of the connection table

Instead of doing real work over the last month, I ended up trying to understand more about the origin of the connection table, and the origin of the phrase "connection table."

(Regarding real work, chemfp-1.2 is in beta testing, and will be released this month. If you're interested in high-performance Tanimoto fingerprint search, take a look at it. Version 1.1 is available right now, at no cost, for you to download, evaluate, and use, before deciding to buy a copy of version 1.2.)

Importance of Calvin Mooers' connection table

In my previous essay, I described some of the work of Calvin Mooers in the area of chemical documentation now called cheminformatics. While I don't think he implemented any of his ideas, they were quite influential. My research has led me to believe that he developed the first practical connection table.

Here are examples from the literature which support my belief:

All of these refer to Mooers' paper "Ciphering Structural Formulas – the Zatopleg System", Zator Technical Bulletin No. 59 (1951). Unfortunately, I haven't yet been able to get ahold of that paper, but a patent reference in US3476311 and the Meyer and Wenke (1962) citation above give enough of a description to be certain that it's an entirely reasonable and usable connection table format.

George Wheland as the creator of the connection table

My research led me to believe that Mooers invented the connection table. If you look around, you'll see that some people say it was George Willard Wheland who came up with the connection table, in 1949. These include:

I thought this was odd, since I haven't come across any historical references to Wheland's connection table. It's not unusual, though, for several people to come up with an idea independently, and for the earlier publication to go largely unnoticed until after the later publication becomes widely known. One of the best known examples is the Cooley-Tukey fast Fourier transform, which was published in 1965. Only later was it identified that Carl Friedrich Gauss developed and used the same algorithm in 1805.

What is Wheland's connection table?

The Wheland connection table reference is from the textbook Advanced Organic Chemistry, p87. Thanks to the University of Michigan and the HathiTrust for scanning that book and making it available.

The connection table is expressed as the upper-right triangle connection matrix, where position (i, j) contains the covalent bond order, or is 0 if there is no bond. Here's an image of the two example tables:

I don't like this. While I can see how it might be a connection table, it's not a good one. For one, it takes N*(N-1)/2 entries of memory, so a 1,000 atom molecule will take about 0.5 MB, almost all of it zeros. This is possible on modern machines, but there's no way a 1950s/1960s era programmer would choose this approach.

(You can see that concern about space in Lynch's essay, cited above, where they evaluated the Meyer and Wenke (1962) paper, also cited above:

Ernst Meyer at BASF had also introduced a form of connection table, which to our eyes at the time seemed highly redundant and space consuming (Meyer & Wenke, 1962).
Meyer's connection table, based on the Mooers connection table, allocates space for 4 bonds for each atom. This sets an upper limit on the atom valence, but also means a lot of 0s for halogens, oxygens, etc. In any case, it's still much more compact than the connection matrix.)
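
To make the space comparison concrete, here's a small sketch that builds a Wheland-style upper-triangle table from a bond list. The acetaldehyde bond list (with explicit hydrogens) and the helper name are mine, for illustration only:

def upper_triangle_matrix(num_atoms, bonds):
    """Store one bond-order entry for every atom pair above the diagonal."""
    matrix = dict(((i, j), 0)
                  for i in range(num_atoms)
                  for j in range(i + 1, num_atoms))
    for atom1, atom2, order in bonds:
        i, j = sorted((atom1, atom2))
        matrix[(i, j)] = order
    return matrix

# Acetaldehyde (CH3-CHO) with explicit hydrogens: atoms 0 and 1 are carbons,
# atom 2 is the oxygen, and atoms 3-6 are hydrogens.
bonds = [(0, 1, 1), (1, 2, 2), (0, 3, 1), (0, 4, 1), (0, 5, 1), (1, 6, 1)]
matrix = upper_triangle_matrix(7, bonds)
print(len(matrix))                                   # 21 entries for 7 atoms
print(sum(1 for order in matrix.values() if order))  # only 6 are non-zero

# A 1,000 atom molecule needs 1000*999/2 = 499,500 entries - roughly half
# a megabyte at one byte per entry - which is the space concern above.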

What's interesting is that Wheland also didn't think this matrix was useful, writing:

The irrelevance of geometrical considerations in the definition of a structure can be shown most conclusively by a discussion of some of the remaining, less convenient and less familiar, ways in which structures can be specified. One of these ways consists in giving a purely verbal description ...
    A further nongeometrical way of describing the structures of acetaldehyde and of ethylene oxide is slightly more illuminating than the verbal one; when this method is adopted, the two structures are expressed as in Tables 4-1 and 4-2, respectively. The numbers in the bodies of these tables represent the number of covalent bonds between the corresponding atoms at the left of the rows and at the tops of the columns. The two tables are easily seen to be different from one another, but to be completely equivalent to the respective conventional diagrams and verbal descriptions. ...
The phrase "some of the remaining" suggests that Wheland did not come up with these non-geometrical descriptions, but really interesting part is on page 88:
... The foregoing alternative ways of describing structures have not been given here with the idea that they would be of practical use, but rather with the hope that they would serve to emphasize the fact that structures, as such, need have no geometrical implications. Since exactly the same information which is contained in a conventional diagram can be given equally well (even though incomparably less conveniently) by an obviously nongeometrical verbal description or table, then the diagram, in spite of its appearance, must also be actually nongeometrical. In other words, all geometrical features of the diagrams which are not contained in either the verbal description or the table must be disregarded as of no significance. (Later, when the discussion is of configuration rather than of structure, this extreme statement will require some modification. See Chapters 6-9.)
That sounds like Wheland didn't think it would be useful for practical matters.

That makes Lynch's quote, "G. W. Wheland had been the first to show how this could be done (1949)", all the more odd. Wheland didn't show how this could be done, in any practical sense, nor did Wheland claim that it was practical. And as I mentioned, I also get the impression that Wheland might not have created that representation.

Wheland (1949) should likely be Wheland (1946)

All of the four Wheland quotes I listed reference Wheland (1949). The connection matrix is on p87 in chapter 4. The copyright page says Chapters 1-10 copyrighted as Syllabus for Advanced Organic Chemistry 321 by The University of Chicago, 1946. Thus, if the earlier version also has the connection matrix, then the correct citation should likely be Wheland (1946).

Jarosław Tomczak is the only one who referenced Wheland (1946). Tomczak's paper also describes the structure in Table 4-1, so either Tomczak looked at Wheland's text book, or used a very good secondary reference. In either case, I commend the good scholarship.

The question I have is, does the earlier book include the connection matrix? According to WorldCat, copies of that book are available from the University of Chicago, Wayne State University, and Mississippi State University libraries. Perchance a reader could get ahold of the book and verify it for me?

Who was Wheland?

It's easiest if I just copy the abstract from the chapter George W. Wheland: Forgotten Pioneer of Resonance Theory:

George W. Wheland, although little remembered by the general chemistry public today, is forever linked to resonance theory through three seminal papers written with Linus Pauling and through two substantial monographs (1944 and 1955) on resonance. At the University of Chicago he carried out research on organic acids and bases, while continuing to publish papers on quantum chemistry. He also wrote three editions of a highly regarded text on "Advanced Organic Chemistry." Sadly, his scientific career ended long before his death when he contracted multiple sclerosis. This chapter gives an overview of his career, writings, and research in quantum chemistry.

One of those papers with Pauling is "The nature of the chemical bond. V. The quantum-mechanical calculation of the resonance energy of benzene and naphthalene and the hydrocarbon free radicals." March 21, 1933. J. Chem. Phys. 1 (June 1933): 362-374. You should take a look at the first few pages to remind yourself that only 90 years ago we still weren't really sure about the structure of benzene.

Apparently here we have a case where the classical ideas of structural organic chemistry are inadequate to account for the observed properties of a considerable group of compounds. With the development of the quantum mechanics and its applications to problems of valence and molecular structure, it became evident to workers [as, for example, Slater in 1931] in this field that the resonance of benzene between the two equivalent Kekulé structures was an essential feature of the structure of this molecule, accounting for the hexagonal symmetry of the ring and for its remarkable stability; and it seemed probable that the quantum mechanical treatment of aromatic molecules would lead to a completely satisfactory explanation of their existence and characteristic properties.

In the paper they described a simplification of Hückel's work which made it practical to extend valence bond theory to larger structures like naphthalene. I think someone who worked with the secular equation of Hückel could easily come up with the connection matrix, so it's quite possible that Wheland was the first to come up with the idea.

BTW, I don't know much about quantum chemistry, so "A Chemist's Guide to Valence Bond Theory" by Sason S. Shaik and Philippe C. Hiberty was quite insightful to this outsider. The authors go into some of the history and impact of that paper on the field. I like the quote from Wheland which "explains the resonance hybrid with the biological analogy of mule = donkey + horse."

I really like how the authors describe the historical context of the debate between valence bond theory and molecular orbital theory. My chemistry knowledge is not very deep, and while I know some basic molecular orbital theory, my intuition is more aligned with the classical Lewis dot model of valence bond theory. Shaik and Hiberty describe how the valence bond model was popular until the 1950s precisely because it can be seen as a quantum mechanical interpretation of the classical structural model, which appealed to most chemists of that era. One of the many things which helped MO gain ground was "the construction of intuitive MO theories"; I never got to that point in my chemical studies.

I also found the discussion about the "religious war-like rivalry" between valence bond theory and molecular orbital theory quite fascinating. Chemists are, after all, human.

Why is Wheland recognized as the creator of the connection table?

I think Wheland's connection matrix is a precursor to a connection table, but it's not really a usable connection table for chemical informatics. I don't know of any cheminformatics toolkit based on that data structure, either now or back when the field was still called "chemical documentation." Nor did Wheland suggest that it would be practical. Why then do people reference Wheland?

(To be clear, my sense of "connection table" may be more restrictive than the one others use. I mean something which is reasonable to use in a chemical information system. It's also possible that early systems did use a connection matrix form for substructure search.)

I know Mireille, so I started by sending her email. She couldn't recall the details, nor did she have a record of it in her notes, but she believes her knowledge came from Lynch's essay or from an essay from Eugene Garfield.

I sent email to what I think is Jarosław Tomczak's address, but haven't received a reply.

I sent an email to Bob Williams, who doesn't remember that detail after 16 years. Bob got help from Val Metanomski (now deceased), Eugene Garfield, and Mary Ellen Bowden, and suggested I contact them.

I emailed both of the latter. Gene replied, but wasn't able to resolve these details despite searching for a couple of hours. I haven't heard back from Bowden.

Working hypothesis: people are referencing an intermediate publication

My working hypothesis is that a book or essay was published between about 1980 and 1995 with the history of the chemical representations, including line notations and the connection table. This author, through good scholarship, came across Wheland's "highly regarded text", and described it as a connection table, using a broader definition of 'connection table' than I have. Many people read that book, and the ideas in it became part of the collective knowledge.

One such book might have been "Chemical Graph Theory: Introduction and Fundamentals", edited by Danail Bonchev and Dennis H. Rouvray. Quoting from chapter 3, "Nomenclature of Chemical Compounds" page 99, by Alan L. Goodson:

A connection table has been defined [6] as a uniquely ordered list of the node symbols of the structure (or graph) in which the value (atomic symbol) of each node and its attachment (bonding) to the other nodes of the total structure are described ...

Connection tables, notations, and nomenclatures are of value in different ways and an in-depth discussion of their use has been published recently [8]. Connection tables, being atom-by-atom computer records of chemical structures, are useful for machine registration of chemical structure and for substructure searching ...
Under that definition, Wheland's matrix counts as a connection table. I'm surprised though that no one before Wheland considered a matrix representation of a valence bond model.

Google Books' text search failed to find "Wheland" in Bonchev and Rouvray's book, so that's not the intermediate publication.

Reference 8 is R. Lees and A. F. Smith (Ed.) "Chemical Nomenclature Usage", Ellis Horwood, Chichester (1983). A limited (search only) version is available from the HathiTrust, which is enough to determine that "Wheland" could not be found in the book.

So that's a dead end. Citation [6] is from Morgan, which I describe below. The chapter has an extensive list of references, which may be useful future leads, though they are more closely associated with nomenclature than a connection table.

Perhaps someone here has an idea of how Wheland became known as the creator of the connection table?

Who coined the phrase "connection table"?

I also wondered who was the first to use the term "connection table". Lynch says "With Dyson we looked at the random matrix, that is, connection tables", and Meyer and Wenke refer to Mooers's connection table as "topological coding", so it doesn't seem like people used "connection table" during the 1950s and early 1960s.

The earliest reference I've found so far is from the Cossum et al. quote and citation I mentioned earlier, which was received March 23, 1964.

With the help of Google Scholar, I found a slightly later publication which uses "connection table" in the title:

Cossum, W. E., M. E. Hardenbrook, and R. N. Wolfe, "Computer generation of atom-bond connection tables from hand-drawn chemical structures", Proceedings of the American Documentation Institute, 27th meeting, Philadelphia, Pennsylvania October 5-8, 1964; volume I, pp 269-275.

The term spread quickly. H. L. Morgan's "The Generation of a Unique Machine Description for Chemical Structures – A Technique Developed at Chemical Abstracts Service", J. Chem. Doc. (1965), received January 15, 1965, states on p. 108:

The structure description employed in the CAS registration process is a uniquely ordered list of the node symbols of the structure (or graph) in which the value (atomic symbol) of each node and its attachment (bonding) to the other nodes of the total structure are described. Such a list and description is called a "connection table." Since this paper is not concerned with structure input, the connection table which is described is that stored and manipulated by the computer. The form of the table which is used within the computer is not the most convenient form for input to the system; thus the input form is translated by the computer into the "compact connection table" developed by D. J. Gluck of du Pont [2].
where reference 2 is D. J. Gluck, "A Chemical Structure, Storage and Search System. Development at Du Pont" J. Chem. Doc. 5, pp. 43-51 (1965). Unfortunately, I don't seem to have a copy of that paper, but as it's a 1965 citation, it doesn't antecede 1964.

On the other hand, take a look at this citation (which I'm not paying $31.50 to read):

G. M. Dyson, W. E. Cossum, M. F. Lynch, H. L. Morgan, "Mechanical manipulation of chemical structure: Molform computation and substructure searching of organic structures by the use of cipherdirected, extended and random matrices", Information Storage and Retrieval, v1, issues 2-3, July 1963, pp 49-99. DOI: 10.1016/0020-0271(63)90011-1.
The abstract is:
General methods have been devised for generating mechanically (a) from the IUPAC cipher, and (b) from random-numbered structures, three types of matrix from which the molecular formula (molform) of a structure can be computed, and machine searches made for any conceivable substructure or combination of substructures. The machine/matrix language is independent of the generating cipher; such cipher, and other ciphers depending on exact structural delineation can be regenerated from the machine/matrix. This enables the latter to be used as a feasible common language between notations.
Isn't that interesting! All of the authors who used "connection table" in 1964/1965 are in that 1963 paper, all of them were at Chemical Abstracts, and they use the terms "cipherdirected, extended and random matrices", and not "connection table."

This is a pretty clear indication that the term was coined in late 1963 or early 1964, and likely by someone at CAS.

Possible antecedents

Google Scholar suggests a couple of antecedents to a 1963 date for "connection table", the most relevant being Marvin Minsky's Steps Toward Artificial Intelligence in Transactions on Human Factors in Electronics, HFE-2, March 1961, pp. 39-55, and reprinted in Computers and Thought, Ed. E.A. Feigenbaum and J. Feldman, pp. 453-524, McGraw-Hill, 1963.

The quote concerns template-based pattern recognition:

And to recognize the topological equivalence of pairs such as those below is likely beyond any practical kind of iterative local-improvement or hill-climbing matching procedure. (Such recognitions can be mechanized, though, by methods which follow lines, detect vertices, and build up a description in the form, say, of a vertex-connection table.)
I like the idea that one of the CAS group read Minsky's paper in 1963 (AI being a hot topic) and the name stuck. I have no way to verify this, but it's a fun conjecture.

More prosaically, Google Scholar also identified this quote from P.P. Gupta and M.W. Humphrey Davies, Proceedings of the IEE - Part A: Power Engineering, Volume 108, Issue 41, October 1961, pp. 383-398 DOI: 10.1049/pi-a.1961.0077:

The floating busbar is always numbered zero, and all the other junctions are then numbered consecutively. The following tables of data are then prepared and punched on a tape: (a) Connection table. ... The connection table defines the configuration of the network. ...
Again, I haven't paid the $23.94 to read this paper and verify the quote.

Minsky doesn't talk about colored edges (that is, "bonds") in the connection table, nor seemingly do Gupta and Davies, so these aren't connection tables in the chemistry sense. Nonetheless, it's also reasonable to conjecture that the term in chemistry was repurposed from electrical engineering.

Want to leave a comment?

If you know of an earlier use of "connection table", know more about the early history of connection tables, molecular graph representations, Zatopleg, or have anything else to say, please leave a comment, or send email to me at dalke@dalkescientific.com.



Calvin Mooers #

Mooers became an information scientist at the time when information science was just getting started. I came across his name in "An efficient design for chemical structure searching" by Feldman and Hodes, JCICS 1975 15 (3) pp 147-152. The paper is basically built on work Mooers had done a few decades earlier, and includes a magic value of 0.69 as the "Mooers limit." I made a mental note to follow up on that later. This essay is a result of that followup, and is a small biography of Mooers' involvement with chemical documentation.

Influence of Mooers: connection tables, screens, and canonicalization

As I read more of the literature, I realized that Mooers had a big influence on the early decades of cheminformatics. He seems to have been the first person to use connection tables for coding molecular information on a computer, the first to describe substructure enumeration-based screens, and the first to consider a canonical representation of a molecule.

Here are some quotes to show you what I mean.

connection table and screens

Perhaps the founding paper of cheminformatics is Ray and Kirsch's "Finding Chemical Records by Digital Computers" in Science 25 October 1957, pp 814-819. It describes a project to use computers for substructure search of the patent database. It has the earliest use I know for "screen" (or "screening device"), as used for substructure search.

It contains two references to Mooers, the first concerning a connection table:

An example of a code suitable for machine searching was described by Mooers in the "Zatopleg" (5) system of ciphering structural formulas. Mooers' method of representing compounds provided the basis for representing the input data in the SEAC structure search routine described below. Methods for actually searching such data had to be developed.
where (5) is C. N. Mooers, Ciphering Structural Formulas – the Zatopleg System (Zator Co., Cambridge, Mass., 1951).

The second concerns a substructure screen, which Ray and Kirsch call "Mooers' N-tuple descriptors":

It has been suggested by Mooers (7) that, for purposes of retrieval, complex structures such as chemical diagrams can be represented in terms of a list of, say, all of the triples of atoms and bonds occurring within the structure. Thus, chloral (Fig. 3) would be described as consisting of combinations of the triples in the following list: Cl-C; C-C; -C-; -C=; C-H; C=O.
where (7) is C. N. Mooers, "Information retrieval on structural content," in Information Theory (Academic Press, New York, 1956), pp. 121-134. (See below for the full citation; there are many books titled "Information Theory".)

Cossum, Krakiwsky, and Lynch in "Advances in Automatic Chemical Substructure Searching Techniques", J. Chem. Doc., 1965, 5 (1), pp 33-35 reaffirm that Mooers' "code suitable for machine searching" is actually a connection table.

The connection table which we use is a development of that first suggested by Mooers, and tested in search by Ray and Kirsch.

canonical connection table

People usually refer to Morgan's "The Generation of a Unique Machine Description for Chemical Structures", J. Chem. Doc., 1965, 5 (2), pp 107-113 as the first paper to describe a molecular canonicalization algorithm. In the paper, Morgan writes:

Since it can be shown that the set is finite for any graph composed of a finite number of nodes, it is possible to select the unique table by generating all members of the set, lexicographically ordering the members of the set based on the characters involved in the description, and then selecting the first member of the resulting list as the unique table. This concept is a restatement of a technique proposed by C. N. Mooers for generating a unique cipher based on a process of making all possible "cuts" and comparing the resulting ciphers. [10, 11].
In this case, [10] is the same "Ciphering Structural Formulas – The Zatopleg System" as before, and [11] is C. N. Mooers, "Generation of Unique Ciphers for a Finite Network," Zator Technical Bulletin No. 49, Zator Co., 79 Milk St., Boston 9, Mass.

I can't help but conclude that Mooers' ideas had a big influence on the early days of cheminformatics.

Speaking of Morgan, I came across a reference to the Morgan canonicalization method in an essay by Charles Davis titled "Indexing and Index Editing at Chemical Abstracts before the Registry System":

Dyson, however, is remembered for having succeeded in lobbying for his system of linear notation, which won the approval of the International Union of Pure and Applied Chemistry (1961). This triumph over the more popular Wiswesser notation was something of a pyrrhic victory since linear notation ultimately would never be important to CAS. Other organizations, especially the Institute for Scientific Information, would go on to use Wiswesser notation, and it became an industrial standard for those who needed linear notation in their work (Davis & Rush, 1974a). However, the principal reason that CAS did not use either system was that during the early 1960s a young mathematician named Harry Morgan developed the famous algorithm that led to the Registry System (Morgan, 1965; Davis & Rush, 1974b).

... Lynch's paper (M. F. Lynch, personal communication, 15 November 2002) expands on events during this era; moreover, he makes it clear that the Morgan algorithm was actually a revised version of the Gluck algorithm developed at DuPont.
When I read Gluck's paper about the Du Pont system, I wondered what algorithm they used in 1964 in order to canonicalize their molecules, since it was a year before Morgan's paper. Now I know! (For what it's worth, the priority rule is based on publication, not use, so we're still correct in citing Morgan, not Gluck.)

Who was Calvin Mooers?

Calvin Mooers and his wife gave an oral history of their - mostly Calvin's - lives. I used that as the scaffolding for this mini-biography.

He worked for the Naval Ordnance Laboratory during World War II. At the time, von Neumann was an advisor for the Navy. Von Neumann wanted a computer, and convinced the Navy to build one. The Navy decided to do so at NOL. Atanasoff, who invented the first electronic digital computer, was put in charge of the project. (Though apparently Atanasoff never talked about his earlier computer work while at NOL, nor was he much of a decision-making manager.)

At about 26, he decided to go to graduate school at MIT. He wanted to apply some of the skills he had from working with computers, considered a few possibilities, and decided to work on library science. His wartime experience showed him that library systems couldn't handle the enormous number of new publications which didn't fit into the existing classification system.

In his history he recounts: "Not long after (at MIT), I went to a lecture by Claude Shannon that he gave about 'information theory.' One of the conclusions of the lecture was that a random process had the statistics required for passing the highest quantity of information." Shannon and Weaver's The Mathematical Theory of Communication (1949) and Norbert Wiener's Cybernetics (1948) mark the dawn of the information theory age. Wiener was also at MIT, and Mooers in one of his papers notes that both books are very interesting.

His Master's thesis, Application of random codes to the gathering of statistical information, describes his "zatocoding" system of superimposed codes, which I'll get back to in a future essay. Appendix Z of the thesis was originally presented at an American Chemical Society meeting.

Indeed, he has a long association with the ACS. At MIT he met the chemist James Perry, who was interested in "chemical literature" and punched card-based information systems for chemical literature. (Punched cards were the new hotness, in part because of the 1945 article by Cox, Baily, and Casey in C&EN, "Punched Cards for a Chemical Bibliography".) Perry arranged things so that Mooers could present his early ideas at the ACS in 1947. This interaction helped lead to Zatocoding. (See The State Of The Library Art, Volume 4 for more about punched card systems from that era, including a summary of the Cox et al. method, mentions of Mooers' influence, and disagreements over the correct mathematical treatment of superimposed codes.)

He continued to be associated with the ACS. For example, he presented "Making Information Retrieval Pay" before the Chemical Literature Division of the ACS in September 1950, which described his audacious plan to index the Library of Congress catalog on a set of Zatocoded punch cards, and build a mechanical search engine that could conduct a search within a few minutes.

(I'll add that he is widely acknowledged for coining the term "information retrieval" and presenting it earlier that year. He also coined the term "descriptor":

Before 1948 the word did not exist, of course, and was not in the dictionary. It's now in the dictionary and most people don't know that it was my neologism. I made it up because I wanted the new word to mean exactly what I described and, unfortunately, that never happened. That is, the word descriptor now means almost anything.
According to the Wikipedia entry for index term, Mooers's definition is in particular used for a preferred term from a thesaurus.)

He started a company to commercialize his ideas. The first sale was to Merck, Sharp and Dohme (the US Merck, still known as "MSD" outside of the US). I assume it was a punchcard-based indexing system for chemical record lookup. In any case, it was in this period around 1950 that he did most of his thinking about applying computers to chemistry records, which resulted in his Zatopleg system.

I think he didn't continue with chemistry in large part because he wanted to work on larger problems of library and information management, and of thinking machines. In the previously mentioned "Making Information Retrieval Pay", he proposed a DOKEN ("documentary engine"), a mechanical retrieval engine capable of searching a 100 million record catalog in 2 minutes, or the 10 million catalogued items of the Library of Congress in 10 seconds. I don't think chemistry documentation was as interesting to him.

In any case, history shows he was right. It took another 10 years before cheminformatics as its own field really started, and there were and are a lot more library systems than molecular database systems.

For various reasons, his business didn't do so well. It seems like librarians didn't like his ideas much. Not only did he want to replace a lot of manual indexing systems with random numbers, but the random numbers didn't make much sense to a non-mathematician. I've read some of his papers, and his style is an unfortunate combination of abstruse and opinionated that could put people on edge. It's also the combination that can energize people. The trick is to energize the people who will pay you money.

Other pioneers

My view is that while his specific background put him ahead of most others in how to think about information theory and computing devices, there were many others very close behind him who were less prickly to deal with.

Hans Peter Luhn

For example, at the suggestion of the Hollerith Company, Dr. Dyson presented to IBM his ideas for using punched cards [containing molecules in Dyson notation]. Accordingly, he and Mr. Peter Luhn (of IBM) built, in 1949, a machine which would sort free field code cards. In this way, the Luhn scanner came into operation. (From the book "Survey of Chemical Notation Systems.") This led to [Luhn's] interest in literary data processing.

"Mr. Peter Luhn" here is Hans Peter Luhn, another pioneer of information science. Among other things, he invented the checksum algorithm used in every credit card. He kept in touch with Dyson. Lunh developed the KWIC permuted index in 1958 or 1960 (sources differ, and this is too much of a tangent to track down). Dyson, who by then was in charge of research and development at Chemistry Abstracts, invited Luhn to visit and to present this work. That's when CAS realized they should be looking towards computers, and then acquired IBM hardware. One result of this was the KWIC index for Chemical Titles, which served as a product to compete with the success of Eugene Garfield's ISI.

Mortimer Taube

Mortimer Taube, who was a librarian before becoming the chief of general reference and bibliography of the Library of Congress in 1945, started Documentation Inc in 1952. He had much closer ties to the world of library science, and to government. Its first customer was the US military, and it provided library services to NASA when NASA was created in 1958. The underlying technology was based on "uniterms", which was presented in a paper titled "Coordinate Indexing of Scientific Fields", delivered at the Symposium on Mechanical Aids to Chemical Documentation, Division of Chemical Literature, American Chemical Society, New York, Sept. 4, 1951.

According to Heting Chu in the book "Information Representation and Retrieval in the Digital Age", Taube introduced Boolean search systems to information retrieval, in the form of coordinate indexing. (On the other hand, other sources say that Mooers was the first to propose using Boolean operations. I think the difference here is between "propose" and "convince customers to use.")

According to Mooers, in his autobiography (which means he has every right to tell it the way he wants to):

What [Taube] did was that he cooked up a simplified variety of my descriptors which he called Uniterms. He was a great salesman and a smooth talker and he charmed the librarians. He had worked as a librarian. So he set up Documentation, Inc. which made quite a commercial splash. Taube's message was that you don't have to worry about the fact that you can't understand Mooers, you do it the Uniterm way, you can understand it, and it's easy. So they flocked in his direction. Well, his methodology can be cynically characterized as follows: How do you index documents? You take a collection of documents in a certain field and you give them to somebody that is not really in that field. You sit him down with a colored pencil and ask him to go through the documents and to underline every term that he doesn't understand [laughs], and to use those underlined terms for index terms. You've heard of key terms, key words? Well, key words are the direct descendants of Mortimer Taube's Uniterms and have the same sort of loose-jointed semi-applicability to the field at hand.
If Mooers thinks that's "loose-jointed" then imagine what despair he might have had with folksonomy.

Chemistry and documentation

Did you notice how Mooers, Luhn, and Taube all had ties with the ACS?

I had always wondered why the "Journal of Chemical Documentation" had that name. I have a better understanding now. The American Documentation Institute started in 1937 with money from the Science Service of the National Academy of Sciences. In the post-war era, the number and rate of scientific publications grew enormously, and especially the number of technical reports. I recall from Eugene Garfield's essays that Chemical Abstracts at the time was years behind indexing the literature. This presented a market opportunity for his ISI (Institute for Scientific Information), which used computers to index the most popular chemical journals.

With funding from the Carnegie Corporation, they started the journal American Documentation. This seemed to be the focal point for a lot of papers in the new field.

The ADI at this time consisted mostly of people with scientific and technical backgrounds. This caused some animosity, as many of the newcomers believed that specific library training "was outdated or unnecessary", while others believed, as the ADI link quotes from elsewhere, "documentation was librarianship performed by amateurs." Also around this time, the American Chemical Society Division of Chemical Literature group started, as one of many more specialist groups. There was a large overlap in the readership between these various organizations.

Then the Soviets launched Sputnik in 1957. The US and other western governments started pouring money into science and technology. Quoting Mike Lynch:

There were great stirrings in science information at that time because of Sputnik, the challenge to the United States from the Soviet Union in October 1957. Sputnik's beep-beep tones took the world totally by surprise. When the dust had settled, it became apparent that the Soviets had published their intentions in the open literature, but the science information system in the West was in disarray. The system had not been considered sufficiently important nor was it well enough funded to keep up with the vast increases in the numbers of scientists employed and publishing in the postwar period. There was said to be a cocktail called Sputnik, one part vodka and three parts sour grapes.

The Journal of Chemical Documentation started shortly after Sputnik, in 1961. My guess, though I've not read any of the American Documentation articles, is that the topics in each field became too specialized to have a single, wide-ranging journal. Quoting the ADI link further, by the end of the 1950s, the 75 papers at the ICSI conference were a prominent display of the multi-faceted nature of the world of documentation and the rich research potential of the field, but they also showed the field lacked a clear synthesis and direction. That's a good reason to start a specialist journal.

(As an example of continued work in superimposed coding, published in that journal, Ronald Kline comments that:

[I]n the mid-1960s, academic researchers applied the mathematics of information theory as heavily as Mooers had done. In 1967 Pranas Zunde and Vladimir Slamecka ... used Shannon's entropy equations to calculate an optimal frequency distribution of descriptors by the number of postings that maximized the use of the index.
where the reference is Zunde, P., & Slamecka, V. (1967). Distribution of indexing terms for maximum efficiency of information transmission. American Documentation, 18, 104-108. Yet another paper for me to follow up on!)

I also guess that the money going into science research meant more money going towards developing computers, which meant more drug companies could have a computer, so there were enough people interested in a narrow topic to make the journal viable. Also, the ADI link points out that [b]y the early 1960s the term documentation was beginning to sound old-fashioned and inadequate in this new world of computers ... The ADI Council first considered a name change in 1963 but the official decision to change the name to American Society for Information Science (ASIS) was not made until 1968. J. Chem. Doc. followed the trend and became the Journal of Chemical Information and Computer Sciences in 1975, only to change its name again in January 2005 to the Journal of Chemical Information and Modeling.

That said, I still don't understand how the ACS, compared with the societies of other fields, came to be so tightly coupled to these three key figures of information science. Anyone know?

Copyright, patents, and trademarks

I get the feeling that Mooers wanted to follow the ideal of the American inventor: a thinker who comes up with ideas, patents them, and makes money by licensing the right to use the patents.

For example, he tried patenting his Zatocoding system. According to his oral history, it took 23 years for the USPTO to grant the patent, which was past the time it was commercially viable.

I double-checked. The granted patent is "Battery controlled machine", US 3521034 A. (I linked to Google instead of the authoritative USPTO because the latter only has images in a hard-to-use interface, while Google has an OCR'ed copy on a single page.) Note the text: "This is a continuation-in-part of application Ser. No. 392,444, filed Nov. 16, 1953, which was a continuation-in-part of application Ser. No. 774,620, filed Sept. 17, 1947."

In his oral history he comments that he:

... was becoming more and more critical of what I could do in the library field. That is, by 1960 there were now computers and "operators" like Herb[ert R. J.] Grosch at General Electric (GE) were moving in, and being the big boss of the computer at a company and were going around looking for business. And the library field was beginning to wake up to the fact [that] there might be something here. You don't take your business to a little hole in the wall like Mooers was operating. You take it to GE or you take it to MIT. There was an "operator" – Overhage at MIT – who set up a big project, INTREX, to solve all the problems for all time of libraries with computers at MIT. Herb Grosch was taking contracts at GE. This was the situation. The result of all of this was that in the mid 1960s, I more or less turned off my public interest to the information and library field, although I kept following it to some extent in private, and turned on my interests in programming languages and TRAC.
It's indeed hard for a lone inventor to compete against someone with close library and government connections (Taube) or big business connections (Luhn). That's where the limited monopolies of copyright and patent might help, but at the time no one, including lawyers, thought they could be used for software. Instead, he filed for a trademark on the name "TRAC", to the dismay of many. Quoting from Mooers:
The first issue of Dr. Dobb's Journal, one of the early publications in the personal computer field, has a vitriolic editorial against Mooers and his rapacity in trying to charge people for his computing language.
Many people still have a fondness for TRAC, because Ted Nelson talks about TRAC in his widely-read Computer Lib. Here's one of the RESISTORS describing the fondness, along with a re-implementation of TRAC. (Which would have annoyed Mooers to no end, had he still been alive.)

Show me the documentation!

For someone so interested in document retrieval, and whose ideas seem to inspire much of core cheminformatics, it's surprisingly hard to read his important papers. The most critical ones (for my interests) were published basically as white papers from his company. They aren't in WorldCat, or accessible by various Google general search engines. These are:

as well as Information retrieval on structured content, pp. 121-134 from Information theory; papers read at a symposium on information theory held at the Royal Institution, London, September 12th to 16th, 1955 (Academic Press, New York, 1956). (It appears that the National Library of Sweden has a copy of this book, so I didn't put it on the list.)

I'll even go so far as to say that any paper published in the last 20 years, which referenced one of those first three citations, is likely making the reference second-hand through other papers, and not from actually reading the original paper. The only possibility I've found to get copies of the papers is to contact the curators at the Charles Babbage Institute. They have 39 boxes of his papers, including those three. I'm waiting for a reply to my email asking how to get a copy.

Zatopleg in the patent literature

Meanwhile, I have some sideways views through the patent records. These are:

US 4118788: Associative information retrieval (1978):

One prior art technique, originated in the 1940's, that is designed to permit associative retrieval in mechanical type systems rather than in conjunction with computers, is sometimes referred to as "Zatocoding". A complete description of a Zatocoding system, including some of the background mathematics, is contained in British Pat. No. 681,902 issued to Calvin Mooers on Oct. 29, 1952.

... While many of the features of the Zatocoding system, including the theory of superimposed coding, may be quite valuable in enabling associative retrieval, it nevertheless remains that the technique was generally oriented toward manual type storage systems and was never expanded so as to be useful in the environment of modern day computers.
(Thanks to an anonymous commenter for the link to the British patent. Follow the link to "Original Document" on the left column to see the text.)

More importantly for my interests, the patent literature is the only Internet-accessible source I've found which describes the Zatopleg system, in US3476311: Two-dimensional structure encoding (1969):

According to the Zatopleg system, a random number is attached to each atom, which number cannot be assigned more than once within a molecule. This would be followed by lists showing which other atoms each atom is linked to. Therefore a Zatopleg code for one atom of a molecule would consist of: first, an arbitrary number assigned to the atom within the molecule; second, an identification number for the kind of atom (e.g. its atomic number); and, third, the numbers of the atoms to which it is connected.
So close, but frustratingly incomplete.

Eugene Garfield's tribute

Eugene Garfield wrote a tribute to Mooers in The Scientist, Vol. 11(4), March 17, 1997. If you've made it this far through my text, you'll be able to understand nearly all of the context of that tribute; something I couldn't have done two weeks ago. (He uses the term "hashcoding" for zatocoding; I'll need to follow up on that as well.)

Garfield pointed out one last difficulty that Mooers had as an independent, for-profit researcher:

I remember resenting the fact that he was "selling" us on a commercial, for-profit product, which I inherently mistrusted. I hadn't yet overcome the idealistic notion that all good things were nonprofit, which was probably a reflection of youthful naïveté. However, my later experiences with large government agencies and nonprofit institutions changed that view. When I had to survive as a private consultant and for-profit entrepreneur, as Calvin did, I appreciated how difficult and frustrating it was to compete with the arrogance and market advantages of these nonprofit establishments.

That's something that I, an independent, for-profit researcher, should bear in mind.

Comments?

This is incomplete research, and I don't know where I'll go with it. There are a lot of books about the history of information science, and I know so very little about the topic. I know there are still many fans of Mooers out there; to them, I hope I did a good job.

My interest is in understanding the evolution of cheminformatics systems, especially machine-based systems. If you have any details or comments to add, please do so, or send me email to dalke at dalkescientific dot com.



8-bit SMILES #

SMILES strings are a compact way to store molecular topology. I would like to use them as the primary storage format for an in-memory compound database which supports substructure search. With a more compact format, I can fit more structures into RAM and also perhaps make better use of cache. There's some trade-off though; it takes time to convert a SMILES string into a usable data structure for subgraph isomorphism testing. I thought of a way to make that conversion a bit faster. Rather than determine ring membership after creating the internal connection table, encode the ring membership in the high bits of the SMILES string, assuming the otherwise 7-bit clean SMILES string is stored as 8-bit bytes.

I thought of two forms of "8-bit SMILES." One is a purely internal format, where the high bit of byte i indicates that bond i is in a ring, and the other is an 8-bit exchange SMILES, where ring membership is portable between different implementations. The former is perhaps practical; the latter is more conceptually interesting than useful. At this point though it's mostly theoretical, as I don't have access to software where this might make a noticeable difference.

The rest of this post describes what I just said, along with a lot of details.

Substructure search needs compound lookup for validation

A long-time reader of this blog will recall the many essays I've written about similarity search, and more specifically Tanimoto search. One result of that work is chemfp, a set of tools for generating and searching chemical fingerprints. Now I'm going to tackle substructure search, which is a much more complicated task.

Most computer-based substructure search systems break the problem up into two phases: a fast screening phase, which uses a pre-computed index to reject obvious mismatches, and a validation phase, which uses the slower subgraph isomorphism tests to remove false positives. The validation phase gets identifiers from the screen, fetches the corresponding target molecules, and tests if the query subgraph exists in the target.

The easiest way to implement this is to load all of the molecules into memory. The "fetch" step then is simply an index or table lookup to get the corresponding molecule object, which is already in a form ready for subgraph isomorphism testing.
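To make that concrete, here's a minimal sketch of the in-memory screen-then-verify loop using RDKit. The file name and record layout are placeholders I made up for the example, and I use RDKit's pattern fingerprints as the screen only because they are intended for this kind of filtering; a real system would use its own screen.

# A minimal sketch of in-memory screen-then-verify, assuming RDKit and a
# "SMILES identifier" file. The names here are placeholders, not a real API.
from rdkit import Chem

def load_targets(smiles_filename):
    # Build the in-memory database as a list of (id, screen, molecule).
    targets = []
    with open(smiles_filename) as infile:
        for line in infile:
            fields = line.split()
            if len(fields) < 2:
                continue
            mol = Chem.MolFromSmiles(fields[0])
            if mol is None:
                continue   # skip records RDKit can't parse
            targets.append((fields[1], Chem.PatternFingerprint(mol), mol))
    return targets

def substructure_search(query_smiles, targets):
    query = Chem.MolFromSmiles(query_smiles)
    query_fp = Chem.PatternFingerprint(query)
    for id, target_fp, mol in targets:
        # screening phase: every query bit must also be set in the target
        if (query_fp & target_fp) != query_fp:
            continue
        # validation phase: the slower subgraph isomorphism test
        if mol.HasSubstructMatch(query):
            yield id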

Loading a large database takes time

There are two main problems with that. First, the database needs to load all of the molecule objects before doing any searches. This may take some time, which is okay for a chemistry database server, where server starts are rare and where the startup cost is essentially invisible. It's more of a problem when you want to embed substructure search in other tools, like command-line programs, or directly in your web server.

The startup cost can be mitigated in a few ways. The input structures can be converted to a format which is faster to load; if the structures are stored in a random-access file format (I like cdb and sqlite, but of course any database would also work) instead of a streaming format like a SMILES or SD file then the records can be loaded only when needed; and once read, the resulting molecule can be kept in memory for later reuse.
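As a sketch of what the lazy-load-plus-cache idea might look like with sqlite3, here is one way to do it. The table name and columns ("molecules(id, smiles)") are assumptions made up for this example; any keyed store works the same way.

# Lazy loading plus caching, assuming a SQLite database with a hypothetical
# "molecules(id, smiles)" table. Only requested records are parsed, and a
# parsed molecule is kept in memory for later reuse.
import sqlite3
from rdkit import Chem

class LazyMoleculeStore(object):
    def __init__(self, db_filename):
        self._conn = sqlite3.connect(db_filename)
        self._cache = {}   # id -> already-parsed molecule

    def get_molecule(self, id):
        mol = self._cache.get(id)
        if mol is None:
            row = self._conn.execute(
                "SELECT smiles FROM molecules WHERE id = ?", (id,)).fetchone()
            if row is None:
                raise KeyError(id)
            mol = Chem.MolFromSmiles(row[0])
            self._cache[id] = mol
        return mol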

A large database takes a lot of memory

The second main problem is that molecule objects take up space. My desktop has 6 GB of memory, and PubChem has about 50 million compounds. That means the average compound must take up less than 128 bytes for the identifier, molecule, and the screen. Most substructure screens are 1024 bits or larger, which is 128 bytes right there! (In a later post I'll point out one technique which might reduce the screen to 20 or 24 bytes.)

There are also ways to mitigate this overhead. The molecule's connection table representation can be made more compact. This can include shaving off bits for the atom and bond information. There's little need to have a 32 bit integer or even 8 bit integer for atomic charge when only 4 bits are needed to represent the charges -8 to +7. Nor is there need for an 8 byte pointer to each atom and bond object when a 2 byte table index is all that's needed for 64K atoms or bonds.
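To give a feel for the numbers, here's an illustrative packing of one atom record into four bytes. The field choices are my own for this sketch, not any toolkit's actual layout.

# An illustrative 4-byte atom record: an 8-bit atomic number, a 4-bit formal
# charge (-8..+7), a 4-bit implicit hydrogen count, and a 2-byte index of the
# atom's first bond. These fields are assumptions for the sketch only.
import struct

def pack_atom(atomic_num, charge, num_h, first_bond_index):
    charge_and_h = ((charge + 8) << 4) | num_h    # two 4-bit fields in one byte
    return struct.pack("<BBH", atomic_num, charge_and_h, first_bond_index)

def unpack_atom(record):
    atomic_num, charge_and_h, first_bond_index = struct.unpack("<BBH", record)
    return atomic_num, (charge_and_h >> 4) - 8, charge_and_h & 0x0F, first_bond_index

print(len(pack_atom(6, 0, 3, 17)))   # 4 bytes per atom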

If you instead want a full connection table then you might store it in a compressed form (doing this means you can't use pointers), and decompress it when needed. As a simple test, zlib compresses RDKit's binary molecule format by about 50%. You might also consider Google's Snappy or Blosc.
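That test is easy to repeat on your own machine; the exact sizes will depend on the structure and the RDKit version.

# Compress a single RDKit binary molecule with zlib, then round-trip it.
import zlib
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
binary = mol.ToBinary()
compressed = zlib.compress(binary)
print("%d -> %d bytes" % (len(binary), len(compressed)))

# rebuild the molecule from the decompressed binary format
mol2 = Chem.Mol(zlib.decompress(compressed))
assert mol2.GetNumAtoms() == mol.GetNumAtoms()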

Unfortunately, it's hard to do this on your own. The chemistry toolkit developers follow the encapsulation principle, which means you as a user of the toolkit don't have direct access to the underlying data structure, but only have access through some API.

If you use OEChem then you're in luck - their API has a "DBMol", which compresses the internal representation. You need to call UnCompress before you can use it. In the other toolkits I know about, the closest you can get is to store the molecule in one of the supported toolkit formats, preferably one that's compact and fast to load. For RDKit that's the binary molecule format:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C=O")
>>> s = mol.ToBinary()
>>> s
'\xef\xbe\xad  [...removed for clarity...] \x00\x00\x00\x16'
>>> len(s)
60
>>> mol2 = Chem.Mol(s)
>>> mol2.GetNumAtoms()
2
(The other toolkits have similar binary formats.)

SMILES as the internal format

As you might have noticed from the previous section, RDKit's binary format is large compared to the original SMILES string. The three byte formaldehyde SMILES from the example above produced a 60 byte result. Using SMILES strings from ChEMBL-16, the average SMILES string was 52 bytes long while the zlib compressed molecule byte string was 215 bytes long and the uncompressed molecule byte string was 422 bytes long. (zlib compression of the SMILES string gives another ~15% compression, but if you go this route you should consider a special encoder using a prebuilt dictionary for SMILES, along the lines of SMAZ.)
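A measurement along these lines should reproduce that kind of comparison. The file name "chembl.smi" is a placeholder for a SMILES export with one "SMILES identifier" record per line.

# A sketch of the size comparison over a SMILES file.
import zlib
from rdkit import Chem

num_records = smiles_bytes = binary_bytes = compressed_bytes = 0
with open("chembl.smi") as infile:
    for line in infile:
        smiles = line.split()[0]
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        binary = mol.ToBinary()
        num_records += 1
        smiles_bytes += len(smiles)
        binary_bytes += len(binary)
        compressed_bytes += len(zlib.compress(binary))

print("average SMILES: %.1f bytes" % (smiles_bytes / float(num_records)))
print("average binary: %.1f bytes" % (binary_bytes / float(num_records)))
print("average compressed binary: %.1f bytes" % (compressed_bytes / float(num_records)))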

Assuming 55 bytes/SMILES, 24 bytes/screen, 10 bytes/id, and 32 bytes for hash table overhead means each record in the database takes only 121 bytes, which is just barely within my memory budget of 128 bytes. My goal can be achieved – if I can keep everything as SMILES, or a similarly compact notation.

(Using a canonical SMILES is likely better than using an alternative format, because odds are you'll want to return the SMILES to the caller. Since you have the SMILES already, you don't need to do another lookup. Also, it's probably a better use of your time to optimize SMILES parsing, which has broad functionality, than to develop and optimize a new format.)

There's long precedent for using SMILES as the primary structure format. Daylight's database systems (Thor and Merlin) did this. They provided the first generation of in-memory chemical databases, and from what I heard, the result was about 100x faster than the previous disk-based systems, because memory is so much faster than disk.

You can even see some evidence of that in the literature. John Barnard's excellent review paper "Substructure Searching Methods: Old and New" in JCICS (1993), p. 535 in the section "Hardware Solutions" starts:

The modern availability of computers with very large amounts of random-access memory has provided an opportunity for further application of the superimposition principle in the substructure search system marketed by Daylight Chemical Information Systems Inc. The reduction in the length of the bit string (or "fingerprint") required for each compound enables the entire bit map, even for very large databases, to be held in memory and, thus, permits very rapid screening searches.
That's not quite the verification I wanted, since it only talks about in-memory screens, but it's the best I can find so far.

SMILES parse performance

The question is, is the SMILES parser too slow for this task? That depends very much on the toolkit, or rather, the skills of the toolkit authors, the time they spent working on optimization, and how they decided to decompose the various stages of SMILES parsing.

Take for example OEChem. Its OEParseSmiles handles a blistering 60,000 SMILES strings per second. Not only did they spend a lot of time making the code go fast, but the parser does the bare minimum needed to parse the SMILES string. For example, this low-level API call will not modify the input aromaticity, so if the input benzene is in Kekule form then the molecule object will have single and double bonds instead of aromatic bonds. (You need to call another function if you want to perceive a certain aromaticity model.) I believe it also uses lazy initialization of some of the internal data structures, so timing OEParseSmiles isn't enough to get a real understanding of the overall performance.
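For those who haven't used it, the low-level route looks something like the following sketch, which assumes the OpenEye Python toolkit is installed and licensed. It illustrates the point that parsing and aromaticity perception are separate steps.

# OEParseSmiles leaves the bonds as written; applying an aromaticity model
# is a separate, explicit call.
from openeye import oechem

mol = oechem.OEGraphMol()
oechem.OEParseSmiles(mol, "C1=CC=CC=C1")   # benzene in Kekule form
print(any(bond.IsAromatic() for bond in mol.GetBonds()))   # False - bonds kept as written

oechem.OEAssignAromaticFlags(mol)          # now apply the default aromaticity model
print(any(bond.IsAromatic() for bond in mol.GetBonds()))   # True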

By comparison, RDKit's MolFromSmiles does a more sedate 2,000 SMILES per second. Much of the time is spent doing sanitization, such as aromaticity perception. If you pass in benzene in Kekule form then RDKit will convert it to aromatic form. It also builds up ring and ring size information; see the above link for details.

Disabling RDKit sanitization

RDKit's default is good for a program which accepts arbitrary SMILES strings. But a chemistry database gets to specify its own internal format. It can ensure that the SMILES strings in the database are valid, have the right aromaticity assignment and chirality, and so on. There's little need for sanitization overhead in this case.

RDKit recognizes this. MolFromSmiles has a 'sanitize' parameter which, when false, disables sanitization. This brings the performance up to 8,500 SMILES strings per second. Unfortunately, some SMARTS queries depend on information that's normally computed during sanitization:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C1=CC=CC=C1", sanitize=False)
>>> Chem.MolToSmiles(mol)
'C1=CC=CC=C1'
>>> query = Chem.MolFromSmarts("[R]~[R]")
>>> mol.HasSubstructMatch(query)
[13:11:16] 

****
Pre-condition Violation
RingInfo not initialized
Violation occurred on line 36 in file /Users/dalke/envs/RDKit_2014_03_1_py27/Code/GraphMol/RingInfo.cpp
Failed Expression: df_init
****

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Pre-condition Violation
There are similar problems for ring size-based queries like "[R3]" and valence-based queries like "[v4]". RDKit has alternative methods to set some of these internal data structures, like the following:
>>> Chem.FastFindRings(mol)
>>> mol.HasSubstructMatch(query)
True
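If you want to check the tradeoff on your own machine, a rough timing comparison looks something like this sketch; the numbers will differ by machine, structure, and RDKit version.

# Compare full sanitization against sanitize=False plus FastFindRings.
import timeit
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin

def parse_full():
    return Chem.MolFromSmiles(smiles)

def parse_fast():
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    Chem.FastFindRings(mol)   # enough for ring membership queries like [R]
    return mol

for name, func in [("full sanitization", parse_full),
                   ("sanitize=False + FastFindRings", parse_fast)]:
    seconds = timeit.timeit(func, number=10000)
    print("%s: %.0f SMILES/second" % (name, 10000 / seconds))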

Encode ring membership in the SMILES

My point is that both RDKit and OEChem must do some sort of ring-finding, if that information is needed for the substructure query, because that information isn't directly expressed in the SMILES string. It's one of the tradeoffs between a compact representation and one which is directly usable for substructure search.

SMILES strings contain only printable ASCII characters. Each character only needs 7 bits, but is often stored in a byte with 8 bits. There's an extra bit available! What about encoding the ring membership information in that bit? I came up with two possibilities.

Tag bonds by byte position

The first is based on the fact that a SMILES string with N bonds has at least N+1 characters. Proof: every bond requires either an additional atom symbol (adding at least one character) or a ring closure (adding at least two characters). Set the high bit of byte i to 1 if bond i is in a ring, otherwise set it to 0. Obviously, both atoms at the end of a ring bond are also in a ring.

Very simple, yes? It also has the advantage that the correct SMILES is trivially generated by masking the high bit to 0. It's even possible to experiment with parts of this idea on your own. Use your toolkit of choice to make a SMILES string, then parse the string back into a molecule. Use the resulting bond list to set the SMILES string bits.
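Here's a sketch of that experiment in Python 3 with RDKit. It follows the recipe just described: write a canonical SMILES, re-parse it to get a bond order, then tag the bytes.

# Bond-tagged 8-bit SMILES: set the high bit of byte i when bond i (in the
# re-parsed molecule's bond order) is in a ring.
from rdkit import Chem

def bond_tag_smiles(smiles):
    cansmi = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
    mol = Chem.MolFromSmiles(cansmi)   # re-parse to get a bond order for cansmi
    data = bytearray(cansmi, "ascii")
    for bond in mol.GetBonds():
        if bond.IsInRing():
            data[bond.GetIdx()] |= 0x80   # tag byte i for ring bond i
    return bytes(data)

def strip_tags(tagged):
    # Recover the ordinary 7-bit SMILES by masking off the high bits.
    return bytes(byte & 0x7F for byte in tagged).decode("ascii")

if __name__ == "__main__":
    tagged = bond_tag_smiles("OC(=O)c1ccccc1")   # benzoic acid
    print(repr(tagged))
    print(strip_tags(tagged))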

Unfortunately, single pass SMILES generation (that is, without reading the SMILES back into a molecule to get the bond ordering) isn't possible without toolkit support, since I don't think any toolkit exposes that information into user space. Not that it makes a difference, since there's no way to make your own reader; the toolkits don't have a way to modify their internal ring membership state.

It's also not portable, since there's nothing in the SMILES spec which requires that toolkits preserve bond ordering.

By the way, I told a lie when I said that the number of bonds was less than the number of characters in the SMILES string. Most toolkits interpret "[ReH9-2]" as a single rhenium atom with an implicit hydrogen count of 9, but a few treat it as a rhenium atom with a bond to each of 9 hydrogen atoms. That's 9 bonds in a SMILES with 8 characters, so it's not even possible to experiment with 8-bit SMILES using those toolkits.

OEChem and RDKit support for 'foreign' aromaticity modes

Before I get to the second possibility, a bit about aromaticity models. The Daylight toolkit supports a single aromaticity model. All input structures are reperceived according to that model. But different toolkits have different models, and some file formats are based on toolkit-specific aromaticity assignments. If you want to work with multiple toolkits then you need a way to work with 'foreign' aromaticity, and not just the toolkit's default.

As you saw earlier, OEParseSmiles keeps the original bond type assignments, and RDKit can be convinced to do the same. But there's an ambiguity in how to handle implicit bonds in SMILES if there's no aromaticity perception stage.

Consider biphenyl, "c1ccccc1c1ccccc1". Its bonds are all implicit, which means each is either a single bond or an aromatic bond, and the toolkit needs to figure out which is which. What are the rules for how to interpret a foreign aromaticity model? In this case, it's possible to do a ring search, detect that one of the implicit bonds is not in a ring, and determine that it's a single bond. For the others, it can assume that an implicit ring bond between two aromatic atoms is itself aromatic.

But ring detection isn't enough. Consider "Cn1ccc2nccc2c1", where the fusion bond between the two rings is not aromatic even though the bond is in a ring and the end atoms are aromatic. A smart enough reader could then apply chemistry knowledge to resolve the ambiguity, but "smart enough" readers are more complex, and at some point there will be disagreements on which interpretation is correct.

Instead, good SMILES writers follow a simple rule. A single bond whose two terminal atoms are both aromatic must be represented with an explicit "-". A non-validating reader which sees an implicit bond need only check the two terminal atoms. If both are aromatic then it's an aromatic bond, otherwise it's a single bond.

Under this rule, the biphenyl from before is better written as 'c1ccccc1-c1ccccc1' and the structure with two fused rings is better written as "Cn1ccc-2nccc2c1" instead of "Cn1ccc2nccc2c1". In that way, another toolkit can read the SMILES and unambiguously support the foreign aromaticity model.
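On the reader side, that rule reduces to a single check; here's a sketch, where "aromatic" means the atom was written in aromatic (lowercase or aromatic bracket) form.

def implicit_bond_symbol(begin_is_aromatic, end_is_aromatic):
    # An implicit bond between two aromatic atoms is aromatic;
    # every other implicit bond is a single bond.
    return ":" if (begin_is_aromatic and end_is_aromatic) else "-"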

Tag atoms and explicit bonds; work around implicit ring bonds

My other 8-bit SMILES idea is portable, because it doesn't depend on a specific atom or bond parser order, but it's more complex to implement. The basic idea is to mark each ring atom and bond by setting the high bit of its corresponding SMILES symbol to 1. This is trivial for the atoms, because each atom must have at least one symbol in the SMILES, and it's trivial for explicit bonds, but what about the implicit bonds - how do we know which is in a ring and which isn't?

I considered making each ring bond explicit, but that was too much overhead. Instead, I'll use a rule which mirrors the aromaticity rule: if a bond which could be implicit is not in a ring but its terminal atoms are ring atoms then the bond must be made explicit. A quick survey using ChEMBL shows that on average the SMILES strings grow by 0.5 character, to incorporate the extra "-"s needed for this scheme.
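Here's a partial sketch of the atom-tagging half, leaving out the explicit-bond handling. It assumes that RDKit assigns atom indices in order of appearance when parsing a SMILES string, so the j-th atom token in the canonical SMILES corresponds to atom j in the re-parsed molecule; the regular expression covers the common atom tokens but is not a complete SMILES tokenizer.

# Atom-tagged 8-bit SMILES (atoms only): set the high bit of the first byte
# of each ring atom's token. Assumes RDKit atom indices follow reading order.
import re
from rdkit import Chem

# bracket atoms, two-letter organic-subset atoms, then single-letter atoms
_atom_pat = re.compile(r"\[[^]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def atom_tag_smiles(smiles):
    cansmi = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
    mol = Chem.MolFromSmiles(cansmi)   # atom i <=> i-th atom token in cansmi
    data = bytearray(cansmi, "ascii")
    for atom_index, match in enumerate(_atom_pat.finditer(cansmi)):
        if mol.GetAtomWithIdx(atom_index).IsInRing():
            data[match.start()] |= 0x80   # tag the token's first byte
    return bytes(data)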

What about SSSR?

The two proposals use the available bit to specify ring membership, but most toolkits implement the more complex SSSR - "Smallest Set of Smallest Rings." As you can see from the RDKit timings, SSSR perception takes about 15% of the sanitization overhead, while detecting ring bond membership takes about 1/5th of that time. Thus, my proposals would only save a few percent of the overall time.

But then again, I don't think we should worry about SSSR.

There are two camps about SSSR. The old papers say things like "SSSR is very important; it reflects the way that chemists really think." Daylight comes from this era, and most toolkits follow its lead. The only thing is, there are many different SSSR implementations. My SSSR analysis from a few years ago convinced me that SSSR is not worthwhile. In short, where it was possible to have multiple choices for the SSSR, there was little toolkit consensus on which was the right set.

I'm decidedly in the OpenEye camp - don't use SSSR. This has a few downstream consequences, the most obvious being that I agree with OpenEye's decision to redefine the "R<n>" SMARTS term. According to Daylight, "R2" selects atoms which are in 2 SSSR rings. According to OpenEye, "R2" selects atoms which have at least 2 ring bonds. (As an approximation, Daylight's "R2" is OpenEye's "R3".)

You can see how this change to the SMARTS semantics - which I agree with - gives a definition which is more portable, easier to implement, and requires less setup time to get the parsed molecule into a usable state.

Bear in mind though that it's unlikely that RDKit will change the way it does things, even if it can get a ~15% speedup by getting rid of SSSR. And yes, this lack of SSSR perception is one of the reasons why OEChem's SMILES parser is so much faster than RDKit's.

Which is better?

I'll call these "bond tagged" and "atom tagged" 8-bit SMILES, respectively. I originally came up with the atom tagged format because its portability fits in with the underlying goal that SMILES is an implementation-independent line notation for valence bond chemistry. The problem is that portability comes at a cost. For example, either the 7-bit output SMILES contains an occasional extra "-" in its canonical format, or there's a decidedly non-trivial conversion function to remove the occasional "-" characters. By comparison, the high bit stripping function for bond tagged SMILES is just a few lines of code.
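As a rough illustration of how small that stripping function can be, here is a sketch which assumes the bond tagged 8-bit SMILES is a Python 3 byte string whose ring symbols have the high bit set; it is not taken from any existing implementation.

def strip_ring_tags(tagged):
    # Clear bit 7 of every byte to recover the ordinary 7-bit SMILES.
    return bytes(b & 0x7F for b in tagged)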

The thing is, I can't come up with a good reason to have a portable 8-bit SMILES. Effectively every system will re-canonicalize the SMILES to put it into the database for record lookup, because there is no standard SMILES canonicalization algorithm. There might be some very clever programs which can reason directly on the SMILES strings without making a full-blown connection table, but overall it seems like a lot of work for little gain.

One observation is that it's easy to count the number of ring bonds in bond tagged SMILES, and ring atoms in atom tagged SMILES, by just doing a scan of the bytes. I speculate this might be useful.
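For example, under the same assumption that the tags live in the high bit, the count is a single pass over the bytes:

def count_tagged_symbols(tagged):
    # Counts ring bonds (bond tagged SMILES) or ring atoms (atom tagged SMILES).
    return sum(1 for b in tagged if b & 0x80)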

Is it useful?

That's really the question, isn't it? I don't know. For RDKit it won't make a difference - ring membership identification is about 3% of the overall time, so it doesn't seem like it's worth speeding up. But for something like OEChem ... well, it's proprietary code and I don't know how much of its time is spent in ring detection.

It will be many years before I can get to the stage where I can try out this idea. Until then, this page is meant as a thought experiment for what might be possible.



Varkony Reconsidered #

This is an expanded version of the talk "Varkony Reconsidered: Subgraph enumeration and the Multiple MCS problem", which I gave at the 6th Joint Sheffield Conference on Chemoinformatics on 23 July 2013. (In the programme I called it "Varkony Revisited", but I later decided that "Reconsidered" was a more appropriate term.)

The major goal of the presentation was to show how the 1979 paper of Varkony, which is part of the MCS literature, is better seen as an early paper in frequent subgraph mining. My own MCS algorithm, fmcs, is a variation of that approach, and is best seen as an intellectual descendant of an algorithm by Takahashi. Both the Varkony and Takahashi papers are relatively uncited in the MCS literature, so I spend some time explaining how they fit into the larger context.

The slide number references refer to the accompanying PDF slides. See the slides for full citation references and images.

Early MCS history

[slide 1]
Today I'm going to talk about the maximum common substructure problem.

[slide 2]
Here's an example of the MCS problem in its simplest form. You have two molecules and you want to find the largest connected substructure which is common to both structures. I won't be talking about disconnected substructures.

[slide 3]
Talking about the MCS problem here is a bit like bringing coal to Newcastle. It seems that half of the significant papers in the field came from this university. One of the earliest ones is this 1967 paper by Armitage et al.

[slide 4]
Armitage proposed using the MCS as a way to identify a reaction mapping, that is, to identify the conserved subgraph between the product and the reactant.

[slide 5]
If you read the paper you'll see that it also talks about using the MCS as a measure of chemical similarity, that is, a larger MCS should indicate that two molecules are more similar than a pair with a smaller MCS.

[slide 6]
This early effort wasn't successful. It only handled acyclic molecules, in part because of the very limited computer hardware they had. They report that their Ferranti Mercury computer had 480 words. I researched the computer, and it looks like it was built around 1960, and had 10-bit words. It was the first Ferranti with valves instead of tubes.

Multiple structure MCS and Varkony

[slide 7]
I'm actually not as much interested in the pairwise MCS as I am in the multiple structure MCS, where there are two or more structures.

[slide 8]
The 1970s and 1980s saw a lot of work on the MCS problem, and the problem space was pretty well developed. You can see that in this 1979 paper by Varkony et al.

[slide 9]
In it they motivate the reason for working on the multiple structure MCS: "molecules displaying similar activities do so because they share some similar feature or features." Given a set of compounds, the frequent presumption is that the MCS contributes to the activity.

[slide 10]
By the late 1970s we knew the main ways to compare the atoms and bonds in the molecular graphs. For example, you may want carbons to match only carbons, but have fluorines match any other halogen. In the general case, the chemist can define atom types so that atoms match only other atoms of the same type.

Bond comparisons have similar flexibility. You might require bonds to match only other bonds of the same type, or allow them to match any bond type. You can also look at ring membership, so that ring bonds only match ring bonds and not chain bonds.

The Varkony paper mentioned all of those types of topology comparisons, and they are still present in nearly all modern MCS tools, though of course most add additional refinements.

[slide 11]
This talk is titled "Varkony Reconsidered" because I think it sets the early groundwork for an approach that I and a few others have taken in working on the MCS problem. In order to set the stage, I think it's important to describe the Varkony algorithm.

It starts by assigning atom and bond labels to all of the structures and ordering the structures in a way that should improve the overall performance.

For each structure, get the list of subgraphs with just two atoms and one bond. These are called the nuclei, to "convey the sense of growth ... in subsequent steps." Canonicalize the nuclei, which is easy in this case because there are only two ways to represent a subgraph with two atoms.

Now that you have the SMILES string ... sorry, I'm a modern cheminformatics researcher so I think of these as canonical SMILES strings for the subgraphs even though SMILES wasn't published for almost another decade ... now that you have the canonical subgraphs, check to see which ones are present in every structure. If a bond isn't in all of the structures then mark it as "do not participate", so it won't be used for future nuclei growth.

For the nuclei which are present, go back to the original graph and grow the original subgraphs by one atom, in every possible way. This will give you all subgraphs of size 3 atoms, where each of those subgraphs contains a subgraph of size 2 atoms that is present in all of the structures.

Canonicalize those subgraphs of size 3 and only use those subgraphs which are common across all of the molecules. Keep repeating this process until no more common subgraphs can be found.

This finds all of the common subgraphs, so the MCS is (or are) the final common subgraphs in the process.

[slide 12]
It's pretty easy to see that this method will work, but that it might not be so fast. Still, using 1970s hardware they were able to find the MCS of three marine sterols in between 45 seconds and several minutes, depending on how it was parameterized.

As a side note, it's interesting to read these papers to see how the technology has changed over time, and to see how the implementation language affects the algorithm and timings. In this case they used two different programming languages. Canonicalization was done in Algol while the rest of the code was in the much slower INTERLISP. I interpret this as the authors saying that their algorithm is good, and asking us to not take the timings at face value because the implementation gets in the way.

The computer was a 1972-era PDP-10 hosted by the SUMEX facility at Stanford. The SUMEX (Stanford University Medical Experimental Computer) project promoted research applying artificial intelligence to bio-medical problems, and made their hardware available to others over the network.

Some other MCS approaches

McGregor

[slide 13]
Just a few years later McGregor published a paper on how to apply backtracking to the pairwise MCS problem. Backtracking was not a new programming concept, but in my reading of the MCS literature the McGregor paper was the one which specifically promoted it as a technique for the MCS problem, and it influenced a lot of the subsequent work.

Basically, start by mapping one edge from the first structure to the second. Extend the edge in the first structure and look for a corresponding extension in the second. There can be multiple possibilities, so pick one, but keep track of the other possibilities. Keep repeating the process until it's not possible to grow any further.

This gives a maximal common substructure. "Maximal" here means that there's no larger common subgraph which includes that maximal common substructure, but there might be another maximal substructure which is even larger. A maximum common substructure is the largest maximal substructure.

[Note: The "maximal" terminology dates to at least Cone et al. 1977.]

At this point, the backtracking algorithm kicks in. It backs up and tries an alternative possibility. It keeps going forward and backtracking until it's done.

Backtracking can take a lot of time because the algorithm will find all of the ways to map any subgraph from the first structure to the second.

[slide 14]
However, if the current best maximal common substructure is of, say, size 10 atoms, then you know that any larger MCS must have at least 11 atoms. If at any point in the backtracking you can prove that there's no way to match 11 atoms then there's no reason to continue searching that part of the search tree.

Pruning makes backtracking effective.

[Note: pruning was not first introduced in the McGregor paper. Cone et al. in 1977 mention pruning as part of DENDRAL, but most papers reference McGregor as the source for the general backtrack and pruning for the MCS search in cheminformatics.]

[slide 15]
I was able to find some information about how the McGregor algorithm was first implemented. It was written in ALGOL68 for the University of Sheffield's ICL computer, which was about 5 years old by the time McGregor developed the software. It had 24-bit words, and it looks like subgraphs were represented in the algorithm using a word as a bitset for atoms and bonds, which meant the implementation was limited to at most 24 atoms and 24 bonds.

It took about 9 seconds on average to find the pairwise MCS between "a simple file of 140 reactions."

In my literature research, I found a 1986 paper which still used the ICL 1906S computer. This shows that the hardware was in use at Sheffield for over 15 years. Humorously, you can see how it comments that ALGOL 68 was "not a particularly efficient language" for the 1986 algorithm, reflecting similar language from the Varkony paper.

Clique detection

[slide 16]
There's another technique to find the MCS by reducing it to the maximum clique problem.

I'm not going to go into the details of the clique approach. I find that it's easier for me to think about backtracking than the equivalent approach of finding cliques.

[Note: The usual citation for this is to Cone et al. in 1977, though Cone cites G. Levi's 1973 paper in Calcolo titled "A note on the derivation of maximal common subgraphs of two directed or undirected graphs." As that paper costs US$40 to download and is in an Italian applied math journal which at least my local library doesn't have, I've only read the first two pages, which are available online for free. That's enough to see that it's primarily a math paper and not a cheminformatics paper, which is why people usually cite Cone. Interestingly, the word "clique" does not exist in that paper. Who first said "clique" in a cheminformatics paper?]

[slide 17]
Let's jump forward a couple of decades. In 2002, Raymond and Willett published a very informative review paper on the various MCS solutions. This is after Raymond had implemented the RASCAL MCS program, which uses the clique approach.

[slide 18]
In the paper they recommend using the clique approach, in large part because mathematicians and computer scientists have spent a lot of time figuring out how to make clique detection be fast.

Papers can be viewed as a very long, slow conversation about a topic. Cao et al. point out that certain types of non-transitive comparison are hard to do with a clique algorithm. "For example, allowing a bromine atom to be matched to a chlorine atom when they are attached to an aromatic ring, but not elsewhere, is much harder to do with association graphs." Cao also points out that there have been improvements to the graph backtracking methods, such as the VF algorithm. In other words, just because Raymond and Willett might have been right in 2002, that doesn't mean that a clique solution is the best one for now.

[Note: libmcs from ChemAxon uses the clique approach plus heuristics to implement an approximate pairwise MCS. In the poster titled "Making the Most of Approximate Maximum Common Substructure Search", which was next to mine at the Sheffield conference, Péter Kovács discussed the effort it took to handle non-transitive comparisons.]

Multiple structure MCS through pairwise reduction

[slide 19]
The Raymond and Willett review paper focuses on pairwise MCS algorithms and not the multiple structure MCS. They point out that most methods can be extended to handle a collection of graphs.

This is true. Any pairwise method which finds maximal substructures can be made to support multiple structures through a reduction method: the maximal common substructures of the first pair are used as reference structures to compare against the third structure, and so on. This can be combined with pruning methods to make the search practicable.

[Note: I developed this pairwise reduction technique for my own MCS algorithm in the late 1990s. Researching it now, it looks like Denis M. Bayada and A. Peter Johnson published this approach in "Detection of the Largest Patterns Common to a Set of Molecules, Using 2D Graphs and 3D Skeletons", which is a chapter in "Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches, 243-265." As a side-side note, it also looks like they didn't quite get the key differences between Varkony and Takahashi, but I haven't yet talked about that topic.]

[slide 20]
Pairwise reduction is an indirect solution. There's a straightforward direct extension of the clique program for multiple structures, but it takes memory which grows exponentially with the number of molecules. This problem was resolved in the MultiMCS paper of Hariharan et al.

My two MCS implementations

[slide 21]
My first attempt at the MCS problem was in the late 1990s. I worked for a company called Bioreason. We developed a method to combine the MCS with clustering to build a dendrogram "where each node contains a set of molecules and a chemical fragment common to the molecules." I wrote that MCS code using a pairwise backtracking algorithm combined with the reduction technique to handle multiple structures.

[Note: Actually, this was the second MCS implementation at Bioreason. Brian Kelley wrote the first, as an approximate method based on a genetic algorithm. We used this method during testing, to make sure that the GA never found a larger common substructure than the exact method.]

The experience with that project led to ClassPharmer, which in turn became the basis for MedChemStudio, available from Simulations Plus. Even though there's probably no remaining code of mine in the project, I'm still pleased that there's a direct lineage from code I wrote in the 1990s to a tool which people are still using.

fmcs

[slide 22]
In mid-2011, Roche asked if I would implement an MCS method for RDKit. I no longer had access to my original MCS work, but that's okay because I had been working on a new idea based on subgraph enumeration. The result of this is 'fmcs', which is now distributed as part of RDKit.

It has a pretty standard set of MCS options. It defines a common substructure through a SMARTS pattern definition, which gives pretty good control over how atoms and bonds should be matched. Arbitrary atom class assignment is possible through a bit of a hack; you can overload the isotope field with an arbitrary integer label to represent the atom class type. Sadly, there is no corresponding hack to define arbitrary bond classes, but that need hasn't come up in my discussions with scientists, nor been mentioned in previous MCS papers.

[slide 23]
The most novel difference in fmcs is that it can find the MCS which matches at least M of N molecules. If M=N then this is the normal multiple structure MCS. For example, the MCS between all of vanillin, zingerone, and capsaicin has 9 heavy atoms ...

[slide 24]
... while the MCS which is in at least two of those structures has 13 atoms.

[slide 25]
The fmcs algorithm is similar to that of Varkony. I pick a reference molecule from the set and generate its subgraphs. For each subgraph, I convert it into a SMARTS pattern, then do a subgraph isomorphism check to see if it matches the other molecules.

The largest SMARTS pattern that matches is an MCS for all N structures. Of course, if M<N then I need to enumerate subgraphs from a few more reference structures, while on the other hand I don't have to check all of the other molecules to see if a given subgraph SMARTS matches. The major difference at this level is that Varkony uses canonicalization to determine if two subgraphs match, while I use a subgraph isomorphism search. I chose this because I could build on top of the existing VF2-based SMARTS matcher in RDKit to do the backtracking search.

Just like with the Varkony method, you can intuitively see that it will work, but it might be slow. The methods to make it fast are nearly identical to how to speed up any other MCS program.

[slide 26]
This is a new algorithm, and it's my algorithm, so I'll go into it in a bit more detail.

(1) First, I create atom and bond types. These are based on the atom and bond SMARTS. The atom type is just the atom SMARTS, while the bond type is a canonical SMARTS-like string built from the atom type of one end, the bond SMARTS, and the atom type of the other end; of the two possible orderings, use the alphabetically smallest one.
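Here is a minimal sketch of that bond typing step. The parameter names are illustrative stand-ins for the two end-atom SMARTS and the bond SMARTS; this is not fmcs's actual internal code.

def canonical_bond_type(atom1_smarts, bond_smarts, atom2_smarts):
    # Build both orderings and keep the alphabetically smallest one.
    forward = atom1_smarts + bond_smarts + atom2_smarts
    reverse = atom2_smarts + bond_smarts + atom1_smarts
    return min(forward, reverse)

For example, canonical_bond_type("[#6]", "-", "[#8]") and canonical_bond_type("[#8]", "-", "[#6]") both give "[#6]-[#8]", so the two directions of the same bond get the same type.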

[slide 27]
Once I have the canonical bond types, I can remove (or mark as "do not participate") the bonds whose types don't occur in at least M of the structures, because there's no way they can be part of the MCS. Bond removal may split a structure into fragments.

Order the molecules by the size of the largest fragment, such that the smallest largest fragment is first. This is a common technique, which should help improve the overall performance.

[slide 28]
(2) There are generally two ways to search for an MCS: either start by looking for a one bond match, then two bonds, three, and so on, or start from the smallest largest fragment and remove bonds.

I decided to be optimistic and try the smallest largest fragment first, before starting to build up the subgraphs. I expect that the input to the MCS algorithm will be similar structures or structures which are already known to have a common core. In that case, the pruning from step 1 might have been enough so that the smallest largest fragment is the actual MCS.

So, I take the largest fragment, turn it into a SMARTS pattern, and do the subgraph match to see if there are at least M-1 matches. If so, the algorithm is done.

Subgraph enumeration

If not, then it's time to start enumerating the substructures.

[slide 29]
The literature shows that subgraph enumeration is not obvious. The Varkony algorithm suffered from generating duplicate structures; I haven't come across a description of the enumeration method when looking through the cheminformatics literature; and it's not typically covered in a graph theory course.

[slide 30]
The basic concept is quite simple. Start by generating all subgraphs which use the first bond, and then generate all subgraphs which don't use the first bond. Repeat the process for the second bond, then the third, and so on. This gives you all 2**n ways to make a subgraph from n bonds, including the empty graph, and does so uniquely.
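Here's a minimal sketch of that include/exclude scheme, written as a recursive Python generator over bonds numbered 0 to n-1; the function name is mine, not from any published code.

def all_bond_subsets(n, i=0, chosen=()):
    # Enumerate every subset of the n bonds exactly once, including the empty one.
    if i == n:
        yield chosen
        return
    # subgraphs which use bond i ...
    for subset in all_bond_subsets(n, i + 1, chosen + (i,)):
        yield subset
    # ... and subgraphs which don't.
    for subset in all_bond_subsets(n, i + 1, chosen):
        yield subset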

This simple method generates both connected and disconnected subgraphs but fmcs only uses connected subgraphs so I use a slightly different algorithm.

[slide 31]
I create an initial set of seeds and put them into a priority queue. The terminology and idea of a 'seed' is very similar to the 'nucleus' in Varkony. A seed contains three things: the set of bonds which are in the seed, the set of bonds which may not be used for future growth, and the set of bonds connected to the seed which are available for future growth.

If an initial seed contains bond i then bonds 1 to i are specifically excluded from future growth. This prevents duplication.

[slide 32]
For each iteration, I pick a seed and grow it. If there are n possible outgoing edges then there are 2**n - 1 possible ways to grow the seed into a new seed. For example, if there are two outgoing edges then I could choose both edges, choose the first, or choose the second. These new seeds go back into the priority queue. The n outgoing edges from the parent graph are marked as unavailable for future growth by the child, and new outgoing edges are found.
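A sketch of that growth step, under the assumption that a seed's available outgoing bonds are kept in a list; the full seed bookkeeping (bonds in the seed, excluded bonds, and newly exposed bonds) is omitted here.

from itertools import combinations

def growth_choices(outgoing):
    # Yield all 2**n - 1 non-empty subsets of the n outgoing bonds.
    n = len(outgoing)
    for size in range(1, n + 1):
        for subset in combinations(outgoing, size):
            yield subset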

[slide 33]
I sort the priority queue of seeds by size. As an unusual variation compared to most MCS algorithms, the fmcs algorithm can grow by multiple bonds per step, then backtrack if that fails. I wrote it this way because of the heritage of how I came up with the algorithm. The fmcs approach is optimistic, but I've not tested to see if it's better than a more conservative single bond expansion.

[slide 34]
I also implement some of the usual pruning methods in MCS programs. If I can show that fragment growth can never beat the current largest subgraph then I don't bother to enumerate that part of the graph. There are also some optimizations that I haven't yet added, like keeping track of the maximum count of each bond type common to all of the structures and using that as a more fine-grained upper bound.

SMARTS subgraph isomorphism instead of canonicalization

[slide 35]
Then comes the key difference from Varkony. I convert the subgraph into a SMARTS pattern and check that it exists in at least M-1 of the remaining structures.

If it doesn't, then discard this seed and don't try to grow it, because no larger seed can be part of the MCS.

However, if it does, then add its children to the priority queue. If the seed is larger than the current best common substructure size, then it's an MCS candidate. fmcs supports an option that was in MultiMCS, which is that if a ring bond from a structure is in the MCS then the corresponding bond in the MCS must also be in a ring. This option is called "complete rings only." It's implemented with a special post-processing step.

If the candidate passes these requirements, then it's the new best MCS.

[slide 36]
There's an optimization I can do as a consequence of turning the subgraph query into a SMARTS. I can easily cache the subgraph isomorphism search results so if I see something like "[#6]-[#6]-[#6]" again, I can reuse the cached result rather than do the isomorphism tests again.
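A sketch of that cache idea, using RDKit's SMARTS matcher; 'targets' is assumed to be a list of already-parsed RDKit molecules, and the cache here simply maps a subgraph SMARTS to its match count.

from rdkit import Chem

_match_cache = {}

def count_matches(smarts, targets):
    # Reuse the earlier answer if this SMARTS has already been tested.
    if smarts not in _match_cache:
        pattern = Chem.MolFromSmarts(smarts)
        _match_cache[smarts] = sum(
            1 for mol in targets if mol.HasSubstructMatch(pattern))
    return _match_cache[smarts]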

[slide 37]
If you think about this a bit you might realize that there are multiple ways to represent the same search using SMARTS. Perhaps I can save time by canonicalizing the SMARTS strings, in order to reduce the number of duplicate subgraph isomorphism searches.

[Note: The general category of "canonical SMARTS" is NP-complete, as Roger Sayle pointed out in a 1997 Daylight EuroMUG meeting. However, the fmcs atom expressions are simple, so canonicalizing them is trivial.]

Greg Landrum added a feature to RDKit so I could test this out. I found out that canonicalization in general took too much time. Remember, canonicalization is only needed if there is backtracking where different subgraphs are isomorphic and represented differently, and importantly, only when the canonicalization time is faster than the isomorphism search time. The combination of these factors was rare in my test benchmarks. For example, in one experiment I needed to find the MCS of over 1,000 structures before canonicalization was faster. I don't expect that to be a common case.

[slide 38]
Instead, I developed what I've been calling semi-canonical SMARTS. For each atom in the subgraph, I assign it an initial atom invariant based on the atom type and the lexicographically ordered list of its subgraph bond types. This is a circular fingerprint of size 2, and corresponds to a single iteration of the Weininger et al. CANGEN algorithm. These invariants are fast to generate, and unique enough to give usually-canonical subgraphs.
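In outline, the invariant might look like the following sketch, where 'atom_types' and 'bond_types' are assumed lookup tables for the subgraph rather than fmcs's real data structures.

def initial_invariant(atom_index, atom_types, bond_types):
    # Atom type plus the lexicographically sorted types of its subgraph bonds:
    # a circular environment of radius 1 around the atom.
    return (atom_types[atom_index], tuple(sorted(bond_types[atom_index])))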

[slide 39]
Finally, in my code I check for a timeout. This is, after all, an NP-hard problem, and there are times when I would rather give it a 30 second timeout for an approximate solution than spend hours waiting for a result.

MCS benchmark timings

[slide 40]
Now for the scary question: How well does this new algorithm work?

To examine that, I had to create benchmarks which reflect how I think the MCS program might be used. (Actually, I would have used existing benchmarks, but the only ones I could find were too small to be informative.)

I started with a cleaned up version of ChEMBL 13 and generated 7 different benchmarks. The first selects two different compounds at random and finds the MCS between them. The others are based on fingerprint similarity searches. I found the k=2, k=10, and k=100 nearest neighbors for fingerprints picked at random, and I found the k=100 nearest neighbors for thresholds of 0.95, 0.90, and 0.80.

The benchmarks and the code used to generate them are under the BitBucket account for fmcs.

[slide 41]
The benchmarks include cases which cause all of the MCS programs to time out after 30 seconds. When a timeout occurs, fmcs will report the largest substructure found by that point, which is usually pretty good, and it means that fmcs can also be used as an approximate MCS program with a fixed upper bound on CPU time. Not all MCS programs can report a partial size.

[slide 42]
I was able to time fmcs against three other methods: extractCommonScaffold from the Indigo C++ cheminformatics toolkit, the MultiMCS maximum clique Java program I mentioned earlier, and MoSS, a frequent subgraph mining Java program that I'll mention in a bit.

It's difficult to compare the results directly. Unfortunately, most cheminformatics toolkits like to reperceive structures on input so they follow consistent aromaticity rules. This is what the Daylight toolkit did, and many follow its lead. While everyone agrees that "C1=CC=CC=C1" should be perceived as "c1ccccc1", they all differ with more complex fused multi-ring systems. Since the MCS search methods depend on bond types, different perceptions of an input structure can sometimes lead to different MCS results.

I inspected many of the differences until I started to go cross-eyed. As far as I can tell, the different programs only differ due to a difference in aromaticity perception.

[slide 43]
It's important to note that this benchmark does not give a perfect head-to-head comparison. Indigo's extractCommonScaffold and MoSS (when configured to find substructures with 100% support) return all maximal subgraphs, which means they are doing more work than fmcs, which only finds one maximum subgraph. Also, these 4 benchmarks were done on three different computers, so there can be a factor of 2 or more variation just because of the different hardware.

Still, it's enough to point out that fmcs isn't all that bad. I've circled the fastest times in red and the second fastest in yellow. It looks like MoSS is overall the fastest algorithm, but not always.

[slide 44]
fmcs definitely isn't the fastest method, but I'm okay with that. For one, most of fmcs is written in Python. A profile shows that less than 5% of the time is spent in C++. Presumably, if more of fmcs were in C++ then there might be a 5-10x speedup... which sounds suspiciously like what the authors of the Varkony paper said about their implementation!

Also, there are some additional pruning heuristics that I haven't yet added.

MCS threshold profile

[slide 45]
I mentioned that fmcs finds the MCS of M of N structures, but I didn't give any real justification for why that might be useful. I wanted to see if I could use the MCS to find misclassifications in a cluster and give some sort of automatic quality score. It's pretty common to use similarity or some other means to group structures together, then try to find the MCS of the group. Unfortunately, the MCS is a rather fragile definition. A single misclassified structure can cause a very big difference in the result. I wondered if an MCS definition which might leave one or two structures out would provide useful information.

[slide 46]
I used ChEBI as my test case. ChEBI is a database of small molecules placed into an ontology by human curators. Some of the ontology terms are structure related. While terms like "contains two rings" cannot be defined as a SMARTS pattern, others can in principle be defined by SMARTS.

There are occasional misclassifications in the ChEBI ontology. For example, one of the nine 1,4-benzodiazepinones shown here is incorrect. If you look at the depictions for a bit you might be able to figure out which one is wrong.

[slide 47]
Try to find the MCS of all 9 and you'll see that the expected benzodiazepinone core structure isn't found.

[slide 48]
But bring it down a step and find the MCS which is in at least 8 structures and you'll see the expected benzodiazepinone core in 8 of the structures, and quickly see the one misclassified structure. In this case it's a 1,5-benzodiazepinone.

[slide 49]
What I then did was compute the MCS size (as the number of atoms) for values of M from 4 to 9. (I actually used a binary search here so I didn't have to test all possible values of M. The MCS for M=6 is the same size as the MCS for M=8, which meant there was no need to compute the size for M=7.)
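The search works because the MCS size can only stay the same or shrink as M increases, so if the sizes at the two ends of an interval are equal, every value in between must be the same. Here is a small sketch of that idea, where 'mcs_size(M)' is an assumed function returning the MCS atom count for a support of at least M structures; it is not the code I actually used.

def threshold_profile(mcs_size, low, high, profile=None):
    if profile is None:
        profile = {low: mcs_size(low), high: mcs_size(high)}
    if high - low > 1 and profile[low] != profile[high]:
        mid = (low + high) // 2
        profile[mid] = mcs_size(mid)
        threshold_profile(mcs_size, low, mid, profile)
        threshold_profile(mcs_size, mid, high, profile)
    elif profile[low] == profile[high]:
        # The sizes agree at both ends, so fill in the values in between.
        for M in range(low + 1, high):
            profile[M] = profile[low]
    return profile

For the benzodiazepinone example the call would be threshold_profile(mcs_size, 4, 9).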

The plot of M vs. MCS size gives me a threshold profile. An abrupt dip in the profile indicates that there's something I might need to look at.

I computed the profile with three different topology parameters. One profile requires that element types and bond types match, another only requires that element types match, and the last only looks at the untyped topology. This gives me different sorts of information which can help identify the reason for a mismatch.

[Note: the pure topology profile never proved useful.]

[slide 50]
To see if MCS threshold profiles can be useful, I developed a prototype tool to explore the ChEBI ontology. The ontology terms are on the left. Click on a term and it shows the corresponding profiles. The profile points are selectable. Clicking on a point shows the SMARTS pattern for that point, displayed on the right using Karen Schomburg's excellent SMARTSview. The structure members in the term are displayed on the bottom, sorted so the structures which don't contain the SMARTS are shown first, and structures which match have the match highlighted.

[slide 51]
The benzodiazepinone mismatch showed up directly in the profile based on atom and bond types. Sometimes it's the difference between two different MCS profile types which gives a useful signal. There appears to be a registration problem with the pyrazoles because five of them are not perceived as being aromatic by RDKit. In this case the profile based on bond types has a dip while the one which ignores bond types stays constant. This is often a signal that there is an aromaticity problem.

[slide 52]
The monocarboxylic term is interesting because it shows some of the different types of issues that come up with this approach. The SMARTS [#6]-[#6](=[#8])-[#8] matches 1424 of 1435 structures. However, this SMARTS requires a carbon in the R1 position, and there are two false negatives with structures which have a nitrogen in that position. (An obvious next step, which I haven't done, would be to use the SMARTS pattern to find additional structures which should have been classified as monocarboxylic.)

There are four true negatives, where the structure simply isn't a monocarboxylic acid. Finally, five of the structures show an implementation bug. The original ChEBI structures are in an SD file, which has a mechanism to denote an R-group. My prototype tool uses SMILES strings, which don't have an R-group notation. The conversion from SD records to SMILES converted the R-group into a "*", which is treated as its own atom type.

[slide 53]
I examined many ChEBI terms. Nearly all of them either had no problem or were grouped by a structural term that could not be described by a SMARTS. With practice, I found that the MCS profile could help me identify possible problem areas, but I ended up being unable to come up with an automatic quality score.

Frequent subgraph mining

[slide 54]
Here's where the history gets interesting.

I presented the fmcs algorithm last fall at the German Conference on Cheminformatics (GCC). Two people asked me if the fmcs solution wasn't just the same as frequent subgraph mining.

[slide 55]
I had no idea - I hadn't heard of frequent subgraph mining. So I read some of the papers from that field. They nearly all come from data mining, and use a somewhat different and perhaps more difficult language than I'm used to, so it took me a while to figure them out.

[slide 56]
The best known frequent subgraph mining programs seem to be gSpan and MoSS. Reading them I get the strong sense that these are written by people who don't really know the underlying chemical motivation. For example, I interpret the original MoSS paper to say that they don't really understand why chemists are interested in closed molecular fragments, but they have a conjecture. Still, since they know that's what chemists want, there are some optimizations they can do to handle that.

[slide 57]
These papers introduce new relevant terms for me, which I should apply to the fmcs description. Where I say "in at least M structures", they say "has a support of M." A closed molecular fragment is a fragment with support M where there is no larger fragment which contains the fragment and which also has support M. In my terms, it's a maximal common substructure to at least M of the structures.

[slide 58]
Frequent subgraph mining is often used to find discriminative fragments, that is, fragments which are in the actives but are rarely shared by the inactives. This is actually a more sophisticated approach than most of the MCS papers, which tend to focus on actives and ignore information about inactives. I was a bit chagrined when I read about this because, while obvious in retrospect, I had never really thought about that issue in the MCS context.

gSpan and MoSS

[slide 59]
The gSpan and MoSS programs look at all of the possible embeddings across all of the structures. This can generously be seen as an extension of the McGregor method to N structures. They start with the empty set then work on all unique embeddings using one atom, then one bond, two bonds, and so on. At each point they keep track of the support for that embedding. If it's too low then they prune.

The fundamental difference between gSpan and MoSS is how the graph embedding tree is traversed; gSpan uses a breadth-first search (Apriori method) while MoSS uses a depth-first search (Eclat method).

[slide 60]
The problem with either approach is that the number of embedded graphs can be huge. In a toy case there might be only a couple of ways to match a simple pattern like "O-S-C" ...

[slide 61]
... but in more complex cases, like with rings, it can lead to "prohibitively high search complexity."

The original MoSS paper partially handled the complexity through a local solution, which prevented some of the explosion where a benzene ring can cause 12 different embeddings.

[slide 62]
Subsequent work solved it through clever use of subgraph canonicalization. MoSS keeps track of the canonical representation for each substructure. When there are two different embeddings for the same substructure it uses the one which is lexicographically smallest. It does this by introducing a postfix version of SMILES, which is a perfect match to the depth-first machinery. The method is quite aware that the addition of a new bond can sometimes reorder the canonical representation, so it carefully monitors for all of those possible cases.

I've been doing cheminformatics for quite a while, and have never come across this approach before. I think it's brilliant and insightful thinking.

[slide 63]
It's my hypothesis that there's another way to describe the MoSS algorithm. Instead of thinking of the canonicalization as a way to remove duplicates, an alternative is to generate all canonical subgraphs in lexicographic order, and check for valid support across all of the structures.

If so, then that's part of the same intellectual family as the Varkony paper, where MoSS overcame many of the problems that Varkony had.

[slide 64]
What's amazing about this is that MoSS is designed as a frequent subgraph mining program, and not an MCS program. It finds the MCS as an edge case by setting the support threshold to 100%, but it can do much more. Yet in the timings, its performance was at least comparable to dedicated MCS programs!

Equally amazing is that it doesn't seem that the frequent subgraph mining people realize that they are doing a superset of the MCS algorithm. That connection is not mentioned in any of the frequent subgraph mining papers that I read, and you'll see later that gSpan avoids using subgraph isomorphism because it's NP-hard, seemingly not realizing that gSpan is itself solving an NP-hard problem.

"ad hoc" MCS methods

[slide 65]
Let's go back to the Raymond and Willett review paper. They classified Varkony as an "ad hoc" method. I used to think that was a bit odd, since fmcs can be seen as a variant of the Varkony approach, and fmcs's performance is pretty good.

[slide 66]
It wasn't until I learned more about frequent subgraph mining that I realized that Raymond and Willett are right. I think that Varkony et al. is miscategorized as an MCS paper, when it actually describes a frequent subgraph mining program, one which finds all subgraphs with 100% support. As such, it does a lot more work than it needs to in order to find the MCS, in part because it lacks pruning methods.

With that in mind, I reread the Varkony paper for the fourth or so time and saw the single line that validated my belief in my interpretation: "There are three options to display results of the search for structural similarities: (a) All. The program reports all common substructures of all sizes."

That's why I say that the Varkony paper should be seen as one of the earliest frequent subgraph mining papers.

The change to find all substructures with a given support is trivial, but as "frequent subgraph mining" didn't start until about 20 years later they likely didn't realize that it could be an interesting new field.

[slide 67]
Going again back to the review paper, Raymond and Willett also mention in passing that there was a modified version of the Varkony algorithm subsequently proposed by Takahashi in 1987. I read through that paper, and realized that it replaced canonical subgraph generation with subgraph isomorphism. This is the intellectual predecessor to fmcs!

[slide 68]
Unfortunately, it's hard to figure that out from the paper. The English is difficult to follow and the subgraph isomorphism algorithm they use comes from Sussenguth 1965, which I hadn't heard of before. I'm surprised that a 1987 paper wouldn't have used the 1976 Ullmann method.

Based on the text, the Sussenguth algorithm is apparently very slow. Takahashi says that it's "the most time-consuming phase of the computing flow", and that canonicalization is "an indispensable prerequisite" for performance. However, in my own tests, subgraph isomorphism testing is fast and canonicalization is not required. I conclude this is due to RDKit's use of VF2 instead of the Sussenguth algorithm.

Thus, I think most people who read Takahashi believe that it, like Varkony, requires canonicalization, when it actually doesn't. The only reason I figured this out was that I implemented fmcs and showed that to be the case.

Putting those details aside, it was quite exciting to read that someone else had previously worked on this approach, and that I had managed to extend it in new and interesting ways.

[Note: I've researched a bit more of the data mining literature, and found that a few of them, like the book "Mining Graph Data" edited by Diane J. Cook and Lawrence B. Holder, with the chapter "Discovery of Frequent Substructures" by Xifeng Yan and Jiawei Han (the gSpan authors), reference Takahashi as a graph related method in cheminformatics, but didn't notice that Takahashi is actually a variation on the Varkony paper from 10 years earlier.]

Why so little previous work like fmcs?

[slide 69]
The obvious followup question is, why haven't any other people tried the subgraph isomorphism approach before? I've shown that fmcs has decent performance, and it's not that hard to understand.

Searching the literature gives some possible clues. Going back to the Armitage paper from 1967, they say "substructure search is useless here, unless one searches for every substructure of one compound in the structure of the second." That, combined with pruning, is the fmcs algorithm. But in reading the paper, it looks to be a throw-away line meant to imply that it's a completely unfeasible approach. Which, to be fair, it was for the hardware of that era.

But more than that, we needed to develop a better understanding of backtracking and pruning, and improved subgraph searches with first Ullmann and then VF2 before the fmcs approach could be practicable. By the late 1990s though, these were pretty well understood.

I suspect it was a combination of other factors. The clique approach is quite elegant, and Raymond and Willett showed that it was faster than the usual McGregor backtracking approach. There aren't that many people working on the MCS problem in chemistry, and even fewer working on the multiple structure MCS problem, so progress is slow.

Then there's the shaky feeling that the Takahashi/fmcs approach is too complicated to work well. fmcs goes from subgraph to SMARTS and back to subgraph, and potentially wastes a lot of time re-finding the same substructure only to add a single bond to it. And it's true - I can conceive of faster implementations. But on the other hand, a lot of work is put into making SMARTS matchers go fast, and that's work I don't need to redo.

The original gSpan paper gives another clue. The computer scientists avoided using subgraph isomorphism tests because it's "an NP-complete problem, thus pruning false positives is costly." It's always a bit scary to invoke NP-complete problems.

While they are correct, they didn't notice two things. First, in practice isomorphism matching of molecules is very fast, and second, as I showed, when an algorithm like gSpan or MoSS is used to find all substructures with 100% support, it ends up identifying the maximum common substructure ... which is itself NP-complete.

[Note: This comment in the gSpan paper is one reason why I don't think the frequent subgraph mining researchers quite realize its connection to the MCS problem. Another is that I asked Thorsten Meinl about it, and he doesn't know of anyone who has specifically pointed that out.]

Future work

[slide 70]
So, where do we go for the future?

There are any number of features to add to fmcs. For example, it's pretty straightforward to add support for a "*" as an alternate symbol for, say, a ring atom. This would help identify cases where there's a single atom substitution. It's also possible to require a certain initial subgraph to be part of the result, add additional pruning heuristics, and recode the algorithm in C++. This is an engineering task which "just" requires time and funding.

That said, I came into this problem because of thinking about canonical subgraph enumeration, only to find that isomorphism searching was more useful for this topic. However, I think canonical subgraph enumeration is a fruitful research area for the future.

The main reason why is that canonical subgraphs can be precomputed, and processed as fast string comparisons.

[slide 71]
For example, when clustering N compounds, it's pretty easy to compute, say, the canonical subgraphs with up to 7 atoms. There are only, on average, a few thousand per structure. To compare structure i with j, first look at the intersection of their subgraphs. If the MCS is 6 atoms or less, then you have the answer. Otherwise, you know the MCS has to be at least of size 7.
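As a sketch of that shortcut: 'subgraphs_i' and 'subgraphs_j' are assumed to be the precomputed sets of canonical subgraph strings for the two structures, and 'num_atoms' an assumed helper which reports the atom count of such a string.

def mcs_size_bound(subgraphs_i, subgraphs_j, num_atoms, limit=7):
    # Returns (size, is_exact). Below the precomputed limit the size is exact;
    # at the limit we only know the MCS has at least that many atoms.
    shared = subgraphs_i & subgraphs_j
    if not shared:
        return 0, True
    best = max(num_atoms(s) for s in shared)
    return (best, True) if best < limit else (limit, False)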

This might speed up clustering by quite a bit.

[slide 72]
NextMove Software has done the most advanced work in this field, to my knowledge. They precompute the canonical subgraphs for all of the structures in the data set, with links between each parent and child. If a child has multiple parents then that child is a common subgraph to all of those parents.

[slide 73]
Finding the MCS between any two structures then reduces to a graph traversal of a (very large) lattice to find the highest common node.

[slide 74]
I've also been thinking about using canonical subgraphs as perfect substructure keys for substructure search. Again, generate all canonical subgraphs for the structures in the database, up to a given size, and use these as keys for the inverted index. When a query structure comes in, enumerate its subgraphs, look up the corresponding inverted indices, and find the intersection. If a query subgraph doesn't have an inverted index, then no superstructure for the query exists in the database.
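A sketch of the inverted index, where 'enumerate_subgraphs(mol, max_size)' is an assumed function returning the canonical subgraph keys for a parsed structure; the index itself is just a dictionary from key to a set of structure ids.

def build_index(structures, enumerate_subgraphs, max_size=7):
    index = {}
    for struct_id, mol in structures.items():
        for key in enumerate_subgraphs(mol, max_size):
            index.setdefault(key, set()).add(struct_id)
    return index

def candidate_superstructures(query_mol, index, enumerate_subgraphs, max_size=7):
    candidates = None
    for key in enumerate_subgraphs(query_mol, max_size):
        hits = index.get(key)
        if not hits:
            return set()  # a query subgraph never occurs, so nothing can match
        candidates = set(hits) if candidates is None else (candidates & hits)
    return candidates if candidates is not None else set()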

What's pretty neat about this is that an off-the-shelf text search engine like Lucene should be able to handle this sort of data without a problem.

[slide 75]
On the other hand, most people don't want to change their database environment to include a new database engine, so the third idea I've had is to improve substructure screen key selection.

There are two main types of substructure screens: hash-based, like the Daylight fingerprints, and feature-based, like the MDL MACCS keys. The former tend to be more common because they handle arbitrary chemistry and are easy to tune to improve selectivity. The latter are usually made by hand, and the only other well-known example I can think of is the CACTVS substructure keys used by PubChem.

[slide 76]
Substructure screens are a sort of decision tree, and it's best if each bit can provide a discrimination of 50%. If that were the case then a 128 bit fingerprint would be able to provide very good screening ability. However, hash bits are correlated, which is why we need more bits, while feature key bits tend to have low density - only about 150 of the unique fragments in PubChem have a density above 10%, and many of these are highly correlated.

Unique canonical subgraphs can be used to characterize the universe of possible decisions. I've wondered if it were possible to analyze a set of representative queries and targets to improve screening selection. Importantly, a bit should be able to have multiple fragments associated with it, in order to improve its discrimination ability.

[Note: This is rather like optimizing the 20 questions problem, only all 20 questions must be selected beforehand. Quoting from Wikipedia: "Careful selection of questions can greatly improve the odds of the questioner winning the game. For example, a question such as 'Does it involve technology for communications, entertainment or work?' can allow the questioner to cover a broad range of areas using a single question that can be answered with a simple 'yes' or 'no'."]

Unfortunately, I have not been able to come up with a good algorithm for this idea.

[slide 77]
This research involved a lot of software, and interaction with many people. I thank all of them for their help and support. Also, this presentation required digging into the early MCS literature. I thank the Chalmers University of Technology library for having most of the literature still on the shelves in paper, which meant I could do this research without hitting expensive paywalls all the time.

Comments?

Let me know if you have any comments or questions.

Is TDD better than Apps Hungarian? #

Greg Wilson and Jorge Aranda's essay Two Solitudes describes the different viewpoints and attitudes of software engineering researchers and practitioners. Here is one of the complaints:

At industry-oriented gatherings, it seems that a strong opinion, a loud voice, and a couple of pints of beer constitute "proof" of almost any claim. Take test-driven development, for example. If you ask its advocates for evidence, they'll tell you why it has to be true; if you press them, you'll be given anecdotes, and if you press harder, people will be either puzzled or hostile.

Most working programmers simply don't know that scientists have been doing empirical studies of TDD, and of software development in general, for almost forty years. It's as if family doctors didn't know that the medical research community existed, much less what they had discovered. Once again, the first step toward bridging this gulf seemed to be to get each group to tell the other what they knew.
Greg worked with Andy Oram to collect and edit the essays in Making Software: What Really Works, and Why We Believe It, as a way to present some of that research evidence to software practitioners. The Two Solitudes essay refers to that book, saying its "topics range from the impact of test-driven development on productivity (probably small, if there is one at all) to whether machine learning techniques can predict fault rates in software modules (yes, with the right training)."

Why is there a debate over TDD?

Did you catch the "probably small, if there is one at all" part? Why then is there a continued debate over the effectiveness of TDD when the conclusion from research is that there is at best only a small improvement in productivity? The Two Solitudes essay goes into some of the possible reasons. You should read it.

That got me to thinking about other well-accepted or well-rejected practices: version control software and Apps Hungarian notation. We as practitioners generally accept that version control software is a Good Thing and generally view Apps Hungarian as a negative. If we "know" that version control systems are useful, then can software engineering researchers find experimental evidence to support or reject the claim? If they have, then it might be an example of what a successful research experiment might look like. If they can't, then perhaps it says something about the weakness of the current experimental approaches compared to personal experience.

We can ask the same about other development practices. Can software engineering research show whether Apps Hungarian improves productivity over other variable naming schemes?

With those experiments in place we could start asking questions like "are the gains from introducing TDD to the process more like the gains in introducing Apps Hungarian, or more like the gains in introducing Mercurial?" Or more crudely, "is TDD more effective than Apps Hungarian?"

(For "TDD" put pair programming, unit tests, coverage analysis, version control systems, or most any other practice.)

So let's ask the software engineering researchers if they can provide evidence for those beliefs of the practitioners. Perhaps they've done it already?

Do version control systems make people more productive?

As far as I can see (which isn't very far), no one has done experimental research to tell if a version control system makes people more productive. I asked Greg, and he was a bit chagrined to reply that he didn't know of any published research either. Yes, he promotes that everyone use a version control system, despite not knowing of research evidence to back that belief.

Curious, isn't it?

My challenge to researchers is, can you carry out experiments to test the effectiveness of version control software for different types of projects?

This research, like TDD, is complicated by many real-world factors: there are alternatives to using version control software, different development scenarios, different types of developers, and even different needs.

It sounds like a lot of work to test these possibilities when we already know the answer, but that's my point. If research methods can't address this question then how do we trust those methods on other topics, like TDD or pair-programming?

Perhaps the research will point out something interesting. Perhaps it will find that research projects which take less than three months and have only one developer are 10% more effective using an automatic versioning file system instead of a manual version control system. If that were true, perhaps we should recommend that most scientists use a versioning file system instead of version control.

Or perhaps we should develop better connections between the various file system monitoring tools (like inotify) and the version control systems, to give an automatic/no-touch interface to the version control system.

For that matter, some people start continuous integration tests based on checkins. Why not do CI based on filesystem changes? Obviously some will commit less often, but perhaps productivity will improve.

I can continue to conjecture, but of course, without evidence it's little better than blather. The world is full of blather. That's why we need research.

Does Apps Hungarian make people more productive?

Hungarian Notation is a classic topic. Quoting from Michael Feathers on the c2.com wiki page:

Hungarian notation inspires some of the most bitter religious wars among programmers. Detractors claim that it is indecipherable and that no real standards exist for its usage. Adherents claim that it becomes more likeable with use, and the instant mnemonic recognition of type that one gains with familiarity aids program comprehension immensely.

Hungarian notation seems to be in decline. Microsoft's .Net "General Naming Conventions" document says outright "Do not use Hungarian notation." Nor is Hungarian the de facto style in any of the few dozen languages that I'm familiar with, though some libraries, especially those with a Microsoft heritage, do use it.

This is despite the efforts of people like Joel Spolsky to promote a distinction between Apps Hungarian and System Hungarian. The former is "extremely valuable" while the latter is from "the dark side."

The Wikipedia page lists other "notable opinions" both for and against.

Some people say it's useful in untyped languages. Others say Hungarian Notation just gets in the way. Is there any research to back up any of these opinions, or are variable naming schemes yet another unexplored field?

Does TDD make people more productive?

I'm picking on TDD, but this section can be applied to many other topics.

In any TDD debate you'll hear from people who find it useful, and those who don't. You'll hear about people who tried TDD but didn't find it useful, and you'll hear from others who want to help the first group do TDD correctly. And you'll hear from people who tried git and found it confusing ... though they often move to other version control systems instead.

Look at the old Hungarian discussions and you'll see the same patterns, including people who tried it, liked it, won't move away, and believe that others just don't understand Hungarian correctly.

It looks like the effectiveness of TDD has had many more studies than the effectiveness of either version control systems or Hungarian Notation, so let's use that as the reference. Is the difference in effectiveness between using PEP-8 style names and Apps Hungarian more or less than the difference between TDD and coverage-based testing? Is the difference in effectiveness between using version control instead of manual backups more or less than the difference between TDD and test-last?

And more importantly, are the answers to those questions opinions based on experience or are they backed by evidence? Can we use these experiments to calibrate our understanding of what software engineering researchers can identify?

Feedback

Do you know of any research which tests the effectiveness of version control systems over other approaches? What about Apps Hungarian over other approaches? Perhaps you have a "strong opinion," a "loud voice," or have had a "couple of pints of beer" and want to leave a comment?

I sent the seed of this essay to Greg, and he posted it on "It will never work in theory." I think that's the best place to continue this discussion, so I hope to see you there.

I'll end with a quote from Peter Medawar's "Advice to a Young Scientist":

I cannot give any scientist of any age better advice than this: the intensity of a conviction that a hypothesis is true has no bearing on whether it is true or not. The importance of the strength of our conviction is only to provide a proportionately strong incentive to find out if the hypothesis will stand up to critical evaluation.

Postscript

The above essay places the entire burden on software engineering researchers to do new experiments. I don't mean for things to be so one-sided. This was also meant for software practitioners.

Software practitioners claim a lot of things. Some, like version control systems, make sense. Hungarian notation, not so much. I don't think TDD makes much sense, and the research seems to back me up.

If you are a software practitioner, then take that final quote from Medawar to heart. You can make claims about your method X or approach Y, but in the back of your head, remember that I'll be asking you "how do you know that your approach is any better at improving productivity than switching to Apps Hungarian?" You won't be able to answer that question ... unless X or Y is "Apps Hungarian," of course.

At best you'll point to success cases, and then I'll point to the success of Microsoft Word and Excel in the 1990s, and to shops like Fog Creek.

Of course, your proper response will be "but you use version control, right? Where's your evidence for that?" Sigh. Researchers? Help!



Interview with Igor Filippov about OSRA #

A new "Molecular Coding" podcast is out! I interviewed Igor Filippov about OSRA. We recorded it at ICCS in summer of 2011 but I didn't do the editing and transcript until the last couple of days. Good news on the personal side. After about two years of temporary housing, we have finally moved into some place more permanant, in Trollhättan, Sweden. (Yes, these sentences are related.)

Tomorrow I leave for the 2012 German Conference on Chemoinformatics, in Goslar, Germany. I present the fmcs work with Janna Hastings on Monday at 12:05pm, and I have a poster about chemfp at poster P38.

I hope to see some of you there. Let me know if you listen to the podcast!



Copyright © 2001-2013 Andrew Dalke Scientific AB