Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2014/09/25/parse_molecule

chemfp's parse_molecule()

I used several use cases to guide chemfp-1.2 development. One, under the category of "web service friendly", was to develop a web service which takes a structure record and optional format name as input for a k=3 nearest neighbor search of a pre-loaded target fingerprint file. That doesn't sound so hard, except I'll let the fingerprint file determine the fingerprint type based on its 'type' header, which might specify Open Babel, RDKit, or OpenEye's toolkits.

Those toolkit all have different APIs, but I would prefer not to write different code for each case. Chemfp-1.1 can be hacked to handle most of what's needed, because read_structure_fingerprints() will read a structure file and computes fingerprints for each record. The hack part is save the XML-RPC query to a file, since read_structure_fingerprints() only works from a file, not a string.

That isn't a full solution. Chemfp-1.1 doesn't support any sort of structure output, so there would need to be toolkit-specific code to set the tag data and convert the molecule to a record.

For chemfp I decided on the classic approach and make my own toolkit-independent API, with a back-end implementation for each of the supported toolkits. I think it's a great API, and I've been using it a lot for my other internal projects. My hope is that some of the ideas I developed go into other toolkits, or at least influence the design of the next generation of toolkits.

To see a more concrete example, here's that use case implemented as an XML-RPC service using the new chemfp-1.2 API.

from SimpleXMLRPCServer import SimpleXMLRPCServer

import chemfp

# Load the target fingerprints and figure out the fingerprint type and toolkit
targets = chemfp.load_fingerprints("databases/chembl_18_FP2.fps")
fptype = targets.get_fingerprint_type()  # uses the 'type' field from the FPS header
toolkit = fptype.toolkit

print("Loaded %d fingerprints of type %r" % (len(targets), fptype.get_type()))

def search(record, format="sdf"):
    # Parse the molecule and report any failures
        mol = toolkit.parse_molecule(record, format)
    except ValueError as err:
        return {"status": "error", "msg": str(err)}

    # Compute its fingerprint, search the targets, and get the nearest 3 neighbors
    fp = fptype.compute_fingerprint(mol)
    nearest = targets.knearest_tanimoto_search_fp(fp, k=3, threshold=0.0)

    # Add a tag value like "CHEMBL14060 1.00 CHEMBL8020 0.92 CHEMBL7970 0.92"
    (id1, score1), (id2, score2), (id3, score3) = nearest.get_ids_and_scores()
    toolkit.add_tag(mol, "NEAREST", "%s %.2f %s %.2f %s %.2f" %
                    (id1, score1, id2, score2, id3, score3))

    return {"status": "ok",
            "record": toolkit.create_string(mol, "sdf")}

server = SimpleXMLRPCServer(("localhost", 8000))

if __name__ == "__main__":

I think this is a clean API, and a bit easier to understand and use than the native toolkit APIs. It's very similar to cinfony, though I think at this level cinfony is bit easier to understand because it wraps its own Molecule object around the native toolkit molecules, while I leave them as native molecule objects. I have to use helper functions where cinfony can use methods. I did this because I don't want the performance overhead of wrapping and unwrapping for the normal use case of converting a structure file to fingerprints.

I also have more stand-alone objects than cinfony, like my fingerprint type object for fingerprint generation, where cinfony uses a method of the molecule.


That example, while illustrative of the new API, isn't significantly better than existing systems. For that, I need to delve into the details, starting with parse_molecule() .

Chemfp has three "toolkit" APIs, one for each of the supported toolkits. The usual way to get them is through one of the following:

from chemfp import rdkit_toolkit
from chemfp import openeye_toolkit
from chemfp import openbabel_toolkit
Most of the examples are toolkit independent, so I'll use "T" rather than stress a toolkit. I'll use RDKit as the back-end for these examples:
>>> from chemfp import rdkit_toolkit as T

The type signature is:

parse_molecule(content, format, id_tag=None, reader_args=None, errors='strict')
In the simplest case I'll parse a SMILES record to get an RDKit molecule object:
>>> mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x104521360>
If you use the openeye_toolkit you would get an OEGraphMol, and openbabel_toolkit returns an OBMol.

You might have noticed that the first two parameters are in reverse order from cinfony's readstring(), which take the format in the first position instead of the second. This is because the parse_molecules() parameters parallel chemfp's read_molecules() parameters:

read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None)
The format there is second because the default of None means to auto-detect the format based on the source. (If the source is a filename then detection is based on the extension. I'll go into more details in another essay.)

In comparison, cinfony readfile() takes (format, filename), and doesn't auto-detect the format. (A future version could do that, but it would require a format like "auto" or None, which I thought was less elegant than being able to use the default format.) I wanted read_molecules() to support auto-detection, which meant the format had to be in the second position, which is why parse_molecule() takes the format in the second position.

parse_molecule() always returns a new molecule

This function promises to return a new molecule object each time. Compare to OEChem or Open Babel, which parse into an existing molecule object.

Those two toolkits reuse the molecule for performance reasons; clearing is faster than deallocating and reallocating a molecule object. My experience is that in practice molecule reuse is error-prone because it's easy to forget to clear the molecule, or save multiple results to a list and be surprised that they all end up with the same molecule.

I agree that performance is important. I chose a different route to get there. I noticed that even if the molecule were reused, there would still be overhead in calling parse_molecule() because it has to validate or at least interpret the function parameters. These parameters rarely change, that validation is unneeded overhead.

What I ended up doing was making a new function, make_id_and_molecule_parser(), which take the same parameters as parse_molecule(), except leaving out 'content'. It returns a specialized parser function which only takes one parameter, the content, and returns the (id, molecule) pair. (In a future release I may also have a make_molecule_parser() which only returns the molecule, but that's too much work for now.)

This new function is designed for performance, and is free to reuse the molecule object. Functions which return functions are relatively unusual, and I think only more advanced programmers will use it, which makes it less likely that people will experience the negative problems.

It's still possible, so in the future I may add an option to make_molecule_parser() to require a new molecule each time.

Format names

The second parameter is the format name or Format object. For now I'll only consider format names.

mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")

This was a surprisingly tricky to get right, because a SMILES record and a SMILES string are different things, and because toolkits differ in what a SMILES record means. When someone says "takes a SMILES as input", does that mean to take a record or a string?

To clarify, a SMILES file contains SMILES records. A SMILES record is a SMILES string followed by a whitespace, followed by a title/id. Some toolkits take Daylight's lead and say that the title goes to the end of the line. Others interpret the file as a whitespace or space or tab separated file; or even an arbitrary character delimited file. Some also support a header in the first line. Finally, SMILES records are newline terminated, although that's not always required for the last record. (I'll come back to this in a bit.)

I decided to use different format names for these two cases. The "smi" format refers to a SMILES record, and the "smistring" format refers to just the SMILES string.

Björn Grüning pointed out that a similar issue exists with InChI. There's the InChI string, but Open Babel and some other tools also support an "InChI file" with the InChI string as the first term, followed by a whitespace, followed by some title or identifier, as well as an "InChIKey file", using the InChIKey instead of the InChI.

Thus, chemfp has "inchistring" and "inchi", and "inchikeystring" and "inchikey", in parallel with the "smistring"/"smi" distinction.

The other formats, like "sdf" and "pdb", are only record-oriented and don't have the same dual meaning as SMILES and InChI.

Compressed formats

A longer term chemfp idea is to extend the binary format to store record data. My current experiments use SQLite for the records and FPB files for the fingerprints.

Uncompressed SDF records take up a lot of space, and compress well using standard zlib compression. The file-based I/O function support format names like "smi.gz". I extended the idea to support zlib-compressed records, like "sdf.zlib".

Output format names: different types of SMILES

The SMILES output format names are also tricky. This needs a bit of history to understand fully. Daylight toolkit introduced SMILES strings, but the original syntax did not support isotopes or chirality. These were added latter, as so-called "isomeric SMILES". Daylight and nearly every toolkit since maintained that separation, where a "SMILES" output string (either canonical or non-canonical) was not isomeric, and something different needed to be done to get an isomeric SMILES.

This was a usability mistake. Most people expect that when the input is 238U then the output SMILES will be "[238U]". I know, because I've probably made that mistake a dozen times in my own code. On the plus side, it's usually very easy to detect and fix. On the minus side, I've only rarely needed canonical non-isomeric SMILES, so the default ends up encouraging mistakes.

OEChem 2.0 decided to break the tradition and say that "smi" refers to canonical isomeric SMILES, which is what most expect but didn't get, that "can" refers to canonical non-isomeric SMILES (this is unchanged), and that "usm" is the new term for non-canonical, non-isomeric SMILES.

It's a brillantly simple solution to a usability problem I hadn't really noticed before they solved it. This made so much sense to me that I changed chemfp's new I/O API to use those format names and meanings. I hope others also follow their lead.

That's why "smi" as a chemfp output format means canonical isomeric SMILES, "can" means canonical non-isomeric, and "usm" means non-canonical non-isomeric. The corresponding string formats are "smistring", "canstring", and "usmstring". Here they are in action:

>>> mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
>>> T.create_string(mol, "smi")
'[16OH]c1ccccc1 phenol\n'
>>> T.create_string(mol, "can")
'Oc1ccccc1 phenol\n'
>>> T.create_string(mol, "usm")
'c1ccccc1O phenol\n'
>>> T.create_string(mol, "smistring")

id_tag, parse_id_and_molecule(), and titles

The parse_molecule() function only really uses the "id_tag" parameter to improve error reporting - the error message will use the 'id_tag's value rather than the default title for a record.

The id_tag parameter exists because developers in Hinxton decided to put the record identifier in a tag of an SD file instead of placing it in the record title like the rest of the world. As a result, many cheminformatics tools stumble a bit with the ChEBI and ChEMBL datasets, which either have a blank title or the occasional non-blank but useless title like:

Structure #1
Untitled Document-1
I've decided that this is a bug, but despite years of complaints, they haven't changed, so chemfp has to work around it.

Here's an SD record from ChEBI that you can copy and paste:

  Marvin  02030822342D          

  2  1  0  0  0  0            999 V2000
   -0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
M  STY  1   1 SUP
M  SAL   1  2   1   2
M  SMT   1 O2
> <ChEBI ID>

> <ChEBI Name>

> <Star>

I'll assign that to the variable 'sdf_record', parse it, and show that the title is empty, though the identifier is available as the "ChEBI ID" tag.
>>> from chemfp import rdkit_toolkit as T
>>> sdf_record = """\
...   Marvin  02030822342D          
...   2  1  0  0  0  0            999 V2000
...    -0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
              ... many lines deleted ...
... $$$$
... """
>>> mol= T.parse_molecule(sdf_record, "sdf")
[00:49:59]  S group SUP ignored on line 8
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x1043de4b0>
>>> T.get_id(mol)
>>> T.get_tag(mol, "ChEBI ID")
(The "get_id" and "get_tag" methods are part of the new molecule API. I'll discuss them in a future essay.)

I found that I often want to get both the identifier and the molecule. Rather than use a combination of parse_molecule() and get_id()/get_tag() to get that information, I created the parse_id_and_molecule() function, which returns the 2-element tuple of (id, molecule), whose id_tag parameter specifies where to find the id.

> > > T.parse_id_and_molecule(sdf_record, "sdf", id_tag="ChEBI ID")
[00:50:37]  S group SUP ignored on line 8
('CHEBI:15379', <rdkit.Chem.rdchem.Mol object at 0x104346a60>)
If the id_tag is None, then it uses the appropriate default for the given format; the title for SD records. Otherwise it contains the name of the tag with the title. (If there are multiple tags with the same name then the choice of which tag to use is made arbitrarily.)

I can imagine that some people might place an identifier in a non-standard location for other formats. If that happens, then there will likely be an id_tag syntax for that format.

For example, the closest I've come up with is a SMILES variant where the id you want is in the third column. In that case the id_tag might be "#3". If the column is labeled "SMILES" then an alternate would be "@SMILES", or if you know the column name doesn't start with a '#' or '@' then simply "SMILES".

However, that's hypothetical. I haven't seen it in real life. What I have seen are CSV files, where the SMILES and id columns are arbtirary, and which may have Excel-specific delimiter and quoting rules. Chemfp doesn't currently support CSV files and that will likely be handled through reader_args.

Handling whitespace in a SMILES record

The SMILES record format is not portable across toolkits. According to Daylight, and followed by OEChem, the format is the SMILES string, a single whitespace, and the rest of the line is the title. This title might be a simple alphanumeric id, or an IUPAC name with spaces in it, or anything else.

Other toolkits treat it as a character delimited set of columns. For example, RDKit by default uses the space character as the delimiter, though you can change that to tab, comma, or other character.

This is a problem because you might want to generate fingerprints from a SMILES file containing the record:

C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a
In one toolkit you'll end up with the identifier "vitamin" and with another toolkit get the identifier "vitamin A".

My experience is that most people treat a SMILES file as a space or tab delimited set of columns, with a strong perference for the space character. This isn't universally true. Roger Sayle pointed out to me that the formal IUPAC name is a perfectly reasonable unique and canonical identifer, which can have a space in it. The Daylight interpretation supports IUPAC names, while the space delimited version does not.

There is no perfect automatic solution to this ambiguity. Instead, chemfp lets you specify the appropriate delimiter type using the "delimiter" reader argument. The supported delimiter type names are "whitespace", "tab", "space", and "to-eol", as well as the literal characters " " and "\t". The default delimiter is "whitespace", because most of the people I work with think of a SMILES file more like a whitespace delimited file.

Here's an example of how the default doesn't like "vitamin A", but where the "to-eol" delimiter handles it correctly:

>>> smiles = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a\n"
>>> T.parse_id_and_molecule(smiles, "smi")
('vitamin', <rdkit.Chem.rdchem.Mol object at 0x10bbf39f0>)
>>> T.parse_id_and_molecule(smiles, "smi",
...              reader_args={"delimiter": "to-eol"})
('vitamin a', <rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0>)
as well as an example of the difference between the "tab" and "to-eol" delimiter types:
>>> T.parse_id_and_molecule("C\tA B\tC\t", "smi",
...              reader_args={"delimiter": "tab"})
('A B', <rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0>)
>>> T.parse_id_and_molecule("C\tA B\tC\t", "smi",
...              reader_args={"delimiter": "to-eol"})
('A B\tC\t', <rdkit.Chem.rdchem.Mol object at 0x10bbf39f0>)

It was a bit of work to make the different toolkits work the same way, and my best attempt isn't perfect. For example, if you are daft and try to interpret the record "C Q\tA" as a tab-delimited set of columns, then OEChem will see this as methane with an id of "Q" while RDKit and Open Babel will say they can't parse the SMILES "C Q".

So don't do that!


As you saw, reader_args is a Python dictionary. All of the SMILES parsers accept the 'delimiter' argument, and the RDKit and Open Babel reader_args also support the "has_header" argument. If true, the first line of the file contains a header. (I couldn't think of a good implementation of this for OEChem.)

There are also toolkit-specific reader_args. Here I'll disable RDKit's sanity checker for SMILES, and show that it accepts a SMILES that it would otherwise reject.

>>> T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": False})
<rdkit.Chem.rdchem.Mol object at 0x104407440>
>>> T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": True})
[22:09:40] Can't kekulize mol 

Traceback (most recent call last):
  File "<stdin>", line 1, in 
  File "chemfp/rdkit_toolkit.py", line 251, in parse_molecule
    return _toolkit.parse_molecule(content, format, id_tag, reader_args, errors)
  File "chemfp/base_toolkit.py", line 986, in parse_molecule
    id_tag, reader_args, error_handler)
  File "chemfp/base_toolkit.py", line 990, in _parse_molecule_impl
    id, mol = format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 1144, in parse_id_and_molecule
    error_handler.error("RDKit cannot parse the SMILES string %r" % (terms[0],))
  File "chemfp/io.py", line 77, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES string 'c1ccccc1#N'

The SMILES parsers use different reader_args than the SDF parser. You can see the default reader_args by using the toolkit's format API:

>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.get_format("smi").get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> rdkit_toolkit.get_format("sdf").get_default_reader_args()
{'strictParsing': True, 'removeHs': True, 'sanitize': True}
Also, the different toolkits may use different reader_args for the same format.
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.get_format("sdf").get_default_reader_args()
{'flavor': None, 'aromaticity': None}
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.get_format("sdf").get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
I'll cover more about the format API in another essay.


This can lead to a problem. You saw earlier how to get the correct toolkit for a given fingerprint type. Once you have the toolkit you can parse a record into a toolkit-specific molecule. But what if you want toolkit-specific and format-specific settings?

First off, parsers ignore unknown reader_arg names, and the reader_arg names for the different toolkits are different, except for "delimiter" and "has_header" where it doesn't make sense for them to be different. That means you could do:

reader_args = {
  "delimiter": "tab",       # for SMILES files
  "strictParsing": False,   # for RDKit SDF
  "perceive_stereo": True,  # for Open Babel SDF
  "aromaticity": "daylight, # for all OEChem readers
and have everything work.

Still, it's possible that you want OEChem to parse a SMILES using "daylight" aromaticity and an SD file using "openeye" aromaticity.

The reader_args are namespaced, so for that case you could use a format qualifier, like this:

reader_args = {
  "smi.aromaticity": "daylight",
  "can.aromaticity": "daylight",
  "usm.aromaticity": "daylight",
  "sdf.aromaticity": "openeye",
There's also a toolkit qualifier. In this daft example, the OpenEye reader uses the whitespace delimiter option for SMILES files, the RDKit SMILES and InChI readers uses tab, and the Open Babel SMILES and InChI readers use the space character.
reader_args = {
  "openeye.*.delimiter": "whitespace",
  "rdkit.*.delimiter": "tab",
  "openbabel.*.delimiter": "space",
Fully qualified names, like "openeye.smi.delimiter" are also allowed.

The remaining problem is how to configure the reader_args. It's no problem to write the Python dictionary yourself, but what if you want people to pass them in as command-line arguments or in a configuration file? I'll cover that detail when I talk about the format API.


What happens if the SMILES isn't valid? I'll use my favorite invalid SMILES:

>>> T.parse_id_and_molecule("Q q-ane", "smi")
[22:01:37] SMILES Parse Error: syntax error for input: Q
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/rdkit_toolkit.py", line 264, in parse_id_and_molecule
    return _toolkit.parse_id_and_molecule(content, format, id_tag, reader_args, errors)
  File "chemfp/base_toolkit.py", line 1014, in parse_id_and_molecule
    id_tag, reader_args, error_handler)
  File "chemfp/base_toolkit.py", line 1018, in _parse_id_and_molecule_impl
    return format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 249, in parse_id_and_molecule
    error_handler.error("RDKit cannot parse the SMILES %r" % (smiles,))
  File "chemfp/io.py", line 77, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES 'Q'
A chemfp.ParseError is also a ValueError, so you can check for this exception like this:
    mol = T.parse_molecule("Q", "smistring")
except ValueError as err:
    print("Not a valid SMILES")

I think an exception is the right default action when there's an error, but various toolkits disagree. Some will return a None object in that case, and I can see the reasoning. It's sometimes easier to write things more like:

>>> mol = T.parse_molecule("Q", "smistring", errors="ignore")
[00:04:26] SMILES Parse Error: syntax error for input: Q
>>> print(mol)

The default errors value is "strict", which raises an exception. The "ignore" behavior is to return None when the function can't parse a SMILES.

In the above example, the "SMILES Parse Error" messages come from the the underlying RDKit C++ library. It's difficult to capture this message from Python. Chemfp doesn't even try. OEChem and Open Babel have similar error messages. Chemp doesn't try to capture those either.

Error handling control for a single record isn't so important. It's more important when parsing multiple records, where "ignore" skips a record and keeps processing and "strict" raises an exception and stops processing once a record cannot be parsed. I'll cover that in the essay where I talk about reading molecules.

Other error handling - experimental!

The errors parameter can be two other strings: "report" and "log". The "report" option writes the error information to stderr, like this:

>>> T.parse_molecule("Q", "smi", errors="report")
[03:05:52] SMILES Parse Error: syntax error for input: Q
ERROR: RDKit cannot parse the SMILES 'Q'. Skipping.
The "SMILES Parse Error", again, is generated by the RDKit C++ level going directly to stderr. The "RDKit cannot parse the SMILES" line, however, is chemfp sending information to Python's sys.stderr.

I don't think the "report" option is all that useful for this case since it's pretty much a duplicate of the underlying C++ toolkit. It does know about the id_tag, so it gives better error message for ChEBI and ChEMBL files.

The "log" option is the same as "report" except that it sends the message to the "chemfp" logger using Python's built-in logging system. I have very little experience with logging, so this is even more experimental than "report".

Finally, the "errors" parameter can take an object to use as the error handler. The idea is that the handler's error() is called when there's an error. See chemfp/io.py for how it works, but consider this to be undocumented, experiemental, and almost certain to change.

I have the idea that the error handler can have a warn() method, which would be called for warnings. The current generation of toolkits uses a global error handler or sends the message directly to stderr. In a multi-threaded system, which might parse two molecules at the same time in different threads, it's hard to know which molecule caused the error.

I haven't done this because ... it's hard using the current generation of tools to get that warning information. I'm also unsure about the error handler protocol. How does it know that it's collected all of the warnings for a given molecule. Is there a special "end_of_warnings()" method? Are the warning accumulated and sent in one call? What if there are both warnings and errors?

That's why this subsection it titled "experimental". :)

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB