
MACCS key 44

The MACCS 166 keys are one of the mainstay fingerprints of cheminformatics, especially regarding molecular similarity. It's rather odd, really, since they were developed for substructure screening and not similarity. I suppose that Jaccard would agree that any relatively diverse feature vector can likely be used to measure similarity, whether it be Alpine biomes or chemical structures.

Here's a bit of dirty laundry that you'll not read in the literature. There are a lot of MACCS implementations, and they don't agree fully with each other. The differences are likely small, but as far as I can tell, no one has really investigated if it's a problem, or noted that it might be a problem.

I'll structure the explanation around key #44. What is the definition of key 44?

To start, there is no publication describing the MACCS 166 public keys. All of the citations for it either say a variation of "MDL did it" or cite the 2002 paper which reoptimized the keys for similarity ([PDF]). Thing is, just about everyone uses the "unoptimized" definitions, so this is, technically, the wrong citation. (Why do people use it? Tradition, perhaps, or because it feels better to have a real citation rather than a nebulous one.)

Instead, the definitions appear to have come from ISIS/Base, and have been passed around from person to person through informal means. I haven't used the MDL software and can't verify the source myself. There's a relatively recent whitepaper from Accelrys titled "The Keys to Understanding MDL Keyset Technology" which says they are defined in the file "eksfil.dat". A Google search finds 8 results for "eksfil.dat". All are tied to that white paper. The PDF has creation and modification dates of 31 August 2011, and Archive.org first saw that URL on 11 October 2011.

It's easy to see that the reoptimization fingerprint is not the same as the 166 keys that everyone uses. You'll find that many places say that key 44 is defined as "OTHER". Table 5 of the reoptimization paper has an entry for '"other" atom type', but there's nothing which assigns it to key 44. You can't even try to infer some sort of implicit ordering because the previous entry in table 5 is "isotope", which is key 1 in the MACCS 166 keys, and two entries later is "halogen", which is key 134.

If you cite Durant, Leland, Henry, and Nourse (2002) as your reference for the MACCS 166 bit public keys then you are doing your readers a disservice: that paper defines different fingerprints than the ones you used. Just go ahead and cite "MACCS keys. MDL Information Systems", and if the reviewer complains that it's a bad citation, point them to this essay and ask them for the correct one. Then tell me what they said. If Accelrys complains, then they need to suggest the correct citation and put it in their white paper. Even better would be a formal publication and a validation suite. (I can dream, can't I?)

In practice, many people use the MACCS keys as interpreted by the implementers of some piece of software. I wrote "interpreted by" because "implemented by" is too strong. There are ambiguities in the definition, mistakes in the implementations, and differences in chemical interpretation, compounded by the lack of any sort of comprehensive validation suite.

Let's take key 44, "OTHER". Remember how the definition comes from an internal MDL data file? What does "OTHER" mean? RDKit defines it as '?' in MACCSkeys.py to indicate that it has no definition for that key. That line has a commit date of 2006-05-06. RDKit's lack of a definition is notable because Open Babel, CDK, a user contributed implementation for ChemAxon and many others reuse the RDKit SMARTS definitions. All of them omit key 44.

Others have implemented key 44. TJ O'Donnell, in "Design and Use of Relational Databases in Chemistry" (2009) defines it as the SMARTS [!#6!#7!#8!#15!#16!#9!#17!#35]. MayaChemTools defines it in code as an atom with element number in "1|6|7|8|9|14|15|16|17|35|53". (See _IsOtherAtom.)

These are the ones where I have access to the source and could investigate without much effort.

Both the whitepaper and the reoptimization paper define what "other" means, and the whitepaper does so specifically in the context of the MACCS 166 keys. It says:

"Other" atoms include any atoms other than H, C, N, O, Si, P, S, F, Cl, Br, and I, and is abbreviated "Z".
This appears definite and final. Going back to the three different implementation geneologies, RDKit and its many spinoffs don't have a definition so by definition isn't correct. O'Donnell's is close, but the SMARTS pattern omits hydrogen, silicon, and iodine. And MayaChemTools gets it exactly correct.

Good job, Manish Sud!
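
To make that concrete, here's the whitepaper definition translated into a SMARTS, with a quick RDKit check. This is my own translation for illustration, not an official MDL pattern:

from rdkit import Chem

# "Other" = any atom which is not H, C, N, O, Si, P, S, F, Cl, Br, or I.
other_atom = Chem.MolFromSmarts("[!#1!#6!#7!#8!#9!#14!#15!#16!#17!#35!#53]")

phenol = Chem.MolFromSmiles("Oc1ccccc1")
gold = Chem.MolFromSmiles("[Au+]")  # the kind of "other" atom found in auranofin

print(phenol.HasSubstructMatch(other_atom))  # False - no "other" atoms
print(gold.HasSubstructMatch(other_atom))    # True - gold is an "other" atom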

Are these MACCS variations really a problem?

No. Not really. Well, maybe. It depends on who you are.

When used for similarity, a dead bit just makes things more similar because there are fewer ways to distinguish between molecules. Key 44 is also rare: only a handful of molecules contain "other" atoms (like the gold in auranofin), so when characterizing a database it's likely fine.

You don't need to trust my gut feeling. You can read the RDKit documentation and see "The MACCS keys were critically evaluated and compared to other MACCS implementations in Q3 2008. In cases where the public keys are fully defined, things looked pretty good."

Okay, so you're hesitant about the keys which aren't "fully defined"? No need to despair. Roger Sayle ported the RDKit patterns (still without key 44) over to ChemAxon, and reported:

This work is heavily based upon the previous implementation by Miklos Vargyas, and the SMARTS definitions developed and refined by Greg Landrum and Andrew Dalke. This implementation achieves ~65% on the standard Briem and Lessel benchmark, i.e. almost identical to the expected value for MACCS keys reported in the literature by MDL and others.

(NB: All I did was proofread the RDKit SMARTS and find a few places that needed fixing.)

The MACCS 166 keys are a blunt tool, designed for substructure search and repurposed for similarity more because they were already present and easy to generate than because they were designed for it. 2D similarity search is another blunt tool. That's not to say they are horrible or worthless! A rock is a blunt tool for making an ax, but we used stone axes quite effectively throughout the Neolithic.

Just don't treat the MACCS 166 keys as a good luck charm, or as some sort of arcane relic passed down by the ancients. There are limitations in the definition and limitations in the implementation. Different tools will give different answers, and if you don't understand your tools they may turn on you.

And when you write a paper, be honest with your readers. If you are using the RDKit implementation of the MACCS keys, or a derived version in another toolkit (and assuming they haven't been changed since I wrote this essay), point out that you are only using 164 of those 166 bits.

Homework assignment

For a warmup exercise, what is the other unimplemented bit in the RDKit MACCS definition?

For your homework assignment, use two different programs to compute the MACCS keys for a large data set and see 1) how many bits are different? (e.g., the sum of the Manhattan distance between the fingerprints for each record, or come up with a better measure), 2) how many times does the nearest neighbor change?, and 3) (bonus points) characterize how often those differences are because of differences in how to interpret a key and how often it's because of different toolkit aromaticity/chemistry perception methods.
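
To get started on part 1, here's a minimal sketch for counting the differing bits, assuming you already have a record's two fingerprints as byte strings of equal length:

def bit_difference(fp1, fp2):
    # XOR the corresponding bytes and count the bits which are set.
    return sum(bin(b1 ^ b2).count("1")
               for b1, b2 in zip(bytearray(fp1), bytearray(fp2)))

Sum that over every record to get the total Manhattan distance between the two sets of fingerprints.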

I expect a paper in a journal by the end of next year. :).

(Then again, for all I know this is one of those negative results papers that's so hard to publish. "9 different MACCS key implementations produce identical MACCS keys!" doesn't sound exciting, does it?)



chemfp's format API

This is part of a series of essays describing the reasons behind chemfp's new APIs. This one, about the new format API, is a lot shorter than the previous one on parse_molecule() and parse_id_and_molecule().

It's sometimes useful to know which cheminformatics formats are available, if only to display a help message or pulldown menu. The get_formats() toolkit function returns a list of available formats, as 'Format' instances.

>>> from chemfp import rdkit_toolkit as T
>>> T.get_formats()
[Format('rdkit/canstring'), Format('rdkit/inchikey'),
Format('rdkit/usmstring'), Format('rdkit/smistring'),
Format('rdkit/molfile'), Format('rdkit/usm'),
Format('rdkit/inchikeystring'), Format('rdkit/sdf'),
Format('rdkit/can'), Format('rdkit/smi'), Format('rdkit/inchi'),
Format('rdkit/rdbinmol'), Format('rdkit/inchistring')]
(The next version of chemfp will likely support RDKit's relatively new PDB reader.)

You can ask a format for its name, or see if it is an input format or output format by checking respectively "is_input_format" and "is_output_format". If you just want the list of input formats or output formats, use get_input_formats() or get_output_formats().

Here's an example to show which output formats are not also input formats:

>>> [format.name for format in T.get_output_formats() if not format.is_input_format]
['inchikey', 'inchikeystring']

You may recall that some formats are record-based and others are, for lack of a better word, "string-based". The latter include "smistring", "inchistring", and "inchikeystring". These are not records in their own right, so they can't be read from or written to a file.

I really couldn't come up with a good predicate which described those formats. The closest was "is_a_record". I ended up with "supports_io". I'm not happy with the name. If true, the format can be used in file I/O.

The RDKit input formats which do not support I/O are the expected ones ... and rdbinmol.

>>> [format.name for format in T.get_input_formats() if not format.supports_io]
['canstring', 'usmstring', 'smistring', 'molfile', 'rdbinmol', 'inchistring']
(The "rdbinmol" is an experimental format. It's the byte string from calling an RDKit molecule's "ToBinary()" method, which is also the basis for its pickle support.)

get_format() and compression

You can get a specific format by name using get_format(). This can also be used to specify a compressed format:

>>> T.get_format("sdf")
Format('rdkit/sdf')
>>> T.get_format("smi.gz")
Format('rdkit/smi.gz')
>>> format = T.get_format("smi.gz")
>>> format
Format('rdkit/smi.gz')
>>> format.name
'smi'
>>> format.compression
'gz'

Default reader and writer arguments

Toolkit- and format-specific arguments were a difficult challenge. I want chemfp to support multiple toolkits, because I know people work with fingerprints from multiple toolkits. Each of the toolkits has its own way to parse and generate records. I needed some way to have a common API but with a mechanism to control the underlying toolkit options.

The result is reader_args, which I discussed in the previous essay, and the writer_args complement for turning a molecule into a record.

A Format instance can be toolkit specific; the "rdkit/smi.gz" is an RDKit format. (The toolkit name is available from the aptly named attribute 'toolkit_name'.) Each Format has a way to get the default reader_args and writer_args for the format:

>>> format = T.get_format("smi.gz")
>>> format
Format('rdkit/smi.gz')
>>> format.get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> format.get_default_writer_args()
{'isomericSmiles': True, 'delimiter': None, 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
This is especially useful if you are on the interactive prompt and have forgotten the option names.

Convert text settings into arguments

The -R command-line options for the chemfp tools rdkit2fps, ob2fps, and oe2fps let users set the reader_args. If your target molecules are in a space-delimited SMILES file then you can set the 'delimiter' option to 'space':

oe2fps -R delimiter=space targets.smi.gz -o targets.fps
or ask RDKit to disable sanitization using:
rdkit2fps -R sanitize=false targets.smi.gz -o targets.fps
The -R takes string keys and values. On the other hand, reader_args take a dictionary with string keys but possibly integers and booleans as values. You could write the converter yourself, but that gets old very quickly. Instead, I included it as the format's get_reader_args_from_text_settings(). (The *2fps programs don't generate structure output, but if they did, the equivalent command-line flag would be -W, and the equivalent format method is get_writer_args_from_text_settings().)

Yes, I agree that get_..._settings() is a very long name. I couldn't think of a better one. I decided that "text settings" are the reader_args and writer_args expressed as a dictionary with string names and string values.

I'll use that long named function to convert some text settings into proper reader_args:

>>> format.get_reader_args_from_text_settings({
...    "delimiter": "tab",
...    "sanitize": "false",
... })
{'delimiter': 'tab', 'sanitize': False}
You can see that the text "false" was converted into the Python False value.

Namespaces

Names like "delimiter" and "sanitize" are 'unqualified' and apply for every toolkit and every format which accept them. This makes sense for "delimiter" because it's pointless to have OEChem parse a SMILES file using a different delimiter style than RDKit. It's acceptable for "sanitize" because only RDKit knows what it means, and the other toolkits will ignore unknown names. For many cases then you could simply do something like:

reader_args = {
  "delimiter": "tab",       # for SMILES files
  "strictParsing": False,   # for RDKit SDF
  "perceive_stereo": True,  # for Open Babel SDF
  "aromaticity": "daylight, # for all OEChem readers
}

At the moment the toolkits all use different option names for the same format, so there's no conflict there. But toolkits do use the same name for options on different formats, and there can be a good reason why the value for a SMILES output should differ from the value for an SDF record output.

The best example is OEChem, which uses a "flavor" flag to specify the input and output options for all formats. (For chemfp I decided to split OEChem's flavor into 'flavor' and 'aromaticity' reader and writer arguments. I leave that discussion for elsewhere.) I'll start by making an OEGraphMol.

from chemfp import openeye_toolkit

phenol = "c1ccccc1[16OH]"
oemol = openeye_toolkit.parse_molecule(phenol, "smistring")
Even though "smistring" output by default generates the canonical isomorphic SMILES for the record, I can ask it to generate a different output flavor. For convience, the flavor value can be an integer, which is treated as the flavor bitmask, or it can be a string of "|" or "," separated bitmask names. Usually the bitmask names are or'ed together, but a leading "-" means to unset the corresponding bits for that flag.
>>> openeye_toolkit.create_string(oemol, "smistring")
'c1ccc(cc1)[16OH]'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Default"})
'c1ccc(cc1)[16OH]'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Default,-Isotopes"})
'c1ccc(cc1)O'
>>> openeye_toolkit.create_string(oemol, "smistring",
...      writer_args={"flavor": "Canonical|Kekule|Isotopes"})
'C1=CC=C(C=C1)[16OH]'
Here I'll ask for the SDF record output in V3000 format. (In the future I expect to have a special "sdf3" or "sdf3000" format, to make it easier to specify V3000 output across all toolkits.)
>>> print(openeye_toolkit.create_string(oemol, "sdf",
...        writer_args={"flavor": "Default|MV30"}))

  -OEChem-09261411132D

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 7 7 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 C 0 0 0 0
M  V30 3 C 0 0 0 0
M  V30 4 C 0 0 0 0
M  V30 5 C 0 0 0 0
M  V30 6 C 0 0 0 0
M  V30 7 O 0 0 0 0 MASS=16
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 6
M  V30 2 1 1 2
M  V30 3 2 2 3
M  V30 4 1 3 4
M  V30 5 2 4 5
M  V30 6 1 5 6
M  V30 7 1 6 7
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

What's the problem?

One problem comes when I want to configure chemfp to use one flavor if the output is SMILES and another flavor if the output is SDF. You could construct a table of format-specific writer_args, like this:

writer_args_by_format = {
  "smi": {"flavor": "Canonical|Kekule|Isotopes", "aromaticity": "openeye"},
  "sdf": {"flavor": "Default|MV30", "aromaticity": "openeye"},
    ...
}

record = T.create_string(mol, format,
           writer_args = writer_args_by_format[format])
but not only is that tedious, it doesn't handle toolkit-specific options. Nor is there an easy way to turn the text settings into this data structure.

Qualified names

Instead, the reader_args and writer_args accept "qualified" names, which can be format-specific like "sdf.flavor", toolkit-specific like "openeye.*.aromaticity", or both, like "openeye.sdf.aromaticity".

A cleaner way to write the previous example is:

writer_args = {
  "smi.flavor": "Canonical|Kekule|Isotopes",
  "sdf.flavor": "Default|MV30",
  "aromaticity": "openeye",   # Use the openeye aromaticity model for all formats
    ...
}

record = T.create_string(mol, format, writer_args = writer_args)
or if you want to be toolkit-specific, use "openeye.smi.flavor", "openeye.sdf.flavor" and "openeye.*.aromaticity", etc.

Precedence

You probably noticed there are many ways to specify the same setting, as in the following:

reader_args = {
  "delimiter": "tab",
  "openeye.*.delimiter": "whitespace",
  "smi.delimiter": "space",
}
The chemfp precedence goes from most-qualified name to least-qualified, so for this case the search order is:
openeye.smi.delimiter
openeye.*.delimiter
smi.delimiter
delimiter

How to convert qualified names into unqualified names

The Format object's get_unqualified_reader_args() converts a complicated reader_args dictionary which may contain qualified names into a simpler reader_args dictionary with only unqualified names, and only the names appropriate for the format. It's used internally to simplify the search for the right name, and it's part of the public API so you can check whether your qualifiers are working correctly. I'll give an example of debugging in a moment.

Here's an example which shows that the previous 'reader_args' example, with several delimiter specifications, is resolved to the 'whitespace' delimiter style.

>>> from chemfp import openeye_toolkit
>>> 
>>> reader_args = {
...   "delimiter": "tab",
...   "openeye.*.delimiter": "whitespace",
...   "smi.delimiter": "space",
... }
>>> 
>>> format = openeye_toolkit.get_format("smi")
>>> format.get_unqualified_reader_args(reader_args)
{'delimiter': 'whitespace', 'flavor': None, 'aromaticity': None}
You can see that it also fills in the default values for unspecified arguments. Note that this function does not validate values. It's only concerned with resolving the names.

The equivalent method for writer_args is get_unqualified_writer_args() - I try to be predictable in my APIs.

This function is useful for debugging because it helps you spot typos. Readers ignore unknown arguments, so if you type "opneye" instead of "openeye" then it just assumes that you were talking about some other toolkit.

If you can't figure out why your reader_args or writer_args aren't being accepted, pass them through the 'unqualified' method and see what it gives:

>>> format.get_unqualified_reader_args({"opneye.*.aromaticity": "daylight"})
{'delimiter': None, 'flavor': None, 'aromaticity': None}

Qualified names and text settings

The Format object also supports qualifiers in the reader and writer text_settings and applies the same search order to give the unqualified reader_args.

>>> format.get_reader_args_from_text_settings({
...    "sanitize": "true",
...    "rdkit.*.sanitize": "false",
... })
{'sanitize': False}

Errors in the text settings

The get_reader_args_from_text_settings() and get_writer_args_from_text_settings() methods will validate the values as much as they can, and raise a ValueError with a helpful message if that fails.

>>> from chemfp import openeye_toolkit
>>> sdf_format = openeye_toolkit.get_format("sdf")
>>> sdf_format.get_writer_args_from_text_settings({
...   "flavor": "bland",
... })
Traceback (most recent call last):
  File "", line 2, in 
  File "chemfp/base_toolkit.py", line 407, in get_writer_args_from_text_settings
    return self._get_args_from_text_settings(writer_settings, self._format_config.output)
  File "chemfp/base_toolkit.py", line 351, in _get_args_from_text_settings
    % (self.toolkit_name, name, value, err))
ValueError: Unable to parse openeye setting flavor ('bland'): OEChem sdf format does not support the 'bland' flavor option. Available flavors are: CurrentParity, MCHG, MDLParity, MISO, MRGP, MV30, NoParity

File format detection based on extension

All of the above assumes you know the file format. Sometimes you only know the filename, and want to determine (or "guess") the format based on its extension. The file "abc.smi" is a SMILES file, the file "xyz.sdf" is an SD file, and "xyz.sdf.gz" is a gzip-compressed SD file.

The toolkit function get_input_format_from_source() will try to determine the format for an input file, given the source filename:

>>> from chemfp import openbabel_toolkit as T
>>> T.get_input_format_from_source("example.smi")
Format('openbabel/smi')
>>> T.get_input_format_from_source("example.sdf.gz")
Format('openbabel/sdf.gz')
>>> format = T.get_input_format_from_source("example.sdf.gz")
>>> format.get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
The equivalent for output files is get_output_format_from_destination().

The main difference between the two is that get_input_format_from_source() will raise an exception if the format is known but not supported as an input format, and get_output_format_from_destination() will raise an exception if the format is known but not supported as an output format.

>>> T.get_input_format_from_source("example.inchikey")
Traceback (most recent call last):
  File "", line 1, in 
  File "chemfp/openbabel_toolkit.py", line 109, in get_input_format_from_source
    return _format_registry.get_input_format_from_source(source, format)
  File "chemfp/base_toolkit.py", line 606, in get_input_format_from_source
    format_config = self.get_input_format_config(register_name)
  File "chemfp/base_toolkit.py", line 530, in get_input_format_config
    % (self.external_name, register_name))
ValueError: Open Babel does not support 'inchikey' as an input format

The format detection functions actually take two arguments, where the second is an optional format name. If it's given, it overrides the detection.

>>> T.get_input_format_from_source("example.inchikey", "smi.gz")
Format('openbabel/smi.gz')
This is meant to simplify the logic that would otherwise lead to code like:
if format is not None:
    format = T.get_input_format(format)
else:
    format = T.get_input_format_from_source(source)

By the way, the source and destination can be None. This tells chemfp to read from stdin or write to stdout. Since stdin and stdout don't have a file extension, what format do they have? My cheminformatics roots started with Daylight, so I decided that the default format is "smi".
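
In other words (sketched output based on the behavior described above, not a captured session):

>>> T.get_input_format_from_source(None)
Format('openbabel/smi')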



chemfp's parse_molecule()

I used several use cases to guide chemfp-1.2 development. One, under the category of "web service friendly", was to develop a web service which takes a structure record and optional format name as input for a k=3 nearest neighbor search of a pre-loaded target fingerprint file. That doesn't sound so hard, except I'll let the fingerprint file determine the fingerprint type based on its 'type' header, which might specify Open Babel, RDKit, or OpenEye's toolkits.

Those toolkits all have different APIs, but I would prefer not to write different code for each case. Chemfp-1.1 can be hacked to handle most of what's needed, because read_structure_fingerprints() will read a structure file and compute fingerprints for each record. The hack is to save the XML-RPC query to a file, since read_structure_fingerprints() only works from a file, not a string.

That isn't a full solution. Chemfp-1.1 doesn't support any sort of structure output, so there would need to be toolkit-specific code to set the tag data and convert the molecule to a record.

For chemfp I decided on the classic approach and made my own toolkit-independent API, with a back-end implementation for each of the supported toolkits. I think it's a great API, and I've been using it a lot for my other internal projects. My hope is that some of the ideas I developed go into other toolkits, or at least influence the design of the next generation of toolkits.

To see a more concrete example, here's that use case implemented as an XML-RPC service using the new chemfp-1.2 API.

from SimpleXMLRPCServer import SimpleXMLRPCServer

import chemfp

# Load the target fingerprints and figure out the fingerprint type and toolkit
targets = chemfp.load_fingerprints("databases/chembl_18_FP2.fps")
fptype = targets.get_fingerprint_type()  # uses the 'type' field from the FPS header
toolkit = fptype.toolkit

print("Loaded %d fingerprints of type %r" % (len(targets), fptype.get_type()))

def search(record, format="sdf"):
    # Parse the molecule and report any failures
    try:
        mol = toolkit.parse_molecule(record, format)
    except ValueError as err:
        return {"status": "error", "msg": str(err)}

    # Compute its fingerprint, search the targets, and get the nearest 3 neighbors
    fp = fptype.compute_fingerprint(mol)
    nearest = targets.knearest_tanimoto_search_fp(fp, k=3, threshold=0.0)

    # Add a tag value like "CHEMBL14060 1.00 CHEMBL8020 0.92 CHEMBL7970 0.92"
    (id1, score1), (id2, score2), (id3, score3) = nearest.get_ids_and_scores()
    toolkit.add_tag(mol, "NEAREST", "%s %.2f %s %.2f %s %.2f" %
                    (id1, score1, id2, score2, id3, score3))

    return {"status": "ok",
            "record": toolkit.create_string(mol, "sdf")}

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_introspection_functions()
server.register_function(search)

if __name__ == "__main__":
    server.serve_forever()

I think this is a clean API, and a bit easier to understand and use than the native toolkit APIs. It's very similar to cinfony, though I think at this level cinfony is a bit easier to understand because it wraps its own Molecule object around the native toolkit molecules, while I leave them as native molecule objects. I have to use helper functions where cinfony can use methods. I did this because I don't want the performance overhead of wrapping and unwrapping for the normal use case of converting a structure file to fingerprints.

I also have more stand-alone objects than cinfony, like my fingerprint type object for fingerprint generation, where cinfony uses a method of the molecule.

parse_molecule()

That example, while illustrative of the new API, isn't significantly better than existing systems. For that, I need to delve into the details, starting with parse_molecule().

Chemfp has three "toolkit" APIs, one for each of the supported toolkits. The usual way to get them is through one of the following:

from chemfp import rdkit_toolkit
from chemfp import openeye_toolkit
from chemfp import openbabel_toolkit
Most of the examples are toolkit independent, so I'll use "T" rather than stress a toolkit. I'll use RDKit as the back-end for these examples:
>>> from chemfp import rdkit_toolkit as T

The type signature is:

parse_molecule(content, format, id_tag=None, reader_args=None, errors='strict')
In the simplest case I'll parse a SMILES record to get an RDKit molecule object:
>>> mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x104521360>
If you use the openeye_toolkit you would get an OEGraphMol, and openbabel_toolkit returns an OBMol.

You might have noticed that the first two parameters are in reverse order from cinfony's readstring(), which takes the format in the first position instead of the second. This is because the parse_molecule() parameters parallel chemfp's read_molecules() parameters:

read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None)
The format there is second because the default of None means to auto-detect the format based on the source. (If the source is a filename then detection is based on the extension. I'll go into more details in another essay.)

In comparison, cinfony readfile() takes (format, filename), and doesn't auto-detect the format. (A future version could do that, but it would require a format like "auto" or None, which I thought was less elegant than being able to use the default format.) I wanted read_molecules() to support auto-detection, which meant the format had to be in the second position, which is why parse_molecule() takes the format in the second position.

parse_molecule() always returns a new molecule

This function promises to return a new molecule object each time. Compare to OEChem or Open Babel, which parse into an existing molecule object.

Those two toolkits reuse the molecule for performance reasons; clearing is faster than deallocating and reallocating a molecule object. My experience is that in practice molecule reuse is error-prone because it's easy to forget to clear the molecule, or save multiple results to a list and be surprised that they all end up with the same molecule.

I agree that performance is important. I chose a different route to get there. I noticed that even if the molecule were reused, there would still be overhead in calling parse_molecule() because it has to validate, or at least interpret, the function parameters. Since these parameters rarely change, that validation is unneeded overhead.

What I ended up doing was making a new function, make_id_and_molecule_parser(), which takes the same parameters as parse_molecule(), except leaving out 'content'. It returns a specialized parser function which takes only one parameter, the content, and returns the (id, molecule) pair. (In a future release I may also have a make_molecule_parser() which only returns the molecule, but that's too much work for now.)
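
For example (a sketch; the object address will of course differ):

>>> parser = T.make_id_and_molecule_parser("smi")
>>> parser("c1ccccc1[16OH] phenol")
('phenol', <rdkit.Chem.rdchem.Mol object at 0x...>)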

This new function is designed for performance, and is free to reuse the molecule object. Functions which return functions are relatively unusual, and I think only more advanced programmers will use it, which makes it less likely that people will run into the reuse problems I described above.

It's still possible, so in the future I may add an option to make_molecule_parser() to require a new molecule each time.

Format names

The second parameter is the format name or Format object. For now I'll only consider format names.

mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")

This was surprisingly tricky to get right, because a SMILES record and a SMILES string are different things, and because toolkits differ in what a SMILES record means. When someone says "takes a SMILES as input", does that mean a record or a string?

To clarify, a SMILES file contains SMILES records. A SMILES record is a SMILES string, followed by whitespace, followed by a title/id. Some toolkits take Daylight's lead and say that the title goes to the end of the line. Others interpret the file as a whitespace or space or tab separated file, or even an arbitrary character delimited file. Some also support a header in the first line. Finally, SMILES records are newline terminated, although that's not always required for the last record. (I'll come back to this in a bit.)

I decided to use different format names for these two cases. The "smi" format refers to a SMILES record, and the "smistring" format refers to just the SMILES string.
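
For example, with the RDKit back-end (a sketch; the object addresses will differ):

>>> T.parse_molecule("c1ccccc1[16OH]", "smistring")
<rdkit.Chem.rdchem.Mol object at 0x...>
>>> T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
<rdkit.Chem.rdchem.Mol object at 0x...>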

Björn Grüning pointed out that a similar issue exists with InChI. There's the InChI string, but Open Babel and some other tools also support an "InChI file" with the InChI string as the first term, followed by a whitespace, followed by some title or identifier, as well as an "InChIKey file", using the InChIKey instead of the InChI.

Thus, chemfp has "inchistring" and "inchi", and "inchikeystring" and "inchikey", in parallel with the "smistring"/"smi" distinction.

The other formats, like "sdf" and "pdb", are only record-oriented and don't have the same dual meaning as SMILES and InChI.

Compressed formats

A longer term chemfp idea is to extend the binary format to store record data. My current experiments use SQLite for the records and FPB files for the fingerprints.

Uncompressed SDF records take up a lot of space, and compress well using standard zlib compression. The file-based I/O functions support format names like "smi.gz". I extended the idea to support zlib-compressed records, like "sdf.zlib".
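
Here's a sketch of how that looks, assuming 'sdf_record' holds an SD record as a byte string:

import zlib

compressed = zlib.compress(sdf_record)
mol = T.parse_molecule(compressed, "sdf.zlib")  # parse a zlib-compressed record
record = T.create_string(mol, "sdf.zlib")       # generate a zlib-compressed record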

Output format names: different types of SMILES

The SMILES output format names are also tricky. This needs a bit of history to understand fully. The Daylight toolkit introduced SMILES strings, but the original syntax did not support isotopes or chirality. These were added later, as so-called "isomeric SMILES". Daylight and nearly every toolkit since maintained that separation, where a "SMILES" output string (either canonical or non-canonical) was not isomeric, and something different needed to be done to get an isomeric SMILES.

This was a usability mistake. Most people expect that when the input is 238U then the output SMILES will be "[238U]". I know, because I've probably made that mistake a dozen times in my own code. On the plus side, it's usually very easy to detect and fix. On the minus side, I've only rarely needed canonical non-isomeric SMILES, so the default ends up encouraging mistakes.

OEChem 2.0 decided to break the tradition and say that "smi" refers to canonical isomeric SMILES, which is what most expect but didn't get, that "can" refers to canonical non-isomeric SMILES (this is unchanged), and that "usm" is the new term for non-canonical, non-isomeric SMILES.

It's a brilliantly simple solution to a usability problem I hadn't really noticed before they solved it. This made so much sense to me that I changed chemfp's new I/O API to use those format names and meanings. I hope others also follow their lead.

That's why "smi" as a chemfp output format means canonical isomeric SMILES, "can" means canonical non-isomeric, and "usm" means non-canonical non-isomeric. The corresponding string formats are "smistring", "canstring", and "usmstring". Here they are in action:

>>> mol = T.parse_molecule("c1ccccc1[16OH] phenol", "smi")
>>> T.create_string(mol, "smi")
'[16OH]c1ccccc1 phenol\n'
>>> T.create_string(mol, "can")
'Oc1ccccc1 phenol\n'
>>> T.create_string(mol, "usm")
'c1ccccc1O phenol\n'
>>> T.create_string(mol, "smistring")
'[16OH]c1ccccc1'

id_tag, parse_id_and_molecule(), and titles

The parse_molecule() function only really uses the "id_tag" parameter to improve error reporting - the error message will use the 'id_tag's value rather than the default title for a record.

The id_tag parameter exists because developers in Hinxton decided to put the record identifier in a tag of an SD file instead of placing it in the record title like the rest of the world. As a result, many cheminformatics tools stumble a bit with the ChEBI and ChEMBL datasets, which either have a blank title or the occasional non-blank but useless title like:

"tetrahydromonapterin"
(1E)-1-methylprop-1-ene-1,2,3-tricarboxylate
S([O-])([O-])(=O)=O.[Ni+2].O.O.O.O.O.O.O
Structure #1
Untitled Document-1
XTP
I've decided that this is a bug, but despite years of complaints they haven't changed it, so chemfp has to work around it.

Here's an SD record from ChEBI that you can copy and paste:


  Marvin  02030822342D          

  2  1  0  0  0  0            999 V2000
   -0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
M  STY  1   1 SUP
M  SAL   1  2   1   2
M  SMT   1 O2
M  END
> <ChEBI ID>
CHEBI:15379

> <ChEBI Name>
dioxygen

> <Star>
3

$$$$
I'll assign that to the variable 'sdf_record', parse it, and show that the title is empty, though the identifier is available as the "ChEBI ID" tag.
>>> from chemfp import rdkit_toolkit as T
>>> sdf_record = """\
...   
...   Marvin  02030822342D          
... 
...   2  1  0  0  0  0            999 V2000
...    -0.4125    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
              ... many lines deleted ...
... $$$$
... """
>>> mol= T.parse_molecule(sdf_record, "sdf")
[00:49:59]  S group SUP ignored on line 8
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x1043de4b0>
>>> T.get_id(mol)
''
>>> T.get_tag(mol, "ChEBI ID")
'CHEBI:15379'
(The "get_id" and "get_tag" methods are part of the new molecule API. I'll discuss them in a future essay.)

I found that I often want to get both the identifier and the molecule. Rather than use a combination of parse_molecule() and get_id()/get_tag() to get that information, I created the parse_id_and_molecule() function, which returns the 2-element tuple of (id, molecule), whose id_tag parameter specifies where to find the id.

>>> T.parse_id_and_molecule(sdf_record, "sdf", id_tag="ChEBI ID")
[00:50:37]  S group SUP ignored on line 8
('CHEBI:15379', <rdkit.Chem.rdchem.Mol object at 0x104346a60>)
If the id_tag is None, then it uses the appropriate default for the given format; the title for SD records. Otherwise it contains the name of the tag with the identifier. (If there are multiple tags with the same name then the choice of which tag to use is made arbitrarily.)

I can imagine that some people might place an identifier in a non-standard location for other formats. If that happens, then there will likely be an id_tag syntax for that format.

For example, the closest I've come up with is a SMILES variant where the id you want is in the third column. In that case the id_tag might be "#3". If the column is labeled "SMILES" then an alternate would be "@SMILES", or if you know the column name doesn't start with a '#' or '@' then simply "SMILES".

However, that's hypothetical. I haven't seen it in real life. What I have seen are CSV files, where the SMILES and id columns are arbitrary, and which may have Excel-specific delimiter and quoting rules. Chemfp doesn't currently support CSV files; if it ever does, that will likely be handled through reader_args.

Handling whitespace in a SMILES record

The SMILES record format is not portable across toolkits. According to Daylight, and followed by OEChem, the format is the SMILES string, a single whitespace, and the rest of the line is the title. This title might be a simple alphanumeric id, or an IUPAC name with spaces in it, or anything else.

Other toolkits treat it as a character delimited set of columns. For example, RDKit by default uses the space character as the delimiter, though you can change that to tab, comma, or other character.

This is a problem because you might want to generate fingerprints from a SMILES file containing the record:

C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a
In one toolkit you'll end up with the identifier "vitamin" and with another toolkit get the identifier "vitamin a".

My experience is that most people treat a SMILES file as a space or tab delimited set of columns, with a strong preference for the space character. This isn't universally true. Roger Sayle pointed out to me that the formal IUPAC name is a perfectly reasonable unique and canonical identifier, which can have a space in it. The Daylight interpretation supports IUPAC names, while the space delimited version does not.

There is no perfect automatic solution to this ambiguity. Instead, chemfp lets you specify the appropriate delimiter type using the "delimiter" reader argument. The supported delimiter type names are "whitespace", "tab", "space", and "to-eol", as well as the literal characters " " and "\t". The default delimiter is "whitespace", because most of the people I work with think of a SMILES file more like a whitespace delimited file.

Here's an example of how the default doesn't handle "vitamin a", but where the "to-eol" delimiter handles it correctly:

>>> smiles = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a\n"
>>> T.parse_id_and_molecule(smiles, "smi")
('vitamin', <rdkit.Chem.rdchem.Mol object at 0x10bbf39f0>)
>>> T.parse_id_and_molecule(smiles, "smi",
...              reader_args={"delimiter": "to-eol"})
('vitamin a', <rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0>)
as well as an example of the difference between the "tab" and "to-eol" delimiter types:
>>> T.parse_id_and_molecule("C\tA B\tC\t", "smi",
...              reader_args={"delimiter": "tab"})
('A B', <rdkit.Chem.rdchem.Mol object at 0x10bc8c2f0>)
>>> T.parse_id_and_molecule("C\tA B\tC\t", "smi",
...              reader_args={"delimiter": "to-eol"})
('A B\tC\t', <rdkit.Chem.rdchem.Mol object at 0x10bbf39f0>)

It was a bit of work to make the different toolkits work the same way, and my best attempt isn't perfect. For example, if you are daft and try to interpret the record "C Q\tA" as a tab-delimited set of columns, then OEChem will see this as methane with an id of "Q" while RDKit and Open Babel will say they can't parse the SMILES "C Q".

So don't do that!

reader_args

As you saw, reader_args is a Python dictionary. All of the SMILES parsers accept the 'delimiter' argument, and the RDKit and Open Babel reader_args also support the "has_header" argument. If true, the first line of the file contains a header. (I couldn't think of a good implementation of this for OEChem.)

There are also toolkit-specific reader_args. Here I'll disable RDKit's sanity checker for SMILES, and show that it accepts a SMILES that it would otherwise reject.

>>> T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": False})
<rdkit.Chem.rdchem.Mol object at 0x104407440>
>>> T.parse_molecule("c1ccccc1#N", "smistring", reader_args={"sanitize": True})
[22:09:40] Can't kekulize mol 

Traceback (most recent call last):
  File "<stdin>", line 1, in 
  File "chemfp/rdkit_toolkit.py", line 251, in parse_molecule
    return _toolkit.parse_molecule(content, format, id_tag, reader_args, errors)
  File "chemfp/base_toolkit.py", line 986, in parse_molecule
    id_tag, reader_args, error_handler)
  File "chemfp/base_toolkit.py", line 990, in _parse_molecule_impl
    id, mol = format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 1144, in parse_id_and_molecule
    error_handler.error("RDKit cannot parse the SMILES string %r" % (terms[0],))
  File "chemfp/io.py", line 77, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES string 'c1ccccc1#N'

The SMILES parsers use different reader_args than the SDF parser. You can see the default reader_args by using the toolkit's format API:

>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.get_format("smi").get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> rdkit_toolkit.get_format("sdf").get_default_reader_args()
{'strictParsing': True, 'removeHs': True, 'sanitize': True}
Also, the different toolkits may use different reader_args for the same format.
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.get_format("sdf").get_default_reader_args()
{'flavor': None, 'aromaticity': None}
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.get_format("sdf").get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
I'll cover more about the format API in another essay.

Namespaces

This can lead to a problem. You saw earlier how to get the correct toolkit for a given fingerprint type. Once you have the toolkit you can parse a record into a toolkit-specific molecule. But what if you want toolkit-specific and format-specific settings?

First off, parsers ignore unknown reader_arg names, and the reader_arg names for the different toolkits are different, except for "delimiter" and "has_header" where it doesn't make sense for them to be different. That means you could do:

reader_args = {
  "delimiter": "tab",       # for SMILES files
  "strictParsing": False,   # for RDKit SDF
  "perceive_stereo": True,  # for Open Babel SDF
  "aromaticity": "daylight, # for all OEChem readers
}
and have everything work.

Still, it's possible that you want OEChem to parse a SMILES using "daylight" aromaticity and an SD file using "openeye" aromaticity.

The reader_args are namespaced, so for that case you could use a format qualifier, like this:

reader_args = {
  "smi.aromaticity": "daylight",
  "can.aromaticity": "daylight",
  "usm.aromaticity": "daylight",
  "sdf.aromaticity": "openeye",
}
There's also a toolkit qualifier. In this daft example, the OpenEye reader uses the whitespace delimiter option for SMILES files, the RDKit SMILES and InChI readers uses tab, and the Open Babel SMILES and InChI readers use the space character.
reader_args = {
  "openeye.*.delimiter": "whitespace",
  "rdkit.*.delimiter": "tab",
  "openbabel.*.delimiter": "space",
}
Fully qualified names, like "openeye.smi.delimiter" are also allowed.

The remaining problem is how to configure the reader_args. It's no problem to write the Python dictionary yourself, but what if you want people to pass them in as command-line arguments or in a configuration file? I'll cover that detail when I talk about the format API.

Errors

What happens if the SMILES isn't valid? I'll use my favorite invalid SMILES:

>>> T.parse_id_and_molecule("Q q-ane", "smi")
[22:01:37] SMILES Parse Error: syntax error for input: Q
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/rdkit_toolkit.py", line 264, in parse_id_and_molecule
    return _toolkit.parse_id_and_molecule(content, format, id_tag, reader_args, errors)
  File "chemfp/base_toolkit.py", line 1014, in parse_id_and_molecule
    id_tag, reader_args, error_handler)
  File "chemfp/base_toolkit.py", line 1018, in _parse_id_and_molecule_impl
    return format_config.parse_id_and_molecule(content, id_tag, reader_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 249, in parse_id_and_molecule
    error_handler.error("RDKit cannot parse the SMILES %r" % (smiles,))
  File "chemfp/io.py", line 77, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES 'Q'
A chemfp.ParseError is also a ValueError, so you can check for this exception like this:
try:
    mol = T.parse_molecule("Q", "smistring")
except ValueError as err:
    print("Not a valid SMILES")

I think an exception is the right default action when there's an error, but various toolkits disagree. Some will return a None object in that case, and I can see the reasoning. It's sometimes easier to write things more like:

>>> mol = T.parse_molecule("Q", "smistring", errors="ignore")
[00:04:26] SMILES Parse Error: syntax error for input: Q
>>> print(mol)
None

The default errors value is "strict", which raises an exception. The "ignore" behavior is to return None when the function can't parse a SMILES.

In the above example, the "SMILES Parse Error" messages come from the underlying RDKit C++ library. It's difficult to capture this message from Python, so chemfp doesn't even try. OEChem and Open Babel have similar error messages. Chemfp doesn't try to capture those either.

Error handling control for a single record isn't so important. It's more important when parsing multiple records, where "ignore" skips a record and keeps processing and "strict" raises an exception and stops processing once a record cannot be parsed. I'll cover that in the essay where I talk about reading molecules.

Other error handling - experimental!

The errors parameter can also be one of two other strings: "report" and "log". The "report" option writes the error information to stderr, like this:

>>> T.parse_molecule("Q", "smi", errors="report")
[03:05:52] SMILES Parse Error: syntax error for input: Q
ERROR: RDKit cannot parse the SMILES 'Q'. Skipping.
The "SMILES Parse Error", again, is generated by the RDKit C++ level going directly to stderr. The "RDKit cannot parse the SMILES" line, however, is chemfp sending information to Python's sys.stderr.

I don't think the "report" option is all that useful for this case since it's pretty much a duplicate of the underlying C++ toolkit. It does know about the id_tag, so it gives better error message for ChEBI and ChEMBL files.

The "log" option is the same as "report" except that it sends the message to the "chemfp" logger using Python's built-in logging system. I have very little experience with logging, so this is even more experimental than "report".

Finally, the "errors" parameter can take an object to use as the error handler. The idea is that the handler's error() is called when there's an error. See chemfp/io.py for how it works, but consider this to be undocumented, experiemental, and almost certain to change.

I have the idea that the error handler can have a warn() method, which would be called for warnings. The current generation of toolkits uses a global error handler or sends the message directly to stderr. In a multi-threaded system, which might parse two molecules at the same time in different threads, it's hard to know which molecule caused the error.

I haven't done this because ... it's hard using the current generation of tools to get that warning information. I'm also unsure about the error handler protocol. How does it know that it's collected all of the warnings for a given molecule? Is there a special "end_of_warnings()" method? Are the warnings accumulated and sent in one call? What if there are both warnings and errors?

That's why this subsection is titled "experimental". :)



New chemfp APIs

In the next few essays I'll cover chemfp's new APIs. I'll have a different focus here than the official documentation will. I'm writing for people interested in chemistry toolkit API development, with a focus on the I/O subsystem in such a toolkit. I've spent about 15 years experimenting with and evaluating different API ideas across a range of use cases. This new API is a synthesis of those ideas. I hope it proves influential.

Why does chemfp have these new APIs?

Chemfp is a multi-platform Python library for cheminformatics fingerprint generation and search. It depends on OEChem/OEGraphSim, RDKit, or Open Babel to parse a structure file and generate fingerprints, and has its own highly optimized search algorithms.

Chemfp supports multiple toolkits because many of the people I know who use fingerprints use fingerprints from multiple toolkits. While this is biased by the people I know, in general I've seen that most people who use a similarity-based search as part of their science spend some time figuring out which fingerprint type, and which parameters, are most appropriate for that research.

Unfortunately, this may mean a lot of fiddling around with different toolkits, since each has a different API and different take on the problem. My belief is that more people will use chemfp (and pay me money) if it really makes it easier to handle fingerprints across different toolkits. The hard part is that no one wants to learn yet another system except perhaps if it's significantly better. I think I've done that.

Example

Here's an example program using the new API. It reads a SMILES file, computes OpenEye's circular fingerprints (with the default settings), and saves the results to an FPS file.

import chemfp

fptype = "OpenEye-Circular"
source = "benzodiazepine.smi"
destination = "benzodiazepine.fps"

with chemfp.read_molecule_fingerprints(fptype, source) as reader:
    with chemfp.open_fingerprint_writer(destination,
                                        metadata=reader.metadata) as writer:
        writer.write_fingerprints(reader)
I wrote that in a more formal style, using Python's "with" statement to ensure that the files are closed when no longer needed. While this gets all of the details correct, it is verbose. You might even call it excessive, since you can get the basic result using:
import chemfp

chemfp.open_fingerprint_writer("benzodiazepine.fps").write_fingerprints(
    chemfp.read_molecule_fingerprints("OpenEye-Circular", "benzodiazepine.smi"))
The only major difference between the two is that the output fingerprint file in the latter doesn't contain any metadata.

Fingerprint family and fingerprint type API

Object-oriented programming uses the words "class" and "instance". A class is the more abstract concept. It describes what's possible. An instance fills in the details. For example, "integers" is the class of integer numbers, and "38" is an instance of that class.

Chemfp-1.2 makes a similar distinction between a "fingerprint family" and a "fingerprint type". A family might be "circular fingerprints" and a type is "circular fingerprints of size 2".

Internally, chemfp has a fingerprint family registry. To get information about all of the available fingerprint families, use get_fingerprint_families(), and use get_fingerprint_family() to get a specific family by name:

>>> import chemfp
>>> chemfp.get_fingerprint_families()
[FingerprintFamily(<ChemFP-Substruct-OpenBabel/1>),
FingerprintFamily(<ChemFP-Substruct-OpenEye/1>),
FingerprintFamily(<ChemFP-Substruct-RDKit/1>),
FingerprintFamily(<OpenBabel-FP2/1>),
  ...
FingerprintFamily(<RDMACCS-OpenBabel/1>),
FingerprintFamily(<RDMACCS-OpenEye/1>),
FingerprintFamily(<RDMACCS-RDKit/1>)]
>>> family = chemfp.get_fingerprint_family("OpenEye-Circular")
>>> family
FingerprintFamily(<OpenEye-Circular/2>)
(This may take a while because the registry uses a lazy resolver which doesn't load or probe the underlying toolkits until first requested.)

You can ask a fingerprint family for its default values, and make a new fingerprint type using those defaults.

>>> fptype = family()
>>> family.get_defaults()
{'numbits': 4096, 'minradius': 0, 'atype': 4239, 'maxradius': 5, 'btype': 1}
>>> fptype
<chemfp.openeye_types.OpenEyeCircularFingerprintType_v2 object at 0x10dd0ee90>
Every fingerprint type has a type string, which is the canonical string representation of the fingerprint family name, the chemfp version number for that family, and the fingerprint type parameters.
>>> fptype.get_type()
'OpenEye-Circular/2 numbits=4096 minradius=0 maxradius=5 
 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HCount btype=Order'
(This is really one long string. I added a newline to make it easier to read.)

You aren't restricted to the defaults. I'll use 1024 bits/fingerprint and a maximum radius of 3 instead of the defaults of 4096 and 5:

>>> fptype = family(numbits=1024, maxradius=3)
>>> fptype.get_type()
'OpenEye-Circular/2 numbits=1024 minradius=0 maxradius=3
 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HCount btype=Order'

If you have a type string then you can get the fingerprint type directly using chemfp's get_fingerprint_type(). The fingerprint type string does not need to be in canonical form nor include all of the parameters.

>>> fp2type = chemfp.get_fingerprint_type("OpenBabel-FP2")
>>> fp2type.get_type()
'OpenBabel-FP2/1'
>>> 
>>> rdktype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2")
>>> rdktype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> rdktype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2 fpSize=512 maxPath=5")
>>> rdktype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=5 fpSize=512 nBitsPerHash=2 useHs=1'

Fingerprint writer API

Chemfp has always been able to write fingerprints to an FPS file, but I thought the API was too clumsy for public use so I left it as part of the undocumented internals.

I'm much happier with the new fingerprint writer API, which is based on a writer object with a method to save a single fingerprint at a time, and another method to save multiple fingerprints. In this example I asked it to write to None, which means to write to stdout, so you can see the output from each method call.

>>> writer = chemfp.open_fingerprint_writer(None)
#FPS1
>>> writer.write_fingerprint("zeros", "\0\0\0\0")
00000000        	zeros
>>> writer.write_fingerprints( [("fp1", "\0\0\0\1"), ("fp2", "\0\0\0\2")] )
00000001        	fp1
00000002        	fp2
>>> writer.close()

Version 1.2 also introduces a new binary file format, the FPB format. The writer API knows how to write an FPB file and will do so if the format ends with ".fpb" or told specifically to write an FPB file.
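
For example (a sketch; "example.fpb" is a placeholder filename):

writer = chemfp.open_fingerprint_writer("example.fpb")  # FPB output, chosen by the extension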

Toolkit APIs

The toolkit API is the collective name for the format API, the molecule I/O API, and the molecule API. There is one toolkit API implementation for each supported chemistry toolkit, which is basically a wrapper from the chemfp API to the underlying toolkit API. Use get_toolkit_names() to get a list of the available chemistry-based toolkits:

>>> chemfp.get_toolkit_names()
set(['openeye', 'rdkit', 'openbabel'])
>>> T = chemfp.get_toolkit("openeye")
>>> T.is_licensed()
True
Instead of get_toolkit() you can import a toolkit directly in Python:
>>> from chemfp import openeye_toolkit as T
>>> T.is_licensed()
True
>>> T.name
'openeye'

In addition to the chemistry-based toolkits, the "text" toolkit implements just enough of the toolkit API to process SMILES and SDF records. It doesn't know chemistry; it stores the contents of each record as a string. I wrote it to handle a few use cases that couldn't be handled with the molecule-centered toolkits. I'll discuss these cases in the future.

Format API

The get_formats() function lists the available formats for a given toolkit.

>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.get_formats()
[Format('rdkit/canstring'), Format('rdkit/usmstring'),
Format('rdkit/smistring'), Format('rdkit/molfile'),
Format('rdkit/usm'), Format('rdkit/sdf'), Format('rdkit/can'),
Format('rdkit/smi'), Format('rdkit/rdbinmol')]
There are also get_input_formats() and get_output_formats() functions, or you can use attributes of the format to determine if the toolkit understands the format as an input or output, as well as a few other properties.

You can use the format API to figure out the toolkit's (or at least chemfp's) best guess for the format for a given filename. This includes compression detection.

>>> rdkit_toolkit.get_input_format_from_source("example.smi.gz")
Format('rdkit/smi.gz')

Molecule input API

This is by far the most extensive of the new APIs in chemfp-1.2. Previously there was read_molecule_fingerprints(), which took a fingerprint type name and the source and returned fingerprints for each record. Suppose though that you want to read a structure file and generate fingerprints using three different RDKit fingerprint types. The older API requires reading the file three different times, which means parsing the same record three different times, even though the same RDKit molecule could be reused for each.

The new API lets you separate the process of reading molecules from generating fingerprints, which can be a nice performance boost.

The old API could only take input from files, or file-like sources. The new API has more extensive support for reading from a string. This is useful in a web application, where the input might be a text field from a submitted form.

In practice, there are several variations on how to read an input, depending on whether it comes from a file or a string, and whether you want to read a single record (the first one) or all records. Sometimes you want the molecule id when reading a molecule and sometimes you don't want that overhead; I ended up implementing both of those readers. Finally, there are also cases where you want to convert a lot of records into molecules, but each record is stored in its own string. The make_id_and_molecule_parser() creates a specialized parser for that case. A short usage sketch follows the table.

Different toolkit reader API functions

                                      ...from a file            ...from a string
Read 0 or more molecules              read_molecules()          read_molecules_from_string()
Read 0 or more (id, molecule) pairs   read_ids_and_molecules()  read_ids_and_molecules_from_string()
Parse the first molecule                                        parse_molecule()
Parse the first molecule and its id                             parse_id_and_molecule()
Make an optimized function to parse
  the first molecule and its id                                 make_id_and_molecule_parser()
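As a rough sketch of how a few of these fit together (the filename here is made up):

from chemfp import rdkit_toolkit

# Iterate over all of the molecules in a structure file ...
for mol in rdkit_toolkit.read_molecules("example.sdf"):
    pass  # ... process each molecule ...

# ... or over the (id, molecule) pairs ...
for id, mol in rdkit_toolkit.read_ids_and_molecules("example.sdf"):
    pass  # ... process each id and molecule ...

# ... or parse a single record that's already in a string.
mol = rdkit_toolkit.parse_molecule("c1ccccc1O", "smistring")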

Molecule output API

Chemfp-1.2 adds new APIs to save molecules to a structure file or convert a molecule into a record. Like the reader, there are several variations in the writer.

The open_molecule_writer() and open_molecule_writer_to_string() functions return a writer object, though in the first case the writer saves to a file while in the second it saves to memory that can be accessed as a string. The writer object has methods to save a single molecule at a time, to save multiple molecules at once, and to save a molecule using an alternate id.

To demonstrate, here's a program which reads the compressed ChEBI SD file and saves it to a compressed SMILES file. I specify the id tag because ChEBI stores its identifiers in the "ChEBI ID" SD tag, and not the title:

from chemfp import openeye_toolkit as T

with T.read_ids_and_molecules("ChEBI_complete.sdf.gz", id_tag="ChEBI ID") as reader:
    with T.open_molecule_writer("chebi.smi.gz") as writer:
        writer.write_ids_and_molecules(reader)

When you only need to turn a molecule into a record, use the create_string() function. If you need to do a lot of these conversions, all with the same parameters, then use make_string_creator(). This function returns a specialized function which is better optimized to turn a molecule into a string.
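Something along these lines; the exact call signatures here are my guesses, not the documented ones:

from chemfp import openeye_toolkit as T

mols = [T.parse_molecule(smi, "smistring")
        for smi in ("c1ccccc1O", "CCO")]

# One-off conversion (assumed signature).
record = T.create_string(mols[0], "smi")

# Many conversions with the same parameters (assumed signature).
to_smi = T.make_string_creator("smi")
records = [to_smi(mol) for mol in mols]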

Molecule API

Chemfp does not try to make a common molecule API across the different toolkits. For that, see Noel O'Boyle's "cinfony". In fact, it's quite the opposite. The readers, writers, and fingerprint generation functions take native molecule types and not wrapped objects.

All chemfp offers is a small set of functions to get what the toolkit considers to be a molecule's title, to do some simple manipulation of the SD tags for a record, and to make a copy of a molecule.

Connecting fingerprint types and toolkits to make a fingerprint

If you have a fingerprint type string, and a SMILES string, how do you generate the fingerprint? Each toolkit fingerprint type has a compute_fingerprint() function, which takes a toolkit molecule and returns the fingerprint string, so that's in the right direction. You still need to know which toolkit to use to convert the SMILES into the right molecule.

Well, actually you don't. The fingerprint type's parse_fingerprint() will parse a molecule and generate the fingerprint for you:

>>> fptype = chemfp.get_fingerprint_type("OpenBabel-FP2")
>>> fptype.parse_fingerprint("c1ccccc1O", "smi")
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x80\x00\x00\x00
\x00\x00\x00@\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00'
This helper function works because the fingerprint type knows its toolkit, through the "toolkit" attribute. This attribute is part of the public API, so you could do the two steps yourself if you want to:
>>> fptype.toolkit
<module 'chemfp.openbabel_toolkit' from 'chemfp/openbabel_toolkit.pyc'>
>>> mol = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
>>> fptype.compute_fingerprint(mol)
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x80\x00\x00\x00
\x00\x00\x00@\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00'



Summary of the /etc/passwd reader API #

Next month I will be in Finland for Pycon Finland 2014. It will be my first time in Finland. I'm especially looking forward to the pre-conference sauna on Sunday evening.

My presentation, "Iterators, Generators, and Coroutines", will cover much of the same ground as my earlier essay. In that essay, I walked through the steps to make an /etc/passwd parser API which can be used like this:

from passwd_reader import read_passwd_entries

with read_passwd_entries("/etc/passwd", errors="strict") as reader:
    location = reader.location
    for entry in reader:
        print("%d:%s" % (location.lineno, entry.name))
I think the previous essay was a bit too detailed for the overall points to come through, so in this essay I'll summarize what I did and why I did it. Hopefully it will also help me prepare for Finland.

The /etc/passwd parser is built around a generator, which is a pretty standard practice. Another standard approach is to build a parser object, as a class instance which implements the iterator protocol. The main difference is that the generator uses local variables in the generator's execution frame where the class approach uses instance variables.

Since the parser API can open a file (when the first parameter is a filename string instead of a file object), I want it to implement the context manager protocol for deterministic resource handling. If it always created a file then I could use contextlib.closing() or contextlib.contextmanager() to convert an iterator into a self-closing context manager, but my read_passwd_entries reader is polymorphic in that first parameter, so I can't use a standard solution.

I instead wrapped the generator inside of a PasswdReader which implements the appropriate __enter__ and __exit__ methods.

I also want the parser to track location information about the current record; in this case the line number of the current record, but in general it could include byte position or other information about the record's provenance. I store this in a Location instance accessed via the PasswdReader's "location" attribute.

The line number is stored as a local variable in the iterator's execution frame. While this could be accessed through the generator's ".gi_frame.f_locals", the documentation says that frames are internal types whose definitions may change with future versions of the interpreter. That doesn't sound like something I want to depend upon.

Instead, I used an uncommon technique where the generator registers a callback function that the Location can use to get the line number. This function is defined inside of the generator's scope so it can access the local variables. This isn't quite as simple as I would like, because exception handling in a generator, including the GeneratorExit raised by calling a generator's close(), is a bit tricky, but it does work.

The more common technique is to rewrite the generator as a class which implements the iterator protocol, where each instance stores its state information as instance variables. It's easy to access instance variables, but it's a different sort of tricky to read and write the state information at, respectively, the start and end of each iteration step.
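Here's a minimal sketch of that class-based style, just to show where the state bookkeeping goes (the entry building is elided):

class PasswdParser(object):
    def __init__(self, infile):
        self._infile = infile
        self.lineno = 0  # parser state lives in instance variables

    def __iter__(self):
        return self

    def next(self):
        for line in self._infile:
            self.lineno += 1  # write the state back on every step
            if line[:1] == "#" or line.strip() == "":
                continue
            return line.rstrip("\n").split(":")  # build the entry here
        raise StopIteration
    __next__ = next  # for Python 3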

A good software design balances many factors, including performance and maintainability. The weights for each factor depend on the expected use cases. An unusual alternate design can be justified when it's a better match to the use cases, which I think is the case with my uncommon technique.

In most cases, API users don't want the line number of each record. For the /etc/passwd parser I think it's only useful for error reporting. More generally, it could be used to build a record index, or a syntax highlighter, but those are relatively rare needs.

The traditional class-based solution is, I think, easier to understand and implement, though it's a bit tedious to save and restore the parser state for each entry and exit point. This synchronization adds a lot of overhead to the parser, which isn't needed for the common case where that information is ignored.

By comparison, my alternative generator solution has a larger overhead - two function calls instead of an attribute lookup - to access location information, but it doesn't need the explicit save/restore for each step because those are maintained by the generator's own execution frame. I think it's a better match for my use cases.

(Hmm. I think I've gone the other way. I think few will understand this without the examples or context. So, definitely lots of code examples for Pycon Finland.)



Lausanne Cheminformatics workshop and contest #

Dietrich Rordorf from MDPI sent an announcement to the CHMINF mailing list about the upcoming 9th Workshop in Chemical Information. It will be on 12 September 2014 in Lausanne, Switzerland. It seems like it will be a nice meeting, so I thought I'd forward information about it here. They also have a software contest, with a 2,000 CHF prize, which I think will interest some of my readers.

The workshop has been around for 10 years, so I was a bit surprised that I hadn't heard of it before. Typically between 30 and 50 people attend, which I think is a nice size. The preliminary program is structured around 20 minute presentations, including:

If you know the authors, you might recognize that one is from Strasbourg, another from London, and the rest from Switzerland. I can understand why. From where I live in Sweden it will cost over US $300 to get there, and Lausanne doesn't have its own commercial airport, so I would need to fly into Geneva or Bern, while my local air hub doesn't fly there directly.

But I live in a corner of Europe, and my constraints aren't yours.

Source code contest

I had an email conversation with Luc Patiny about an interesting aspect of the workshop. They are running a contest to identify the best open source cheminformatics tool of the year, with a prize of 2000 CHF. That's 1650 EUR, 2200 USD, 1300 GBP, or 15000 SEK, which is plenty for someone in Europe or even the US to travel there! They have a time slot set aside for the winner of the contest to present the work. The main requirement is that contest participants are selected from submissions (1-2 pages long) to the open access journal Challenges. (And no, there are no journal page fees for this contest, so it doesn't seem like a sneaky revenue-generating trick.)

The other requirement is that the submission be "open source". I put that in quotes because much of my conversation with Luc was to understand what they mean. They want people to be able to download the (unobfuscated) software source code for no cost, and be able to read and review it to gain insight.

I think this is very much in line with classical peer review thought, even though it can include software which is neither open source nor free software. For example, software submissions for this contest could be "for research purposes only" or "not for use in nuclear reactors", or "redistributions of modified versions are not permitted."

Instead, I think their definition is more in line with what Microsoft terms shared source.

In my case, my latest version of chemfp is 'commercial open source', meaning that those who pay me money for it get a copy of it under the MIT open source license. It's both "free software" and "open source", but it's not eligible for this workshop because it costs money to download it.

But I live in a corner of open source, and my constraints aren't yours. ;) If you have a free software project, open source software project, or shared source software project, then you might be interested in submitting it to this workshop and journal. If you win, think of it as an all-expenses paid trip to Switzerland. If you don't win, think of it as a free publication.



I/O location and error handling #

I've been interested in a thorny question of how to handle errors and location information in file I/O, especially as it applies to chemical structure file formats. I'll organize my ideas around a parser for the /etc/passwd file on my Mac.

Some of the lines from my passwd file are:

# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
Comment lines start with a "#". Each record is on a single line, with colon-separated fields.

A simple parser

This is very easy to parse. Here's an example, along with a driver:

from __future__ import print_function

class PasswdEntry(object):
    def __init__(self, name, passwd, uid, gid, gecos, dir, shell):
        self.name = name
        self.passwd = passwd
        self.uid = uid
        self.gid = gid
        self.gecos = gecos
        self.dir = dir
        self.shell = shell
        
def read_passwd_entries(infile):
    for line in infile:
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue
     
        name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

def main():
    import sys
    filename = "/etc/passwd"
    if sys.argv[1:]:
      filename = sys.argv[1]
    with open(filename) as passwd_file:
        for entry in read_passwd_entries(passwd_file):
            print(entry.name, "=>", entry.shell)

if __name__ == "__main__":
    main()

The output when I run this starts with:

nobody => /usr/bin/false
root => /bin/sh
daemon => /usr/bin/false
Great! I have working code which handles the vast majority of things I want to do with a passwd file.

Errors should match the API level

But it doesn't handle everything. For example, what happens if a file contains a poorly formatted line? When I test with the following line:

dalke:*:-2:-2:Andrew Dalke:/var/empty
I get the output:
% python entries.py bad_passwd 
nobody => /usr/bin/false
Traceback (most recent call last):
  File "entries.py", line 33, in <module>
    main()
  File "entries.py", line 29, in main
    for entry in read_passwd_entries(passwd_file):
  File "entries.py", line 20, in read_passwd_entries
    name, passwd, uid, gid, gecos, dir, shell = line.split(":")
ValueError: need more than 6 values to unpack
This isn't very helpful. Which record failed, and which line do I need to fix?

It isn't hard to modify read_passwd_entries() to make the error message match the API level:

def read_passwd_entries(infile):
    for lineno, line in enumerate(infile, 1):
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            name = getattr(infile, "name", None)
            if name is not None:
                where = " of %r" % (name,)
            else:
                where = ""
            raise ValueError("Cannot parse password entry on line %d%s: %r"
                             % (lineno, where, line))
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)
The error behavior is now:
% python entries.py bad_passwd
nobody => /usr/bin/false
Traceback (most recent call last):
  File "entries.py", line 42, in <module>
    main()
  File "entries.py", line 38, in main
    for entry in read_passwd_entries(passwd_file):
  File "entries.py", line 29, in read_passwd_entries
    % (lineno, where, line))
ValueError: Cannot parse password entry on line 12 of 'bad_passwd': 'dalke:*:-2:-2:Andrew Dalke:/var/empty'

Specify error handling policy

That's better, but sometimes I want the parser to ignore a bad record and continue processing, instead of stopping. This is frequently needed in chemistry file formats. While the formats are pretty well defined, different chemistry toolkits have different chemistry models and different levels of strictness. RDKit, for example, does not support hexavalent carbon, while OEChem does accept non-realistic valences.

The failure rates in structure files are low, with failures in perhaps 1 in 10,000 structures, depending on the data source and toolkit combination. The usual policy is to report a warning message and continue.

The solution I like is to have a parameter which describes the error handling policy. By default I'll say it's "strict", which stops processing. The "warn" policy prints a message to stderr, and "ignore" just skips the record. (It's not hard to think of other policies, or to support user-defined error handler objects in addition to strings.)

I modified the parser code to handle these three policies, and updated the driver so you could specify the policies on the command-line. Here's the complete program:

from __future__ import print_function

import sys
import argparse

class PasswdEntry(object):
    def __init__(self, name, passwd, uid, gid, gecos, dir, shell):
        self.name = name
        self.passwd = passwd
        self.uid = uid
        self.gid = gid
        self.gecos = gecos
        self.dir = dir
        self.shell = shell

# Define different error handlers, and a common error formatter.

def _format_msg(lineno, source, line):
    if source is None:
        where = "on line %d" % (lineno, source)
    else:
        where = "on line %d of %r" % (lineno, source)
    return "Cannot parse record %s: %r" % (where, line)

def ignore_handler(lineno, source, line):
    pass

def warn_handler(lineno, source, line):
    msg = _format_msg(lineno, source, line)
    sys.stderr.write(msg + "\n")

def strict_handler(lineno, source, line):
    msg = _format_msg(lineno, source, line)
    raise ValueError(msg)

error_handlers = {
    "ignore": ignore_handler,
    "warn": warn_handler,
    "strict": strict_handler,
}
        
def read_passwd_entries(infile, errors="warn"):
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))
    
    for lineno, line in enumerate(infile, 1):
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            # Handle the failure
            source_name = getattr(infile, "name", None)
            error_handler(lineno, source_name, line)
            # If we get here then continue to the next record
            continue
        
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

# Driver code to help test the library

parser = argparse.ArgumentParser(
    description = "List username/shell entries in a password file")
parser.add_argument("--errors", choices = ["ignore", "warn", "strict"],
                    default = "strict",
                    help = "Specify the error handling policy")
parser.add_argument("filename", nargs="?", default=["/etc/passwd"],
                    help = "Password file to parse")
                  
def main():
    args = parser.parse_args()
    filename = args.filename
    try:
        with open(filename) as passwd_file:
            try:
                for entry in read_passwd_entries(passwd_file, errors=args.errors):
                    print(entry.name, "=>", entry.shell)
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
    except IOError as err:
        raise SystemExit("ERROR with password file: %s" % (err,))
        

if __name__ == "__main__":
    main()

Let's see it in action. First, the default, which raises a ValueError. The main driver catches the ValueError, reports the error message, and exits:

% python entries.py bad_passwd 
nobody => /usr/bin/false
ERROR with password entry: Cannot parse record on line 12 of 'bad_passwd': 'dalke:*:-2:-2:Andrew Dalke:/var/empty'
Next, I'll specify the "warn" handler. (The warning message comes first because stderr does not buffer while stdout does.)
% python entries.py bad_passwd --errors warn | head -4
Cannot parse record on line 12 of 'bad_passwd': 'dalke:*:-2:-2:Andrew Dalke:/var/empty'
nobody => /usr/bin/false
root => /bin/sh
daemon => /usr/bin/false
_uucp => /usr/sbin/uucico
Finally, ignore all errors and keep on processing:
% python entries.py bad_passwd --errors ignore | head -4
nobody => /usr/bin/false
root => /bin/sh
daemon => /usr/bin/false
_uucp => /usr/sbin/uucico

API access to location information

Unfortunately, there are still a few things the parser API doesn't support. For example, which line contains the "root" record? The parser tracks the line number, but there's no way for the caller to get access to that information.

My solution was to develop a "Location" object. You might have noticed that the source filename, current line number, and line content were all passed around together. This is often a hint that those parameters should be part of the same data type, rather than passed around individually. I tried it out, and found the result much easier to understand.

Here's the new Location class:

class Location(object):
    def __init__(self, source=None, lineno=None, line=None):
        self.source = source
        self.lineno = lineno
        self.line = line

    def where(self):
        source = self.source
        if source is None:
            return "line %s" % self.lineno
        else:
            return "line %d of %r" % (self.lineno, source)
It's also nice that I could abstract the "where()" code into a method, so it's the only place which needs to worry about an unknown source name.

The error handlers are now easier, because they only take a single location argument, and because the "where()" method hides some of the complexity:

# Define different error handlers, and a common error formatter.

def _format_msg(location):
    return "Cannot parse record %s: %s" % (location.where(), location.line)

def ignore_handler(location):
    pass

def warn_handler(location):
    msg = _format_msg(location)
    sys.stderr.write(msg + "\n")

def strict_handler(location):
    msg = _format_msg(location)
    raise ValueError(msg)

The parser code has a few changes. I want people to be able to pass in their own location tracker, or if not specified, to create my own local one. More importantly, I need to set the location's "lineno" and "line" properties before passing control either to the error handler or back to the caller. I also decided that the final value of "location.lineno" should be the number of lines in the file, and if the file is empty then it should be 0.

Here's the updated parser code:

def read_passwd_entries(infile, errors="warn", location=None):        
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))

    if location is None:
        location = Location(getattr(infile, "name", None))
    
    lineno = 0  # define lineno even if the file is empty

    for lineno, line in enumerate(infile, 1):
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue

        # Track where we are
        location.lineno = lineno
        location.line = line

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            # Handle the failure
            error_handler(location)
            # If we get here then continue to the next record
            continue
        
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

    # Save the final line number
    location.lineno = lineno
It's getting more complex, but I think you can still follow the control flow.

I made some changes to the driver code to test the new location API. If you specify "--with-lineno" then each output line will start with the line number for the given record.

Here's the updated driver code:

parser = argparse.ArgumentParser(
    description = "List username/shell entries in a password file")
parser.add_argument("--errors", choices = ["ignore", "warn", "strict"],
                    default = "strict",
                    help = "Specify the error handling policy")
parser.add_argument("--with-lineno", action="store_true",
                    help="include line numbers in the output")
parser.add_argument("filename", nargs="?", default=["/etc/passwd"],
                    help = "Password file to parse")
                  
def main():
    args = parser.parse_args()
    filename = args.filename
    location = Location(filename)
    if args.with_lineno:
        output_fmt = "{location.lineno}: {entry.name} => {entry.shell}"
    else:
        output_fmt = "{entry.name} => {entry.shell}"
    try:
        with open(filename) as passwd_file:
            try:
                for entry in read_passwd_entries(
                        passwd_file, errors=args.errors, location=location):
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
    except IOError as err:
        raise SystemExit("ERROR with password file: %s" % (err,))

Now to see if the new code works. First, run it with the defaults to get the same output as before:

% python entries.py | head -4
nobody => /usr/bin/false
root => /bin/sh
daemon => /usr/bin/false
_uucp => /usr/sbin/uucico
and then with the option to show the line numbers:
% python entries.py --with-lineno | head -4
11: nobody => /usr/bin/false
12: root => /bin/sh
13: daemon => /usr/bin/false
14: _uucp => /usr/sbin/uucico
Yay!

A PasswdReader iterator, with location

I think this API is still complicated. The caller must create the Location instance and pass it to read_passwd_entries() in order to get the location information. The thing is, even if the location is None, the function ends up creating a Location in order to have good error reporting. What about making that otherwise internal location public, so I can use it as the default location, instead of always specifying it when I need it?

I did this with a bit of wrapping. The read_passwd_entries() function API returns an iterator of PasswdEntry instances. The iterator is currently implemented using a generator, but it doesn't have to be a generator. I'll instead have it return a PasswdReader instance, where PasswdReader.location gives the location, and iterating over the PasswdReader gives the PasswdEntry elements from the underlying generator.

The PasswdReader iterator class, which takes the location and the underlying generator, is as follows:

class PasswdReader(object):
    def __init__(self, location, reader):
        self.location = location
        self._reader = reader

    def __iter__(self):
        return self._reader

    # For Python 2
    def next(self):
        return self._reader.next()

    # For Python 3
    def __next__(self):
        return next(self._reader)
I also modified read_passwd_entries() so it returns this new iterator instead of the original generator. I've found that the easiest way is to break the original function into two parts. The first does parameter validation and normalization, and the second is the actual generator:
def read_passwd_entries(infile, errors="warn", location=None):
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))
    
    if location is None:
        location = Location(getattr(infile, "name", None))

    return PasswdReader(location,
                        _read_passwd_entries(infile, error_handler, location))
    
def _read_passwd_entries(infile, error_handler, location):
    lineno = 0  # define lineno even if the file is empty
    for lineno, line in enumerate(infile, 1):
       ...
The main reason to split it up this way is to have eager evaluation for the parameter checking, and lazy evaluation for the actual reader.
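One concrete effect of that eager part: a bad errors value is reported at the call site, not at the first iteration.

# With the two-part version, this raises ValueError immediately:
reader = read_passwd_entries(open("/etc/passwd"), errors="sloppy")

# If read_passwd_entries() were itself the generator, the bad value
# would not be noticed until the first next(reader).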

PasswdReader iterator as context manager

The end result is simpler driver code, though it's still not as simple as I would like. The old code was something like:

    location = Location(filename)
    ...
        with open(filename) as passwd_file:
            try:
                for entry in read_passwd_entries(
                        passwd_file, errors=args.errors, location=location):
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
while the new driver only needs to get the "location" field. The critical core in the new driver is:
        with open(filename) as passwd_file:
            reader = read_passwd_entries(passwd_file, errors=args.errors)
            location = reader.location
            try:
                for entry in reader:
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))

To make it simpler still, I'll include the open() functionality as part of the read_passwd_entries() API. That is, if the first parameter is a string then I'll assume it's a file name and open the file myself, otherwise I'll assume it's a Python file object. (In CS terms, I'll make the function polymorphic on the first parameter.)

That is, I want the driver code to look like this:

        with read_passwd_entries(filename, errors=args.errors) as reader:
            location = reader.location
            try:
                for entry in reader:
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
That means that the object returned from read_passwd_entries() must also support the context manager API. The simplest (and wrong) version adds dummy __enter__ and __exit__ methods to the PasswdReader, like this:
# WARNING: incomplete code
class PasswdReader(object):
    def __init__(self, location, reader):
        self.location = location
        self._reader = reader

    ...
    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass
This is wrong because if the reader is given a filename and calls open() then the context manager must call the corresponding close() in the __exit__, while if the reader is given a file object then the __exit__ should not do anything to the file.

The more correct class takes a "close" parameter which, if not None, is called during __exit__:

class PasswdReader(object):
    def __init__(self, location, reader, close):
        self.location = location
        self._reader = reader
        self._close = close

    ...

    def __enter__(self):
        return self

    def __exit__(self, *args):
        if self._close is not None:
            self._close()
            self._close = None

I also need to change read_passwd_entries() to do the right thing based on the type of the first parameter. I've renamed the parameter to "source", and added a bit of special checking to handle the correct type check for both Python 2 and Python 3. Here are the core changes:

try:
    _basestring = basestring
except NameError:
    _basestring = str

def read_passwd_entries(source, errors="warn", location=None):
    if isinstance(source, _basestring):
        infile = open(source)
        source_name = source
        close = infile.close
    else:
        infile = source
        source_name = getattr(infile, "name", None)
        close = None
    
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))
    
    if location is None:
        location = Location(source_name)

    return PasswdReader(location,
                        _read_passwd_entries(infile, error_handler, location),
                        close)

The final version

Enough has changed that I'll show the full code, instead of leaving you to piece the parts together yourself.

from __future__ import print_function

import sys
import argparse

class PasswdReader(object):
    def __init__(self, location, reader, close):
        self.location = location
        self._reader = reader
        self._close = close

    def __iter__(self):
        return self._reader

    # For Python 2
    def next(self):
        return self._reader.next()

    # For Python 3
    def __next__(self):
        return next(self._reader)

    def __enter__(self):
        return self

    def __exit__(self, *args):
        if self._close is not None:
            self._close()
            self._close = None

class PasswdEntry(object):
    def __init__(self, name, passwd, uid, gid, gecos, dir, shell):
        self.name = name
        self.passwd = passwd
        self.uid = uid
        self.gid = gid
        self.gecos = gecos
        self.dir = dir
        self.shell = shell

class Location(object):
    def __init__(self, source=None, lineno=None, line=None):
        self.source = source
        self.lineno = lineno
        self.line = line

    def where(self):
        source = self.source
        if source is None:
            return "line %s" % self.lineno
        else:
            return "line %d of %r" % (self.lineno, source)
            

# Define different error handlers, and a common error formatter.

def _format_msg(location):
    return "Cannot parse record %s: %s" % (location.where(), location.line)

def ignore_handler(location):
    pass

def warn_handler(location):
    msg = _format_msg(location)
    sys.stderr.write(msg + "\n")

def strict_handler(location):
    msg = _format_msg(location)
    raise ValueError(msg)

error_handlers = {
    "ignore": ignore_handler,
    "warn": warn_handler,
    "strict": strict_handler,
}

# Main API call to read from a passwd file

try:
    _basestring = basestring
except NameError:
    _basestring = str

def read_passwd_entries(source, errors="warn", location=None):
    if isinstance(source, _basestring):
        infile = open(source)
        source_name = source
        close = infile.close
    else:
        infile = source
        source_name = getattr(infile, "name", None)
        close = None
    
    # Get the error handler for the given policy.
    # (A more sophisticated solution might support a user-defined
    # error handler as well as a string.)
    try:
        error_handler = error_handlers[errors]
    except KeyError:
        raise ValueError("Unsupported errors value %r" % (errors,))
    
    if location is None:
        location = Location(source_name)

    return PasswdReader(location,
                        _read_passwd_entries(infile, error_handler, location),
                        close)

# The actual passwd file parser, as a generator used by the PasswdReader.

def _read_passwd_entries(infile, error_handler, location):
    lineno = 0  # define lineno even if the file is empty
    for lineno, line in enumerate(infile, 1):
        line = line.rstrip("\n")  # remove trailing newline
        # Ignore comments and blank lines
        if line[:1] == "#" or line.strip() == "": 
            continue

        # Track where we are
        location.lineno = lineno
        location.line = line

        try:
            name, passwd, uid, gid, gecos, dir, shell = line.split(":")
        except ValueError:
            # Handle the failure
            error_handler(location)
            # If we get here then continue to the next record
            continue
        
        yield PasswdEntry(name, passwd, uid, gid, gecos, dir, shell)

    # Save the final line number
    location.lineno = lineno

# Driver code to help test the library

parser = argparse.ArgumentParser(
    description = "List username/shell entries in a password file")
parser.add_argument("--errors", choices = ["ignore", "warn", "strict"],
                    default = "strict",
                    help = "Specify the error handling policy")
parser.add_argument("--with-lineno", action="store_true",
                    help="include line numbers in the output")
parser.add_argument("filename", nargs="?", default=["/etc/passwd"],
                    help = "Password file to parse")
                  
def main():
    args = parser.parse_args()
    filename = args.filename
    if args.with_lineno:
        output_fmt = "{location.lineno}: {entry.name} => {entry.shell}"
    else:
        output_fmt = "{entry.name} => {entry.shell}"
    try:
        with read_passwd_entries(filename, errors=args.errors) as reader:
            location = reader.location
            try:
                for entry in reader:
                    print(output_fmt.format(entry=entry, location=location))
            except ValueError as err:
                raise SystemExit("ERROR with password entry: %s" % (err,))
    except IOError as err:
        raise SystemExit("ERROR with password file: %s" % (err,))

if __name__ == "__main__":
    main()
The result is a lot more complex than the original 8 line parser, because it's a lot more configurable and provides more information. You can also see how to expand it further, like tracking the current record as another location field. In my chemistry parsers, I also track byte positions, so users can determine that record XYZ123 is between bytes M and N.

Location through accessor functions

Unfortunately, this more complex API, with location tracking, is also a bit slower than the simple parser, and trickier to write. For example, the following lines:

        # Track where we are
        location.lineno = lineno
        location.line = line
always occur, even when the location information isn't needed, because the parser doesn't know if the API caller will need the information. In addition to the extra processing overhead, I found that a more complex format might have multiple branches for the different processing and error paths, each of which needs to set the location information correctly. This is error prone.

There are a few ways to minimize or shift that overhead, at the expense of yet more code. In my chemical structure parsers, I expect that most people won't want any of this information, so I made the location information more expensive to get and less expensive to track.

What I did was to change the Location class so I can register a handler for each attribute. To get the value for "lineno", the location calls the associated handler. This function can be defined inside of the scope of the actual reader function, so it has access to the internal "lineno" variable. This means the reader function doesn't need to set location.lineno, at the expense of an extra function call to get the value.

I'll sketch what I mean, so you can get an idea. This essay is long enough as it is, so I'll only give a flavor and not show the full implementation.

Here's a variation of the Location class, with the ability to register the "lineno" property:

class Location(object):
    def __init__(self, filename=None):
        self.filename = filename
        self._get_lineno = lambda: None
    
    def register(self, **kwargs):
        if "get_lineno" in kwargs:
            self._get_lineno = kwargs["get_lineno"]
    
    def save(self, **kwargs):
        if "lineno" in kwargs:
            self._get_lineno = lambda value=kwargs["lineno"]: value
    
    @property
    def lineno(self):
        return self._get_lineno()
The updated reader code, which registers the handler and cleans up afterwards, is:
def read_passwd_entries(source, errors="warn", location=None):
    ...
    reader = PasswdReader(location,
                        _read_passwd_entries(infile, error_handler, location),
                        close)
    next(reader) # prime the reader
    return reader

def _read_passwd_entries(infile, error_handler, location):
    lineno = 0
    def get_lineno():
        return lineno
    
    location.register(get_lineno = get_lineno)
    yield "ready" # ready to start reading
    
    try:
        for lineno, line in enumerate(infile, 1):
            ....
    finally:
        location.save(lineno=lineno)

Property registration by forcing the generator to start

The biggest complication with this approach is the timing order of location registration. Registration must occur inside of the generator, in order to get access to the local scope. However, the code in the generator won't be executed until the first item is needed.

Consider for example:

with read_passwd_entries(filename) as reader:
    print(reader.location.lineno)
You should expect this to print either 0 or 1 (I think it should print 0). But if the generator hasn't started then the get_lineno isn't registered, so the default value of None will be returned.

Instead, I prime the generator in read_passwd_entries() by doing:

    next(reader)
This forces the generator to execute, up to the first "yield" statement, which is:
    yield "ready" # ready to start reading
(The value of "ready" is thrown away.)

Removing reference cycles

The other complication is the cyclical memory references. The _read_passwd_entries() generator has a reference to the location instance as a local variable, the location instance has a reference to the get_lineno() function, and the get_lineno() function knows the outer scope, which is in the _read_passwd_entries() generator. While Python has garbage collection which can handle those cycles, some of the location properties can be complex objects, like file handles and molecule objects, which use resources that aren't fully exposed to Python.

The try/finally block exists to break the cycle. The finally clause removes the link to get_lineno() and replaces it with the final value for lineno.

Influence from XML processing

My ideas are influenced by the SAX2 API, which you can see by looking at its Locator class, and to a lesser extent the ErrorHandler class. (I once implemented an entire bioinformatics parsing system called Martel, based on SAX events.)

One of the biggest differences is that XML and SAX deal with potentially deep tree structures, while I only work with a linear sequence of records, turned into molecules. My thoughts about an API which is better for this data were originally guided by Fredrik Lundh's iterparse.



PNG checksum #

The chemfp FPB fingerprint format is a chunk-based format. An FPB file starts with an 8 byte signature followed by a sequence of chunks. Each chunk starts with a four character chunk code followed by an 8 byte length, and then that many bytes for the data itself.
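In outline, reading that chunk structure would look something like this sketch. (The byte order, and the details of the loop, are my assumptions, not the FPB specification.)

import struct

def read_chunks(infile):
    infile.read(8)  # skip the 8 byte signature
    while True:
        code = infile.read(4)  # four character chunk code
        if not code:
            break  # end of file
        # 8 byte length (assuming little-endian), then the data itself
        length, = struct.unpack("<Q", infile.read(8))
        yield code, infile.read(length)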

The general idea for this type of format dates back to at least the IFF format from 1985. I first came across it when I read the PNG specification. I liked it so much that I decided to model FPB along those lines.

My first attempt was in 2008(!). I've iterated a few times since then. The result is more IFF-like than PNG-like. This essay describes the reasons for those changes.

4 bytes vs. 8 bytes.

IFF and the PNG specification use a 4 byte chunk length. If the image data doesn't fit into a single chunk then it can be split across several chunks. Actually, quoting from the PNG spec: "Multiple IDAT chunks are allowed so that encoders can work in a fixed amount of memory; typically the chunk size will correspond to the encoder's buffer size." Chunks are also useful because a PNG viewer can update the image display after each chunk, even though the file hasn't been completely downloaded.

4 bytes is only enough for 4 GB. My design goal is 100 million fingerprints. Even 1024 bit fingerprints still take up 12 GB of data. There's very little I can do with only a chunk of the entire data set, and for search performance and ease of programming I want all of the fingerprints in a single contiguous region.

That's why I used an 8 byte chunk length. (I could have used a 6 byte length instead of 8, which sets an upper limit of 256 TB. That's enough space for about 2.2 trillion 1024-bit fingerprints. While that's far more than the 166 billion structures of GDB-17, I figured I would be decadent and use two more bytes.)

chunk CRC

A curious thing about the PNG format, compared to most other IFF-like formats, is the 4-byte CRC, which occurs as the fourth component of the chunk, after the chunk data. Here's how the specification describes it:

A 4-byte CRC (Cyclic Redundancy Check) calculated on the preceding bytes in the chunk, including the chunk type code and chunk data fields, but not including the length field. The CRC is always present, even for chunks containing no data. See CRC algorithm.

Part of me likes this extra integrity check. The question I had to ask myself was, should I carry that idea over to the FPB format?
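For reference, PNG's CRC-32 is the same CRC used by gzip and zlib, so checking a chunk from Python is nearly a one-liner. Here chunk_type, chunk_data, and stored_crc are meant to be the raw values read from the file:

import zlib

def png_chunk_crc_ok(chunk_type, chunk_data, stored_crc):
    # The CRC covers the chunk type code and the chunk data,
    # but not the length field.
    return zlib.crc32(chunk_type + chunk_data) & 0xffffffff == stored_crc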

Why chunk CRCs instead of a file CRC?

The CRC is designed to detect certain types of transmission errors. ("Transmission" includes data storage, which is data transmission to the future.) It can, for example, detect if a burst of up to 32 zeros or 32 ones corrupted the transmission. As a trivial reduction, it can also tell if there was a single corrupted bit.

But why does each chunk have a CRC? Why not the file itself? For that I consulted the PNG mail archive. The short answer: Info-ZIP experience and Usenet.

First off, why does it have a CRC at all? Based on my reading, Greg "Cave Newt" Roelofs proposed a CRC in the first place. Roelofs became the UnZip maintainer with release 4.0, in December 1990, and became involved with PNG development in January 1995. PNG's reference compression code comes from Zip and UnZip. Zip is an archive format, and the CRC helps verify long-term integrity of the zip file.

Roelofs proposed a per-file CRC, but there was a lot of discussion about where to put it. Ed Hamrick suggested "Why not just put a CRC as the last field of every chunk?", and Lee Daniel Crocker listed various advantages to a per-chunk CRC:

 1. Streamability.  Streaming applications that read/write chunks do
    not have to maintain state information across chunks. ...

 2. Simpler file changes. ...

 3. You can make changes to comments, gamma, and other chunks with a
    binary editor, and programs won't think the whole file is bad. ...

 4. More data integrity. While a CRC-32 is likely to catch changes in
    even a very large file, it is mathematically guranteed to catch 1-
    and 2-bit errors in an 8k chunk (I don't the book with me right now,
    but I think those are the right numbers).

 5. You'll never hear me argue about where to put the CRC again, and
    the nesting problem with future extensions is gone, and Newt gets to
    keep his "chunk concept".  Hell, if a 0.0005 increase in file size
    can just cut the message traffic on this list down, it's worth it.

 6. You can cut the file size issue in half by using 16-bit chunk
    lengths (a separate argument I lost earlier but now have more
    weight behind), and eliminate it entirely by doing that and using
    CRC-16s.  I personally favor the former but not the latter, and
    don't really care that much either way.  That will even end the
    continual "this is just IFF" noise on comp.graphics too.
There was a bit of back-and-forth about using CRC-16 vs CRC-32 vs. something stronger, like MD5. It was Alexander Lehmann who described the appropriate engineering tradeoffs for PNG:
IFF doesn't contain any checksumming, validity checks can be done with length checking (unexpected EOF), but that is not sufficient for network (especially multipart USENET) transmission.

If an uuencoded file contains truncated lines, this goes unnoticed by length checks, unless the missing chars are contained in a header or length chunk. (unless the image data is decompressed, but I would prefer a file checker that doesn't burn excess cpu cycles just to check if the uncompressed data has the correct size)

So there you go: chunk CRCs were added to help detect transmission errors over multi-part Usenet transmissions without needing to wait until the end to detect failures.

Multi-part Usenet transmissions?

If you don't recall the glory days of Usenet, it is a distributed newsgroup system designed around text. Post sizes were limited. If you wanted to send a large image, or application binary, or directory archive, you needed to split it up into multiple parts, encode it (usually uuencoded) and post each part separately.

For example, the first public release of Python was sent to alt.sources in 21 files. The first was titled "Python 0.9.1 part 01/21", followed by 02, etc.

These could be decoded and reassembled either manually or with various automated tools. Some newsreaders could even figure this out automatically. But sometimes things went missing or were incomplete.

Given this sort of noisy environment, I think you can understand better why a chunk CRC might be useful.

Is CRC good enough?

I wondered: is CRC good enough? Should I use something stronger than CRC? How often do PNG failures occur?

We know that bit errors exist. Artem Dinaburg, in Bitsquatting: DNS Hijacking without exploitation, discusses one of the effects of single-bit errors in DNS resolution, with links to previous work including a Verisign paper estimating the number of network errors in DNS queries (a bit error rate of 10^-9 is about what we'd expect for typical data communication circuits), and how to use bit errors to break out of a virtual machine.

But those are typically taken care of by the network layer. On the other hand, Bram Cohen, in his Stanford EE380/2005 talk "Under the hood of BitTorrent" (video no longer available, or at least, I can't find it) mentioned how TCP's checksum isn't good enough once you start swapping terabytes of data over the Internet, so BitTorrent had to add additional validation.

I asked on the PNG mailing list, but no one recalled seeing an error in over 10 years.

Going back to the original design work, Roelofs commented that logical though a per-chunk CRC might be, it's overkill - and that's coming from the guy who proposed (demanded) CRCs in the first place. I agree. There's no need to have as many CRC validations as PNG has.

My data files are at most about 10 GB, and more like 0.5 GB. I concluded that was nowhere near the point where I had to worry about sophisticated or advanced error detection.

Single bit failures are possible in the PNG format

By the way, there's a subtle framing issue with a chunk CRC. The length field determines where to find the CRC, but is not part of the CRC. If one of the length bits changes then there's a chance - a very small chance - that the bytes at the new position have a valid CRC for the intervening bytes.

There's an even smaller chance that the new chunk will contain valid data for that chunk type, so this isn't something to worry about for real-world data.

On the other hand, if you think about it for a bit you'll realize that it's possible to construct a valid PNG such that a single bit change will still generate a valid PNG. Sketch: add a private ancillary chunk whose payload is the alternate PNG, plus a partial final chunk designed to skip the rest of the original PNG. It will need to be a little clever to include the bytes needed to resynchronize with another CRC, but that isn't very hard.

Sadly, I couldn't figure out a way to use this for evil.

When to validate the CRC? What to do when it fails?

What happens when the CRC fails?

Most PNG readers exist to display the image. They need to read all of the image data, which means reading and decoding nearly all of the chunks in a data file. The overhead for CRC validation is trivial. The main question for the reader is simply, what to do when it fails?

If you use libpng then you call png_set_crc_action() to tell it what to do on validation failure. The actions include "warn, and use the data", "use the data without warning", "warn and discard the data" (for non-critical chunks), and "exit with an error." All reasonable actions for different circumstances.

My fingerprint data is a bit different than image data. For example, a data file with 1.3 million 1024-bit fingerprints takes 217 MB. Running wc -l on that file takes 0.2 seconds, which gives a lower bound on the time needed to compute the CRC.

Not only is my data set larger than a typical image, but I don't always need to use all of the data. Simsearch uses the sublinear search algorithm described by Swamidass and Baldi, which greatly reduces the search space for high threshold searches. As a result, typical search time is around 1/100th of a second, well less than the time needed to verify any CRC.

This tells me that FPB-based applications will rarely want to validate the fingerprint chunk, much less the entire file. Experience tells me that if validation fields are rarely used, then they are rarely tested, and more likely to be error-prone. In fact, that's likely why there's a "warn and use the data" option for libpng.

No chunk CRC for FPB

I concluded that it's not worthwhile to follow PNG's lead and store a CRC with each chunk. Generating the chunk CRC is a bit of a hassle, with very little gain.

Instead, I'll wait to see real-world failure cases before deciding if I need any sort of validation, or possibly even error correction. Should that happen, I'll define a new chunk type which can store validation data about the preceding chunks, and place it just before the end of the FPB file.



chemfp's FPB format #

Chemfp 1.2 supports a new fingerprint file format, called the "FPB" format. It's designed so the fingerprints can be memory-mapped directly to chemfp's internal data structures. This makes it very fast, but also internally complicated. Unlike the FPS format, which is designed to exchange fingerprints between diverse programs, the FPB format is a binary application format. Internally it's a chunk-based container similar to PNG, the Interchange File Format, and other type-length-value formats. I'll talk more about the details in a future essay.

chemfp business model

Chemfp is a package for cheminformatics fingerprint generation and high-speed Tanimoto search. Version 1.1 is available for free, under the MIT license. Version 1.2 is the first release with my new business model: "free software for a fee." It's still under the MIT license, but you need to pay to get a copy of it.

Previously the commercial and no-cost versions were the same version, but who wants to pay for something that's available for nothing? Many free software projects suffer from a resource problem because it's hard to get funding when you don't charge anything for the software. But people will pay for access to useful features, and that income goes into support and additional development. If all goes well, I'll release older commercial versions as no-cost versions after a few years.

FPS files take a couple seconds to load

Perhaps the most widely useful new feature in chemfp-1.2 is the FPB file format, which complements the FPS file format. The FPS format is a human-readable text format which is easy to generate and parse, but it's not designed to be fast to read and write. I'll show you what I mean using ChEMBL 18 as my target database and Open Babel's 1021-bit FP2 fingerprints.

I'll create the fingerprints in FPS format:

% ob2fps chembl_18.sdf.gz --id-tag chembl_id -o chembl_18_FP2.fps
% head -7 chembl_18_FP2.fps
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/2.3.90
#source=chembl_18.sdf.gz
#date=2014-07-08T22:07:54
20000120200a0010402006040c00000064000000000220c80080104c03104c01000041
0000488021808002180a000000000020001800200084348082000802010c000c000320
020409000000080000041000017004cb10009340000000010000888012000001004010
20000420020029100f8010000900800008010002000300      CHEMBL153534
then do a similarity search, asking it to report the search times to stderr:
% simsearch --query 'Cn1cnc2c1c(=O)n(c(=O)n2C)C' chembl_18_FP2.fps --time
#Simsearch/1
#num_bits=1021
#type=Tanimoto k=3 threshold=0.7
#software=chemfp/1.2b2
#targets=chembl_18_FP2.fps
#query_sources=chembl_18.sdf.gz
#target_sources=chembl_18.sdf.gz
3	Query1	CHEMBL113	1.00000	CHEMBL1767	0.97222	CHEMBL74063	0.97183
open 0.00 search 2.15 total 2.15
The "--query" command-line parameter is new in chemfp-1.2. It takes a SMILES string by default. Simsearch looks at the fingerprint type line of the header to get the appropriate toolkit and generate the corresponding fingerprint for that structure query record.

Why aren't FPS searches faster?

Similarity search in chemfp is supposed to be fast. Why does it take over 2 seconds to search 1,352,681 records? Answer: nearly all of the time is spent reading and parsing the FPS file. Just doing a "wc -l" on the file takes 0.5 seconds, so file I/O alone sets an upper bound on performance, unless I switch to an SSD.

This is why Noel O'Boyle's head-to-head timing comparison against Open Babel's own "fastsearch" finds that they have similar search times for single-query searches; both simsearch and fastsearch are mostly I/O- and parser-bound.
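
To see where the parsing time goes, here's a deliberately simplified FPS record reader. It's a sketch of the idea, not chemfp's actual parser, and it skips the header handling a real reader needs:

from binascii import unhexlify

fingerprints = []
with open("chembl_18_FP2.fps") as f:
    for line in f:
        if line[:1] == "#":
            continue  # skip header lines
        # Each record is "<hex fingerprint><tab><id>", so every one
        # of the 1,352,681 lines needs a split and a hex decode.
        hex_fp, _, record_id = line.rstrip("\n").partition("\t")
        fingerprints.append((record_id, unhexlify(hex_fp)))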

Use 'fpcat' to convert from FPS to FPB format

I'll do the same test, but with the FPB format. I could ask ob2fps to write an FPB file instead of an FPS file by simply changing the extension of the output file, like this:

% ob2fps chembl_18.sdf.gz --id-tag chembl_id -o chembl_18_FP2.fpb
However, this will re-parse the structure and recompute the fingerprints, which takes a long time.

Since I already have the fingerprints in the FPS file, I'll instead use the new "fpcat" program to convert from FPS format to FPB format.

% fpcat chembl_18_FP2.fps -o chembl_18_FP2.fpb
The conversion took about 12 seconds to run. The FPB format is pre-sorted and indexed by population count, to enable sublinear similarity search directly on the file, and the fingerprints are word-aligned for fast popcount calculations.
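
The popcount index works because of a simple bound: if the query has a bits set and a target has b, the Tanimoto score can be at most min(a, b)/max(a, b). Here's a minimal sketch of the resulting search window; my own illustration, not chemfp's internal code:

import math

def popcount_window(query_popcount, threshold):
    # Only targets whose popcount lies in [ceil(T*a), floor(a/T)]
    # can possibly reach threshold T; the sorted index lets the
    # search skip every other popcount band entirely.
    a = query_popcount
    return int(math.ceil(a * threshold)), int(math.floor(a / threshold))

print(popcount_window(250, 0.7))  # -> (175, 357)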

You can get a sense of the popcount ordering by using "fpcat" to view the contents of the FPB file as an FPS file:

% fpcat chembl_18_FP2.fpb | head -6
#FPS1
#num_bits=1021
#type=OpenBabel-FP2/1
#software=OpenBabel/2.3.90
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000	CHEMBL17564
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000	CHEMBL1098659
% fpcat chembl_18_FP2.fpb | tail -1
373e2327965282fe3795e7443613480cb59dd5d47164f8cc11ac2bbe91be9f873c09f1
dfaff7b0ffb09eb7fb21993243e3dbea4038a0e011a60be22f229e2634ced97e0a0e7c
9b832ffbb502d6a8139e08bccffaf5bb3ba8f36edf23814fe2953ff77738e10615f32a
e09040f7d42cbf510f15df765b3a6279f802471a86cb04	CHEMBL2368798

FPB searches are much faster

Simsearch accepts FPS and FPB files. By default it figures out the file type by looking at the file extension. I'll pass in the .fpb version:

% time simsearch --query 'Cn1cnc2c1c(=O)n(c(=O)n2C)C' chembl_18_FP2.fpb --time
#Simsearch/1
#num_bits=1021
#type=Tanimoto k=3 threshold=0.7
#software=chemfp/1.2b2
#targets=chembl_18_FP2.fpb
3	Query1	CHEMBL113	1.00000	CHEMBL1767	0.97222	CHEMBL74063	0.97183
open 0.00 search 0.00 total 0.00
0.305u 0.070s 0:00.41 90.2%	0+0k 0+0io 0pf+0w
Yes, FPB search takes less than 1/100th of a second. I wrapped everything in the "time" command to show you that the whole run takes 0.4 seconds. Much of that extra time (about 0.25 seconds) is spent waiting for Open Babel to load its available file formats from my hard disk, but there's also the overhead of starting Python and importing chemfp's own modules.

The slowest part is loading Python and Open Babel

In fact, I'll break it down so you can get a sense of how long each part takes:

# Python startup
% time python -c "pass"
0.011u 0.007s 0:00.01 100.0%	0+0k 0+8io 0pf+0w

# Open Babel extension overhead
% time python -c "import openbabel"
0.027u 0.013s 0:00.04 75.0%	0+0k 0+3io 0pf+0w

# Overhead for Open Babel to load the available formats
% time python -c "import openbabel; openbabel.OBConversion()"
0.233u 0.021s 0:00.25 100.0%	0+0k 0+0io 0pf+0w

# Chemfp import overhead to use Open Babel
% time python -c "from chemfp import openbabel_toolkit"
0.281u 0.032s 0:00.31 100.0%	0+0k 0+26io 0pf+0w
In other words, about 0.3 seconds of the 0.4 seconds is used to get Python and Open Babel to the point where chemfp can start working.
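
In a long-running process you pay that startup cost only once. Here's a rough sketch of what that looks like with chemfp's Python API; the function names are from the chemfp documentation, but treat the details as approximate and check them against your version:

import chemfp
from chemfp import search

# Load (or, for an FPB file, memory-map) the targets once.
targets = chemfp.load_fingerprints("chembl_18_FP2.fpb")

# Then run as many queries as you like with no per-query startup.
# For simplicity, reuse the first target fingerprint as the query.
query_id, query_fp = next(iter(targets))
result = search.knearest_tanimoto_search_fp(query_fp, targets,
                                            k=3, threshold=0.7)
for target_id, score in result.get_ids_and_scores():
    print("%s %.5f" % (target_id, score))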

When is the FPB format useful?

If you already have a fingerprint (so no toolkit overhead), or have an SSD, then the total search time on the command-line is less than 0.1 seconds.

For a command-line user, this is great because you can easily integrate similarity searches into scripts and other command-line tools.

It's also useful for web development. Of course, a web server in production rarely restarts, so the load time isn't critical there. But as you develop the server you end up restarting it often. Many web application frameworks, including Django, auto-reload the application server every time a file changes. It's annoying to wait even two seconds for 1.3 million records; imagine how much more annoying it is with a few 5-million-record fingerprint sets.

Switch to the FPB format, and reloads become fast again. Even better, because FPB files are memory-mapped, the operating system can share the same memory between multiple processes. This means you can run multiple servers on the same machine without using extra memory.
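
That sharing comes straight from the operating system's virtual memory system. Here's the mechanism in miniature, using plain mmap rather than chemfp's actual loader:

import mmap

with open("chembl_18_FP2.fpb", "rb") as f:
    # Map the file instead of reading it: the OS pages data in on
    # demand, so the "load" is nearly instant, and all processes
    # mapping the same file share one copy of the pages in RAM.
    fpb = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(repr(fpb[:4]))  # touches only the first page
    fpb.close()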

Combine these together and you'll see that even CGI scripts can now include similarity search functionality with good performance. (Yes, there are still good reasons for using CGI scripts. The last one I developed was only two years ago, as a drop-in replacement for a system developed 10 years earlier.)

More about chemfp

If you're interested in chemfp, see the product page and download version 1.1 to evaluate the performance. If you're interested in paying for access to version 1.2, email me.


