Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2014/09/24/new_chemfp_apis

New chemfp APIs

In the new few essays I'll cover chemfp's new APIs. I'll have a different focus here than will be in the official documentation. I will be writing for the people interested in chemistry toolkit API development, and with a focus on the I/O subsystem in such a toolkit. I've spent about 15 years experimenting and evaluating different API ideas across a range of use cases. This new API is a synthesis of those ideas. I hope it proves influential.

Why does chemfp have these new APIs?

Chemfp is a multi-platform Python library for cheminformatics fingerprint generation and search. It depends on OEChem/OEGraphSim, RDKit, or Open Babel to parse a structure file and generate fingerprints, and has its own highly optimized search algorithms.

Chemfp supports multiple toolkits because many of the people I know who use fingerprints use fingerprints from multiple toolkits. While this is biased by the people I know, in general, I've seen that most people who uses a similarity-based search as part of their science spend some time to figure out which fingerprint type, and which parameters, are most appropriate for that research.

Unfortunately, this may mean a lot of fiddling around with different toolkits, since each has a different API and different take on the problem. My belief is that more people will use chemfp (and pay me money) if it really makes it easier to handle fingerprints across different toolkits. The hard part is that no one wants to learn yet another system except perhaps if it's significantly better. I think I've done that.

Example

Here's an example program using new API. It reads in a SMILES file, compute OpenEye's circular fingerprints (with the defaults settings) and save the results to an FPS file.

import chemfp

fptype = "OpenEye-Circular"
source = "benzodiazepine.smi"
destination = "benzodiazepine.fps"

with chemfp.read_molecule_fingerprints(fptype, source) as reader:
    with chemfp.open_fingerprint_writer(destination,
                                        metadata=reader.metadata) as  writer:
        writer.write_fingerprints(reader)
I wrote that in a more formal style, using Python's "with" statement to ensure that the files are closed when no longer needed. While this gets all of the details correct, it is verbose. You might even call it excessive, since you can get the basic result using:
import chemfp

chemfp.open_fingerprint_writer("benzodiazepine.fps").write_fingerprints(
    chemfp.read_molecule_fingerprints("OpenEye-Circular", "benzodiazepine.smi"))
The only major difference between the two is that the output fingerprint file in the latter doesn't contain any metadata.

Fingerprint family and fingerprint type API

Object-oriented programming uses the words "class" and "instance". A class is the more abstract concept. It describes what's possible. An instance fills in the details. For example, "integers" is the class of integer numbers, and "38" is an instance of that class.

Chemfp-1.2 makes a similar distinction between a "fingerprint family" and a "fingerprint type". A family might be "circular fingerprints" and a type is "circular fingerprints of size 2".

Internally, chemfp has a fingerprint family registry. To get information about all of the available fingerprint families, use get_fingerprint_families(), and use get_fingerprint_family() to get a specific family by name:

>>> import chemfp
>>> chemfp.get_fingerprint_families()
[FingerprintFamily(<ChemFP-Substruct-OpenBabel/1>),
FingerprintFamily(<ChemFP-Substruct-OpenEye/1>),
FingerprintFamily(<ChemFP-Substruct-RDKit/1>),
FingerprintFamily(<OpenBabel-FP2/1>),
  ...
FingerprintFamily(<RDMACCS-OpenBabel/1>),
FingerprintFamily(<RDMACCS-OpenEye/1>),
FingerprintFamily(<RDMACCS-RDKit/1>)]
>>> family = chemfp.get_fingerprint_family("OpenEye-Circular")
>>> family
FingerprintFamily(<OpenEye-Circular/2>)
(This make take a while because the registry uses a lazy resolver which doesn't load or probe the underlying toolkits until first requested.)

You can ask a fingerprint family for its default values, and make a new fingerprint type using those defaults.

>>> fptype = family()
>>> family.get_defaults()
{'numbits': 4096, 'minradius': 0, 'atype': 4239, 'maxradius': 5, 'btype': 1}
>>> fptype
<chemfp.openeye_types.OpenEyeCircularFingerprintType_v2 object at 0x10dd0ee90>
Every fingerprint type has a type string, which is the canonical string representation of the fingerprint family name, the chemfp version number for that family, and the fingerprint type parameters.
>>> fptype.get_type()
'OpenEye-Circular/2 numbits=4096 minradius=0 maxradius=5 
 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HCount btype=Order'
(This is really one long string. I added a newline to make it easier to read.)

You aren't restricted to the defaults. I'll use 1024 bits/fingerprint and a maximum radius of 3 instead of the defaults of 4096 and 5:

>>> fptype = family(numbits=1024, maxradius=3)
>>> fptype.get_type()
'OpenEye-Circular/2 numbits=1024 minradius=0 maxradius=3
 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HCount btype=Order'

If you have a type string then you can get the fingerprint type directly using chemfp's get_fingerprint_type(). The fingerprint type string does not need to be in canonical form nor include all of the parameters.

>>> fp2type = chemfp.get_fingerprint_type("OpenBabel-FP2")
>>> fp2type.get_type()
'OpenBabel-FP2/1'
>>> 
>>> rdktype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2")
>>> rdktype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> rdktype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2 fpSize=512 maxPath=5")
>>> rdktype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=5 fpSize=512 nBitsPerHash=2 useHs=1'

Fingerprint writer API

Chemfp has always been able to write fingerprints to an FPS file, but I thought the API was too clumsy for public use so I left it as part of the undocumented internals.

I'm much happier about the new fingerprint writer API, which is based on a writer object with a method to save a single fingerprint at a time, and another method to save multiple fingerprints. In this example I asked it write to None, which means to write to stdout, so you could see the output from each methods call.

>>> writer = chemfp.open_fingerprint_writer(None)
#FPS1
>>> writer.write_fingerprint("zeros", "\0\0\0\0")
00000000        	zeros
>>> writer.write_fingerprints( [("fp1", "\0\0\0\1"), ("fp2", "\0\0\0\2")] )
00000001        	fp1
00000002        fp2
>>> writer.close()

Version 1.2 also introduces a new binary file format, the FPB format. The writer API knows how to write an FPB file and will do so if the format ends with ".fpb" or told specifically to write an FPB file.

Toolkit APIs

The toolkit API is the collective name for the format API, the molecule I/O API, and the molecule API. There is one toolkit API implementation for each supported chemistry toolkit, which is basically a wrapper from the chemfp API to the underlying toolkit API. Use get_toolkit_names() to get a list of the available chemistry-based toolkits:

>>> chemfp.get_toolkit_names()
set(['openeye', 'rdkit', 'openbabel'])
>>> T = chemfp.get_toolkit("openeye")
>>> T.is_licensed()
True
Instead of get_toolkit() you can import a toolkit directly in Python:
>>> from chemfp import openeye_toolkit as T
>>> T.is_licensed()
True
>>> T.name
'openeye'

In addition to the chemistry-based toolkits, the "text" toolkit implements just enough of the toolkit API to process SMILES and SDF records. It doesn't not know chemistry; it stores the contents of each record as a string. I wrote it to handle a few use cases that couldn't be handled with the molecule centered toolkit. I'll discuss these cases in the future.

Format API

The get_formats() function lists the available formats for a given toolkit.

>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.get_formats()
[Format('rdkit/canstring'), Format('rdkit/usmstring'),
Format('rdkit/smistring'), Format('rdkit/molfile'),
Format('rdkit/usm'), Format('rdkit/sdf'), Format('rdkit/can'),
Format('rdkit/smi'), Format('rdkit/rdbinmol')]
There's also a get_input_formats() and get_output_formats(), or you can use attributes of the format determine if the toolkit understand the format as an input or output, as well as a few other properties.

You can use the format API to figure out the toolkit's (or at least chemfp's) best guess for the format for a given filename. This includes compression detection.

>>> rdkit_toolkit.get_input_format_from_source("example.smi.gz")
Format('rdkit/smi.gz')

Molecule input API

This is by far the most extensive of the new APIs in chemfp-1.2. Previously there was read_molecule_fingerprints() which took a fingerprint type name and the source and returned fingerprints for each record. Suppose though that you want to read a structure file and generate fingerprints using several three RDKit fingerprint types. The older API requires reading the file three different types, which means parsing the same record three different times, even though the same RDKit molecule could be reused for each.

The new API lets you separate the process of reading molecules from generating fingerprints, which can be a nice performance boost.

The old API could only take input from files, or file-like sources. The new API has more extensive support for reading from a string. This is useful in a web application, where the input might be text field from a submitted form.

In practice, there several variations on how to read an input, depending on if it comes from a file or a string or if you want to read a single record (the first one) or all records. Sometimes you want the really molecule id when reading a molecule and sometimes you really don't that overhead; I ended up implementing both of those readers. Finally, there are also cases where you want to convert a lot of records into a molecule, but each record is stored into its own string. The make_id_and_molecule_parser() creates a specialized parser for the case.

Different toolkit reader API functions
  ...from a file ...from a string
Read 0 or more molecules ... read_molecules() read_molecules_from_string()
Read 0 or more (id, molecule) pairs ... read_ids_and_molecules() read_ids_and_molecules_from_string()
Parse the first molecule ...   parse_molecule()
Parse the first molecule and its id...   parse_id_and_molecule()
Make an optimized function to parse the
first molecule and its id...
  make_id_and_molecule_parser()

Molecule output API

Chemfp-1.2 adds new APIs to save molecules to a structure file or convert a molecule into a record. Like the reader, there are several variations in the writer.

The open_molecule_writer() and open_molecule_writer_to_string() functions return a writer object, though in the first case the writer saves to a file and in the latter it saves to memory that can be accessed as a string. The writer object has methods to save a single molecule at a time, to save multiple molecules at once, and to save a molecules using an alternate id.

To demonstrate, here's a program which reads the compressed ChEBI SD file and saves it to a compressed SMILES file. I specify the id tag because ChEBI stores its identifiers in the "ChEBI ID" SD tag, and not the title:

from chemfp import openeye_toolkit as T

with T.read_ids_and_molecules("ChEBI_complete.sdf.gz", id_tag="ChEBI ID") as reader:
    with T.open_molecule_writer("chebi.smi.gz") as writer:
        writer.write_ids_and_molecules(reader)

When you only need to turn a molecule into a record, use the create_string() function. If you need to do a lot of these conversions, all with the same parameters, then use make_string_creator(). This function returns returns a specialized function which is better optimized to turn a molecule into a string.

Molecule API

Chemfp does not try to make a common molecule API across the differnet toolkits. For that, see Noel O'Boyle's "cinfony". In fact, it's quite the opposite. The readers, writers, and fingerprint generation functions take native molecule types and not wrapped objects.

All chemfp offers are a small set of functions to get what the toolkit considers to be a molecule's title, to do some simple manipuation of the SD tags for a record, and to make a copy of a molecule.

Connecting fingerprint types and toolkits to make a fingerprint

If you have a fingerprint type string, and a SMILES string, how to you generate the fingerprint for it? Each toolkit fingerprint type has a "compute" function, which takes a toolkit molecule and returns the fingerprint string, so that's in the right direction. You still need to know which toolkit to use to convert the SMILES into the right molecule.

Well, actually you don't. The fingerprint type's parse_fingerprint() will parse a molecule and generate the fingerprint for you:

>>> fptype = chemfp.get_fingerprint_type("OpenBabel-FP2")
>>> fptype.parse_fingerprint("c1ccccc1O", "smi")
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x80\x00\x00\x00
\x00\x00\x00@\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00'
This helper function works because the fingerprint type knows its toolkit, through the "toolkit" attribute. This attribute is part of the public API, so you could do the two steps yourself if you want to:
>>> fptype.toolkit
<module 'chemfp.openbabel_toolkit' from 'chemfp/openbabel_toolkit.pyc'<
>>> mol = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
>>> fptype.compute_fingerprint(mol)
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x80\x00\x00\x00
\x00\x00\x00@\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x08
\x00\x00\x00\x00\x00\x00\x00'


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB