Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2020/09/24/chemfp_toolkit_api

chemfp's chemistry toolkit I/O API

This is part of a series of essays about working with SD files at the record and simple text level. In the last two essays I showed examples of using chemfp to process SDF records and to read two record data items. In this essay I'll introduce chemfp's chemistry toolkit I/O API, which I developed to have a consistent way to handle structure input and output when working with the OEChem, RDKit, and Open Babel toolkits.

You can follow along yourself by installing chemfp (under the Base License Agreement) using:

python -m pip install chemfp -i https://chemfp.com/packages/

chemfp is a package for high-performance cheminformatics fingerprint similarity search. You'll also need at least one of the chemistry toolkits I mentioned.

Add an SDF data item using the native APIs

Every cheminformatics toolkit deserving of that description can add properties to an SDF record. Here's how to do it in several different toolkits, using the input file chebi16594.sdf (a modified version of CHEBI:16594) which contains the following:

CHEBI: 16594

Shortened for demonstration purposes.
  9  8  0  0  0  0  0  0  0  0  2 V2000
   19.3348  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   20.4867  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   22.7903  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   23.9421  -19.3671    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   22.7903  -17.3721    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.1830  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3348  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  2  1  0  0  0  0
  4  3  1  0  0  0  0
  5  4  1  0  0  0  0
  6  4  2  0  0  0  0
  7  3  1  0  0  0  0
  8  1  1  0  0  0  0
  9  1  1  0  0  0  0
M  CHG  1   5  -1
M  END
> <ChEBI ID>
CHEBI:16594

> <ChEBI Name>
2,4-diaminopentanoate

$$$$

For each toolkit I'll add an "MW" data item where the value is the molecular weight, as determined by the toolkit.

OEChem

For OEChem I create an oemolistream ("OpenEye molecule input stream") with the given filename. By default it auto-detects the format from the filename extension. The oemolistream's GetOEGraphMols() returns a molecule iterator. I'll use next() to get the first molecule, then iterate over the data items to report the existing data items:

>>> from openeye.oechem import *
>>> mol = next(oemolistream("chebi16594.sdf").GetOEGraphMols())
>>> [(data_item.GetTag(), data_item.GetValue()) for data_item in OEGetSDDataPairs(mol)]
[('ChEBI ID', 'CHEBI:16594'), ('ChEBI Name', '2,4-diaminopentanoate')]

The OECalculateMolecularWeight() function computes the molecule weight, so I'll use that to add an "MW" data item (with the weight rounded to 2 decimal digits), check that the item was added, then write the result to stdout in SD format:

>>> OECalculateMolecularWeight(mol)
131.15303999999998
>>> OEAddSDData(mol, "MW", f"{OECalculateMolecularWeight(mol):.2f}")
True
>>> [(data_item.GetTag(), data_item.GetValue()) for data_item in OEGetSDDataPairs(mol)]
[('ChEBI ID', 'CHEBI:16594'), ('ChEBI Name', '2,4-diaminopentanoate'), ('MW', '131.15')]
>>> ofs = oemolostream()
>>> ofs.SetFormat(OEFormat_SDF)
True
>>> OEWriteMolecule(ofs, mol)
CHEBI:16594
  -OEChem-09242013332D
Shortened for demonstration purposes.
  9  8  0     0  0  0  0  0  0999 V2000
   19.3348  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   20.4867  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   22.7903  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   23.9421  -19.3671    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   22.7903  -17.3721    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.1830  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3348  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  3  4  1  0  0  0  0
  4  5  1  0  0  0  0
  4  6  2  0  0  0  0
  3  7  1  0  0  0  0
  1  8  1  0  0  0  0
  1  9  1  0  0  0  0
M  CHG  1   5  -1
M  END
> <ChEBI ID>
CHEBI:16594

> <ChEBI Name>
2,4-diaminopentanoate

> <MW>
131.15

$$$$
0

That final 0 is the interactive Python shell printing the return value of OEWriteMolecule. It is not part of what OEChem wrote to stdout.

RDKit

In RDKit you need to know which file reader to use for a given file, in this case, FowardSDMolSupplier(). (An upcoming release will offer a generic reader function which dispatches to the appropriate file reader.) The reader is a molecule iterator so again I'll use next() to get the first molecule, then see which data items are present:

>>> from rdkit import Chem
>>> mol = next(Chem.ForwardSDMolSupplier("chebi16594.sdf"))
>>> mol.GetPropsAsDict()
{'ChEBI ID': 'CHEBI:16594', 'ChEBI Name': '2,4-diaminopentanoate'}

I'll use Descriptors.MolWt() to compute the molecular weight and set the "MW" data item. You can see that even though I set the MW as a string, GetPropsAsDict() returns it as a float. This is because GetPropsAsDict() will try to coerce strings which look like floats or integers into native Python floats or integers (including "nan" and "-inf"). To prevent coercion, use the GetProp() method:

>>> from rdkit.Chem import Descriptors
>>> Descriptors.MolWt(mol)
131.155
>>> mol.SetProp("MW", f"{Descriptors.MolWt(mol):.2f}")
>>> mol.GetPropsAsDict()
{'ChEBI ID': 'CHEBI:16594', 'ChEBI Name': '2,4-diaminopentanoate', 'MW': 131.16}
>>> [(name, mol.GetProp(name)) for name in mol.GetPropNames()]
[('ChEBI ID', 'CHEBI:16594'), ('ChEBI Name', '2,4-diaminopentanoate'), ('MW', '131.16')]

Finally, I'll write the molecule to stdout.

>>> import sys
>>> writer = Chem.SDWriter(sys.stdout)
>>> writer.write(mol)
>>> writer.close()
CHEBI:16594
     RDKit          2D

  9  8  0  0  0  0  0  0  0  0999 V2000
   19.3348  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   20.4867  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -19.3671    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   22.7903  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   23.9421  -19.3671    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   22.7903  -17.3721    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.1830  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3348  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  2  1  0
  4  3  1  0
  5  4  1  0
  6  4  2  0
  7  3  1  0
  8  1  1  0
  9  1  1  0
M  CHG  1   5  -1
M  END
>  <ChEBI ID>  (1)
CHEBI:16594

>  <ChEBI Name>  (1)
2,4-diaminopentanoate

>  <MW>  (1)
131.16

$$$$

Open Babel

Open Babel, because of the pybel interface, is the easiest of the bunch. The following uses Open Babel 3.0, which moved pybel to a submodule of openbabel. I ask readfile() to open the given file as "sdf" format. That returns an iterator. I get the first molecule. It has a special "data" attribute with the SD data items combined with some internal Open Babel data items (RDKit does the same thing, but by default they are hidden)

>>> from openbabel import pybel
>>> mol = next(pybel.readfile("sdf", "chebi16594.sdf"))
>>> mol.data
{'MOL Chiral Flag': '0', 'ChEBI ID': 'CHEBI:16594', 'ChEBI Name': '2,4-diaminopentanoate',
'OpenBabel Symmetry Classes': '8 5 7 9 1 6 3 2 4'}

Pybel molecules have a molwt attribute containing the molecule weight, or I can compute it via the underlying OpenBabel OBMol object. I save it to the data attribute object, export the contents as an string in "sdf" format, and write the output to stdout, asking print() to not include the terminal newline:

>>> mol.molwt
131.15304 
>>> mol.OBMol.GetMolWt()
131.15304 
>>> mol.data["MW"] = f"{mol.molwt:.2f}"
>>> mol.data
{'MOL Chiral Flag': '0', 'ChEBI ID': 'CHEBI:16594', 'ChEBI Name': '2,4-diaminopentanoate',
'OpenBabel Symmetry Classes': '8 5 7 9 1 6 3 2 4', 'MW': '131.15'}
>>> print(mol.write("sdf"), end="")
CHEBI:16594
 OpenBabel09242014012D
Shortened for demonstration purposes.
  9  8  0  0  0  0  0  0  0  0999 V2000
   19.3348  -19.3671    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   20.4867  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -19.3671    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   22.7903  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   23.9421  -19.3671    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
   22.7903  -17.3721    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   21.6385  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.1830  -18.7021    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3348  -20.6971    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  2  1  0  0  0  0
  4  3  1  0  0  0  0
  5  4  1  0  0  0  0
  6  4  2  0  0  0  0
  7  3  1  0  0  0  0
  8  1  1  0  0  0  0
  9  1  1  0  0  0  0
M  CHG  1   5  -1
M  END
>  <ChEBI ID>
CHEBI:16594

>  <ChEBI Name>
2,4-diaminopentanoate

>  <MW>
131.15

$$$$

chemfp's chemistry toolkit API

Chemfp supports Open Babel, OEChem+OEGraphSim, and RDKit. Each toolkit has its own way of handling chemical structure I/O. Following the fundamental theorem of software engineering, I "solved" the problem by introducing an extra level of indirection - I created a chemistry toolkit I/O API and developed wrapper implementations for each of the underlying chemistry toolkits.

Here's a side-by-side comparison:

Comparison of toolkit native and chemfp wrapper APIs
OEChem
nativechemfp
from openeye.oechem import *

mol = next(oemolistream("chebi16594.sdf").GetOEGraphMols())
mw = OECalculateMolecularWeight(mol)
OEAddSDData(mol, "MW", f"{mw:.2f}")
ofs = oemolostream()
ofs.SetFormat(OEFormat_SDF)
OEWriteMolecule(ofs, mol)
from chemfp import openeye_toolkit as OETK
from openeye.oechem import OECalculateMolecularWeight

mol = next(OETK.read_molecules("chebi16594.sdf"))
mw = OECalculateMolecularWeight(mol)
OETK.add_tag(mol, "MW", f"{mw:.2f}")
print(OETK.create_string(mol, "sdf"), end="")
RDKit
nativechemfp
import sys
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = next(Chem.ForwardSDMolSupplier("chebi16594.sdf"))
mw = Descriptors.MolWt(mol)
mol.SetProp("MW", f"{mw:.2f}")

writer = Chem.SDWriter(sys.stdout)
writer.write(mol)
writer.close()
from chemfp import rdkit_toolkit as RDTK
from rdkit.Chem import Descriptors

mol = next(RDTK.read_molecules("chebi16594.sdf"))
mw = Descriptors.MolWt(mol)
RDTK.add_tag(mol, "MW", f"{mw:.2f}")
print(RDTK.create_string(mol, "sdf"), end="")
Open Babel
native (pybel)chemfp
from openbabel import pybel

mol = next(pybel.readfile("sdf", "chebi16594.sdf"))
mol.data["MW"] = f"{mol.molwt:.2f}"
print(mol.write("sdf"), end="")
from chemfp import openbabel_toolkit as OBTK

mol = next(OBTK.read_molecules("chebi16594.sdf"))
OBTK.add_tag(mol, "MW", f"{mol.GetMolWt():.2f}")
print(OBTK.create_string(mol, "sdf"), end="")

The point is not that chemfp's toolkit API is all that much shorter than the underlying toolkit API, but rather that it's consistent across the three toolkits. This becomes more useful when you start working with more than one toolkit and have to remember the nuances of each one.

Format and format option discovery

One of the important features I wanted in chemfp was full support for all of formats supported by the underlying toolkits, and all of the options for each of those toolkits. And I wanted to make that information discoverable. For example, the following shows the formats available through chemfp for each toolkit:

>>> from chemfp import rdkit_toolkit
>>> print(", ".join(fmt.name for fmt in rdkit_toolkit.get_formats()))
smi, can, usm, sdf, smistring, canstring, usmstring, molfile,
rdbinmol, fasta, sequence, helm, mol2, pdb, xyz, mae, inchi, inchikey,
inchistring, inchikeystring
>>> from chemfp import openeye_toolkit
>>> print(", ".join(fmt.name for fmt in openeye_toolkit.get_formats()))
smi, usm, can, sdf, molfile, skc, mol2, mol2h, sln, mmod, pdb, xyz,
cdx, mopac, mf, oeb, inchi, inchikey, oez, cif, mmcif, fasta,
sequence, csv, json, smistring, canstring, usmstring, slnstring,
inchistring, inchikeystring
>>> from chemfp import openbabel_toolkit
>>> print(", ".join(fmt.name for fmt in openbabel_toolkit.get_formats()))
smi, can, usm, smistring, canstring, usmstring, sdf, inchi, inchikey,
inchistring, inchikeystring, fa, abinit, dalmol, pdbqt, mmcif, xsf,
    ... many lines removed ...
acesout, POSCAR, pcjson, gzmat, mae, pointcloud, gamess, mopcrt,
confabreport

For each format type, there are properties to say of it is an input format or output format (InChIKey, for example, is only an output format), or if the format can handle file I/O or only handle string-based I/O. (The "smistring" format can only parse a SMILES string while the "smi" format can parse a SMILES file, specified by filename or by the contents in a string.)

There are also ways to figure out the default values for the readers and writers:

>>> from chemfp import rdkit_toolkit
>>> fmt = rdkit_toolkit.get_format("sdf")
>>> fmt.get_default_reader_args()
{'sanitize': True, 'removeHs': True, 'strictParsing': True, 'includeTags': True}
>>> fmt.get_default_writer_args()
{'includeStereo': False, 'kekulize': True, 'v3k': False}

Reader and writer args

Those reader_args and writer_args can be passed to the input and output methods. For example, the RDKit writer's v3k writer_arg, if True, asks RDKit to always generate a V3000 record, even if the molecule can be expressed as a V2000 record:

>>> from chemfp import rdkit_toolkit
>>> mol = rdkit_toolkit.parse_molecule("C#N", "smistring")
>>> print(rdkit_toolkit.create_string(mol, "sdf"))

     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  3  0
M  END
$$$$

>>> print(rdkit_toolkit.create_string(mol, "sdf", writer_args={"v3k": True}))

     RDKit

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 2 1 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 N 0 0 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 3 1 2
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

and here's an example where I enable OEChem's "strict" SMILES parser so that multiple sequential bond symbols are not accepted:

>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("C=#-C", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring")
'CC'
>>> mol = openeye_toolkit.parse_molecule("C=#-C", "smistring", reader_args={"flavor": "Default|Strict"})
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C=#-C
Warning:   ^

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
       .... many lines omitted ... 
  File "<string>", line 1, in raise_tb  
chemfp.ParseError: OEChem cannot parse the smistring record: 'C=#-C'

The OpenEye flavor reader and writer args support the raw OEChem integer flags, as well as a string-based syntax to express them symbolically. In this case Default|Strict says to start with the default flags for this format then add the Strict option to it.

OEChem flavor help

The format API doesn't have a way to get detailed help about each option. For most cases it's not hard to guess from the name and Python data type. This doesn't work for OEChem's flavor options. The quickest way to get interactive help is to pass an invalid flavor and see the error message:

>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.parse_molecule(mol, "sdf", reader_args={"flavor": "x"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
       .... many lines omitted ... 
    raise err
ValueError: OEChem sdf format does not support the 'x' flavor option.
Available flavors are: FixBondMarks, SuppressEmptyMolSkip, SuppressImp2ExpENHSTE

Why did I develop chemfp's toolkit API?

I developed the API starting with chemfp 2.0 because there was a clear need to allow users to configure input processing in a way that respected what the underlying toolkits could do.

As a somewhat extreme example, the most recent version of RDKit supports the FASTA format, with a flavor option to configure it to interpret the input as protein (0 or 1), RNA (2-5), or DNA (6-9), with different values for L- or L+D- amino acids, or different options for 3' and 5' caps on the nucleotides. Thus, AA could be dialanine or a nucleotide sequence with two adenines. The following gives an example of computing the MACCS fingerprint for both cases, where I use the -R parameter to specify a reader argument:

% printf ">dialanine\nAA\n" | rdkit2fps --maccs -R flavor=0 --in fasta | tail -1
00000000000020000040084800201004842452fa09	dialanine
% printf ">diadenine capped RNA\nAA\n" | rdkit2fps --maccs -R flavor=5 --in fasta | tail -1
000000102084002191d41ccf33b3907bde6feb7d1f	diadenine capped RNA

There's less need to handle writer options since chemfp doesn't really need to write structure files. The closest is if the fingerprints or the results of a similarity search are added to an SDF output. In tomorrow's essay I'll describe some ways to modify the SDF output to include a new data item.

But really, that part of chemfp is probably more of a vanity project than anything else. I have some strong opinions on what a good API should be, and had the chance to implement it, show it handles the needs of multiple chemistry toolkits, and document it. Just like I drew some inspiration from pybel, perhaps others will draw some inspiration from the chemfp API.

I personally find it really satisfying to be able to develop, say, a HELM to SLN conversion tool which uses RDKit to convert the HELM string into an SDF record, then OEChem to convert the SDF record to SLN.

>>> from chemfp import rdkit_toolkit, openeye_toolkit
>>>
>>> def helm_to_sln(helm_str):
...   rdmol = rdkit_toolkit.parse_molecule(helm_str, "helm") 
...   sdf_record = rdkit_toolkit.create_string(rdmol, "sdf")
...   oemol = openeye_toolkit.parse_molecule(sdf_record, "sdf")
...   return openeye_toolkit.create_string(oemol, "slnstring")
...
>>> helm_to_sln("PEPTIDE1{[dA].[dN].[dD].[dR].[dE].[dW]}$$$$")
'NH2CH(C(=O)NHCH(C(=O)NHCH(C(=O)NHCH(C(=O)NHCH(C(=O)NHCH(C(=O)
OH)CH2C[1]=CHNHC[2]:C(@1):CH:CH:CH:CH:@2)CH2CH2C(=O)OH)CH2CH2C
H2NHC(=NH)NH2)CH2C(=O)OH)CH2C(=O)NH2)CH3'

This specific function may not be useful, but the ability to specify this sort of work in only a few lines makes it easier to try out new ideas.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB