Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2020/09/25/mixing_text_and_chemistry_toolkits

Mixing text and chemistry toolkits

This is part of a series of essays about using chemfp to work with SD files at the record and simple text level. Chemfp has a text toolkit to read and write SDF and SMILES files as records, rather than molecules. It also has a chemistry toolkit I/O API to have a consistent way to handle structure input and output when working with the OEChem, RDKit, and Open Babel toolkits. In this essay I'll combine the two, so chemfp reads records from an SD file, which are then passed to a chemistry toolkit for further parsing, then chemfp adds a data item back to the original record instead of converting the toolkits molecule into a new SDF record.

You can follow along yourself by installing chemfp (under the Base License Agreement) using:

python -m pip install chemfp -i https://chemfp.com/packages/

chemfp is a package for high-performance cheminformatics fingerprint similarity search. You'll also need at least one of the chemistry toolkits I mentioned.

A simple pipeline component to process SD files

Here's a program to add a data item to an SD file. It's supposed to be part of pipeline, reading from stdin and writing to stdout. To make things simple, all it does is add the data item STATUS with the value good. It uses chemfp's toolkit API (see yesterday's essay) to handle file I/O so you can select your toolkit of choice by commenting/uncommenting the appropriate import lines:

## Select the toolkit you want to use
from chemfp import openbabel_toolkit as T
#from chemfp import rdkit_toolkit as T
#from chemfp import openeye_toolkit as T

# Use None to read from stdin or or write to stdout.
# Specify "sdf" format; The default "smi" is for SMILES file format.

with T.read_molecules(None, "sdf") as reader:
    with T.open_molecule_writer(None, "sdf") as writer:
        for mol in reader:
            T.add_tag(mol, "STATUS", "good")
            writer.write_molecule(mol)

Toolkits make non-chemically significant changes

I used this program to process the chebi16594.sdf file I created for yesterday's essay. Here's a side-by-side difference of the changes each toolkit makes to the original record, using screen shots from Apple's "FileMerge":

Changes each toolkit makes to chebi16594.sdf
Open Babel
  • Adds an OpenBabel program, timestamp, and "2D" on the second line.
  • Uses "999" for the deprecated count of additional property lines.
  • Sorts the atom indices in the bond block from smallest to largest.
  • Adds atom stereo parity of 3 (either or unmarked stereo center) to two atoms.
  • Uses two spaces instead of one for the first line of the data items.
RDKit
  • Adds an RDKit program and "2D" on the second line.
  • Uses "999" for the deprecated count of additional property lines.
  • Doesn't include "5" (meaning a charge of -1) for the first oxygen atom.
    It's unneeded because it duplicates the CHG property. The documentation
    since at least 1992 says it's Retained for compatibility with older Ctabs,
  • Removes three bond columns which are not used or not needed for this record type.
OEChem
  • Adds an OEChem program, timestamp, and "2D" on the second line.
  • Omits the fff field which was already obsolete by 1992.
  • Uses "999" for the deprecated count of additional property lines.
  • Sorts the atom indices in the bond block from smallest to largest.

Every program changed the input but not, I'll stress, not in a chemically meaningful way. I've never come across a case where one of the changes would have affected me, though I suppose there are times when you might want to preserve the program/timestamp and comment lines.

Some toolkits can't process some records

Now I put the program in the pipeline and start using it on real data sets. For example, if I process ChEBI_complete.sdf.gz with RDKit it quickly stops with the following:

% gzcat ChEBI_complete.sdf.gz | python add_status.py > /dev/null
[12:52:56] WARNING: not removing hydrogen atom without neighbors
[12:52:56] Explicit valence for atom # 12 N, 4, is greater than permitted
Traceback (most recent call last):
  File "/Users/dalke/add_status.py", line 11, in <module>
    for mol in reader:
         ... many lines deleted ...
  File "<string>", line 1, in raise_tb
chemfp.ParseError: Could not parse molecule block, file '<stdin>', line 679957, record #380

By default chemfp's toolkit API stops processing when it detects that a record cannot be processed. This is part of the Errors should never pass silently suggested guideline for Python. To skip unparsable records, add errors="ignore" to the reader, like this:

with T.read_molecules(None, "sdf", errors="ignore") as reader:

I was curious about how many ChEBI records could not be parsed by the different toolkits so I wrote the following program to test each available toolkit (including chemfp's own text toolkit which iterates over records in SDF and SMILES files).

import chemfp

filename = "ChEBI_complete.sdf.gz"

# Test all available toolkits
toolkits = []
for name in ("text", "rdkit", "openeye", "openbabel"):
    try:
        toolkits.append(chemfp.get_toolkit(name))
    except ValueError:
        pass

# Tell OEChem to not skip records with no atoms
reader_args = {"openeye.sdf.flavor": "Default|SuppressEmptyMolSkip"}

# Count how many records or molecules each one finds.
results = []
for toolkit in toolkits:
    with toolkit.read_molecules(filename, reader_args=reader_args, errors="ignore") as reader:
        num_records = sum(1 for _ in reader)
    results.append( (toolkit.name, num_records) )

# Report the counts
print(f"After parsing {filename!r}:")
for name, num_records in results:
    print(f"{name} toolkit found {num_records} records")

The output shows that Open Babel and OEChem could process each record, while RDKit was unable to read 244 records.

After parsing 'ChEBI_complete.sdf.gz':
text toolkit found 113902 records
rdkit toolkit found 113658 records
openeye toolkit found 113902 records
openbabel toolkit found 113902 records

OEChem's SuppressEmptyMolSkip

Be aware that that the first version of this program reported that OEChem was not able to find 3 of the records. Further analysis showed the missing records were CHEBI:147324, CHEBI:147325, and CHEBI:156288 and all three of these records have no atoms.

That reminded me that by default OEChem skips records from an SD file with no atoms. Release 2.2.0 added the SuppressEmptyMolSkip flag. Quoting the documentation:

This input flavor suppresses the default action of skipping empty molecules in the input stream. This may be important in order to recover SDData stored on empty molecule records.

chemfp's reader_args support namespaces

The above code uses a reader_arg to configure OEChem's SDF flavor to include SuppressEmptyMolSkip, which I'll repeat here:

# Tell OEChem to not skip records with no atoms
reader_args = {"openeye.sdf.flavor": "Default|SuppressEmptyMolSkip"}

Even thought it wasn't needed for this case, I decided to show how the reader_args support namespaces. This reader_args will only configure the "flavor" value for the "sdf" reader in the "openeye" toolkit. It won't change the flavor for, say, OEChem's SMILES reader, or RDKit's or Open Babel's FASTA flavor.

What this means is that you can have one reader_args which correctly specifies the configuration for each of toolkit formats you might pass in.

What if you must always output a record?

Sometimes it's okay to ignore uninterpretable records, sometimes it isn't. That uncertainty is why chemfp's toolkit API default is to raise an exception and force you to decide.

Let's suppose your pipeline requires the same number of output records as input records, and that the status to use for a record which could not be parsed is bad. How do you do that? (Assume you use RDKit, where there's a chance that some records cannot be parsed.)

One way is to use chemfp's text toolkit to read the records from the SD file, and pass each record to the chemistry toolkit for parsing. If the molecule cannot be parsed, add bad to the STATUS data item, otherwise add good. Here's how that program might look:

from chemfp import text_toolkit

#from chemfp import openbabel_toolkit as T
#from chemfp import openeye_toolkit as T
from chemfp import rdkit_toolkit as T

reader_args = {"openeye.sdf.flavor": "Default|SuppressEmptyMolSkip"}

with text_toolkit.read_molecules(None, "sdf") as reader:
    with text_toolkit.open_molecule_writer(None, "sdf") as writer:
        for text_record in reader:
            if T.parse_molecule(text_record.record, "sdf",
                                reader_args=reader_args, errors="ignore") is None:
                status = "bad"
            else:
                status = "good"
            text_record.add_tag("STATUS", status)
            writer.write_molecule(text_record)

Chemfp's text_toolkit has very limited ability to modify an SDF record; all it can do is append a data item to the end of the record. This is because that feature was developed so chemfp-based programs can easily add a fingerprint, or add similarity search results, to an SDF record. It doesn't (yet) have the ability to replace or remove data items.

What if you want to preserve the input record?

Remember earlier when I showed how the different toolkits can change the syntax of a record, in a chemically insignificant way? The biggest change was the second line, which may store the program name and date (and other information) is always updated by the toolkit.

What if you don't want to change anything? In that case, the previous program shows that you can use chemfp to read the record, pass it off to the toolkit, and update the original record, so the only changes are any new data items you may have appended.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB