Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2020/09/22/fun_with_sdf_records

Fun with SDF records - chemfp's text toolkit

Earlier this year, Noel O'Boyle wrote the essay Python patterns for processing large SDF files and Richard Apodaca wrote Reading Large SDfiles in Rust. In this essay I'll show some examples of using chemfp's text toolkit API to extract non-chemical/near-chemical data from SDF records. The next essay will be a short one on read_sdf_ids_and_values(), followed by one which is more chemisty focused.

You can follow along yourself by installing chemfp (under the Base License Agreement) using:

python -m pip install chemfp -i https://chemfp.com/packages/

chemfp's text toolkit API

You likely know chemfp is a package for high-performance cheminformatics fingerprint similarity search. What you probably don't know is the Python API includes a text toolkit API, with functions to read SMILES and SD files. It can be used to, for examples:

The low-level SDF record iterators are:

Comment lines in ChEBI

Last week in my SDF record walkthrough I pointed out that ChEBI record CHEBI:76065 is a rare example of an SDF record with a non-empty comment line. (That's the third line of the record.)

I found it using something like this long one-liner using ChEBI_complete.sdf.gz which I downloaded on 21 Sept. 2020:

% python -c 'from chemfp import text_toolkit as T;print(*sorted(set(record.splitlines()[2] for record in T.read_sdf_records("ChEBI_complete.sdf.gz"))), sep="\n")'
b' '
b'  '
b'    '
b'         '
b'          '
b'           '
b'            '
b'              '
b'                '
b'Mrv1722008032014042D          '
b'Structure generated using tools available at www.lipidmaps.org'

The reason for the b'' in the output because chemfp parse the records as byte strings, not Unicode, for performance reasons. (In addition, the SD file format does not specify a character encoding, and I've seen Latin-1 encoded files which cannot be read as UTF-8.) The print function, when given a byte string, encodes it with the leading b' and trailing '. It would also escape any non-ASCII bytes, but these comments are all ASCII.

This one-liner can be written as a small program which is likely easier to read. However, it's very idomatic so I'll explain it after showing you the program.

from chemfp import text_toolkit as T

with T.read_sdf_records("ChEBI_complete.sdf.gz") as record_reader:
    # the comment is the 3rd line in the record = record.splitlines()[2]
    comment_lines = set(record.splitlines()[2] for record in record_reader)

# print each comment, with a newline as a separator
print(*sorted(comment_lines), sep="\n")

The read_sdf_records function returns a context manager to close the file when done, so I'll use the with-statement. The context manager is also a iterator over the records in the file. For each record, I get the comment line and add it to a set. Once done, I sort the comments then print them. The splat operator * iterates over the sorted elements and turns them into positional arguments to the print function. The sep parameter separates each term by a newline (rather than a space) so one comment is on each output line.

The output from this program is the same as the one-liner so I won't repeat it here.

Comment counts

I was curious about which comments were the most common so I modified the program to use the Counter data type. It count the number of occurrences of each unique value, and its most_common method returns the unique value and count ordered from most to least:

from chemfp import text_toolkit as T
from collections import Counter

with T.read_sdf_records("ChEBI_complete.sdf.gz") as record_reader:
    comment_lines = Counter(record.splitlines()[2] for record in record_reader)

for comment, count in comment_lines.most_common():
    print(f"{count:8d}: {comment}")

The output is:

   68485: b''
   40832: b'null'
    4558: b' '
      12: b'                '
       4: b'           '
       3: b'          '
       2: b'    '
       1: b'  '
       1: b'              '
       1: b'         '
       1: b'            '
       1: b'Structure generated using tools available at www.lipidmaps.org'
       1: b'Mrv1722008032014042D          '

That last entry, starting with Mrv1722…, is odd. It's formatted like what should have been on line 2, not line 3. And indeed, that record is missing line 2:

% zgrep --before=3 --after=2 Mrv1722008032014042D ChEBI_complete.sdf.gz

  6  5  0  0  0  0            999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0

I was also curious about which tool put a "null" in the comment so I wrote a program which parsed the second line (which contains the program name and date). It was CDK, and all of the dates were between 2016-02-24 and 2016-02-26 or on 2017-02-23. I'll leave that implementation to the reader.

It's a bit odd because git annotate cdk/io/MDLV2000Writer.java says CDK could handle that case by 2008:

dc6e3a50d6      (  John May     2014-09-10 15:05:32 +0100       357)        String comment = (String) container.getProperty(CDKConstants.REMARK);
7c5c872c6b      (     egonw     2008-02-22 19:13:45 +0000       358)        if (comment == null) comment = "";
dc6e3a50d6      (  John May     2014-09-10 15:05:32 +0100       359)        if (comment.length() > 80) comment = comment.substring(0, 80);
7a2e5e230f      (     egonw     2008-05-25 07:53:38 +0000       360)        writer.write(comment);

The ChEBI title line is not useful

One thing about the ChEBI SDF distribution is that it doesn't use the title/molecule name field of the SDF record, which is the first line of the record. Using a program similar to the earlier one, I see that 64,942 records have a blank line there, 39,518 have the title null (all from CDK), plus an assortment of Structure #1, Compound 2, IUPAC names, tradenames, SMILES strings, filenames, and more.

Instead, it's stored in the data block as the ChEBI ID. There's also a ChEBI Name and other fields:

> <ChEBI ID>

> <ChEBI Name>

What are the ChEBI SD tags?

The ChEBI ID and ChEBI Name are two of many data items which may be in a ChEBI file. What are the others?

Assuming the records are clean enough, here's a one-liner using Perl which counts the number of occurrences of each unique data item tag:

% gzcat ChEBI_complete.sdf.gz | perl -ne 'if (/^>.*<([^>]+)>/) {$seen{$1}++} END {foreach $key (keys %seen) {print "$seen{$key} $key\n"}}' | sort -nr
113902 Star
113902 ChEBI Name
113902 ChEBI ID
113818 Formulae
113721 PubChem Database Links
113716 Charge
112938 SMILES
44 ArrayExpress Database Links
13 LIPID MAPS class Database Links
4 ChemIDplus Database Links
3 CiteXplore citation Links
1 PPR Links
1 FAO/WHO standards Database Links

chemfp's text_toolkit module has a few functions to extract fields from an SDF record:

I'll use get_sdf_tag_pairs() to get the list of pairs in the first record of the ChEBI file:

>>> from chemfp import text_toolkit as T
>>> for tag, data in T.get_sdf_tag_pairs(record):
...   print(tag, data)
b'ChEBI ID' b'CHEBI:90'
b'ChEBI Name' b'(-)-epicatechin'
b'Star' b'3'
b'Definition' b'A catechin with (2R,3R)-configuration.'
b'Secondary ChEBI ID' b'CHEBI:18484'
b'InChI' b'InChI=1S/C15H14O6/c16-8-4-11(18)9-6-13(20)15(21-14(9)5-8)7-1-2-10(17)12(19)3-7/h1-5,13,15-20H,6H2/t13-,15-/m1/s1'
b'SureChEMBL Database Links' b'SCHEMBL19412'
b'UniProt Database Links' b'Q9SEV0'
b'Last Modified' b'15 Feb 2018'

As before, these are byte strings, not Unicode strings, so Python prints them with special escaping rules.

Putting it together, here's a program which counts the number of times each unique tag is used in the ChEBI release.

from chemfp import text_toolkit as T
from collections import Counter

tag_counts = Counter()
with T.read_sdf_records("ChEBI_complete.sdf.gz") as reader:
    for record in reader:
        tag_counts.update(tag for (tag, value) in T.get_sdf_tag_pairs(record))

for tag, count in tag_counts.most_common():
    print(count, tag.decode("ascii"))

The output is almost the same as the one-liner; the difference is in how ties are broken when two tags have the same count.

What about the performance?

The one-liner with Perl takes 4.55 seconds, of which 5.55 seconds is user time and 0.26 seconds are system time. The user time is larger than the real time because on my multicore MacBook Pro the gzip decompression occurs in parallel with the perl code.

The chemfp program takes 4.85 seconds, of which 4.75 are user time and 0.06 are system time. It's slightly slower than the one-liner in part because everything occurs in the same process. (Chemfp uses libz directly, for better internal performance.)

Chemfp can be configured to use an external process. If I set the CHEMFP_GZIP environment to gzip then the program takes 4.57 seconds, of which 4.83 seconds are user time and 0.17 seconds are system time. This is essentially (within the measurement variability of my testing) the same as the one-liner, and with higher robustness.

Get the record id for a given record

Earlier we saw that one of the records has the comment Structure generated using tools available at www.lipidmaps.org. What's the associated ChEBI ID? We can parse the records to find the one with the comment, then use get_sdf_tag() to get the first value with a given tag name. While I can write it as a one-liner in the shell, I'll split it up as a more readable program that can be pasted into the Python shell or Jupyter notebook:

from chemfp import text_toolkit as T
for record in T.read_sdf_records("ChEBI_complete.sdf.gz"):
    if b"www.lipidmaps.org" in record:
        print(T.get_sdf_tag(record, b"ChEBI ID"))

This program assumes that www.lipidmaps.org only exists once, which happens to be the case - the one line of output is CHEBI:76065.

A more careful program might check if that substring is in the comment line.


An alternative is to let chemfp iterate over the pairs of (id, record) using read_sdf_ids_and_records(). By default the id is the title line, which isn't useful in the ChEBI SDF distribution.

>>> from chemfp import text_toolkit as T 
>>> for count, (id, record) in enumerate(T.read_sdf_ids_and_records("ChEBI_complete.sdf.gz")):
...    print(f"id: {id!r} starts: {record[:40]!r}")
...    if count == 4:
...       break
id: '' starts: b'\n  Marvin  01211310252D          \n\n 22 2'
id: '' starts: b'\n  Marvin  02151210452D          \n\n 11 1'
id: '' starts: b'\n  Mrv0541 12031312382D          \n\n 10  '
id: '' starts: b'\n  Marvin  10251009462D          \n\n 24 2'
id: '' starts: b'\n  Marvin  11020709032D          \n\n 10 1'

Instead, ask it to get the id from the data item with the tag name ChEBI ID:

>>> for count, (id, record) in enumerate(T.read_sdf_ids_and_records("ChEBI_complete.sdf.gz", id_tag="ChEBI ID")):
...    print(f"id: {id!r} starts: {record[:40]!r}")
...    if count == 4:
...       break
id: 'CHEBI:90' starts: b'\n  Marvin  01211310252D          \n\n 22 2'
id: 'CHEBI:165' starts: b'\n  Marvin  02151210452D          \n\n 11 1'
id: 'CHEBI:598' starts: b'\n  Mrv0541 12031312382D          \n\n 10  '
id: 'CHEBI:776' starts: b'\n  Marvin  10251009462D          \n\n 24 2'
id: 'CHEBI:943' starts: b'\n  Marvin  11020709032D          \n\n 10 1'

This simplifies the search program to:

>>> from chemfp import text_toolkit as T
>>> for (id, record) in T.read_sdf_ids_and_records("ChEBI_complete.sdf.gz", id_tag="ChEBI ID"):
...     if b"www.lipidmaps.org" in record:
...         print(id)

ChEBI ID as a function of date

The second line of the SDF record may contains a timestamp, though in practice it can be empty or be incorrectly formatted. I'm curious to see how the ChEBI ID changes as a function of date. I presume it's roughly linear. I started Jypter and pasted the following in the first cell:

import datetime
from chemfp import text_toolkit as T

timestamps = []
chebi_nums = []

num_records = num_missing_timestamp = num_bad_timestamps = 0

for chebi_id, record in T.read_sdf_ids_and_records("ChEBI_complete.sdf.gz", id_tag="ChEBI ID"):
    num_records += 1
    assert chebi_id.startswith("CHEBI:"), chebi_id
    chebi_num = int(chebi_id[6:])
    lines = record.splitlines()
    timestamp_str = lines[1][10:20]
    if not timestamp_str:
        num_missing_timestamp += 1
        timestamp = datetime.datetime.strptime(timestamp_str.decode("ascii"), "%m%d%y%H%M")
    except ValueError:
        #print(f"Cannot parse timestamp for {chebi_id}: {timestamp_str}")
        num_bad_timestamps += 1

# The f'{expr=}' option was added in Python 3.8
# https://docs.python.org/3/whatsnew/3.8.html#f-strings-support-for-self-documenting-expressions-and-debugging
print(f"{num_records=} {num_missing_timestamp=} {num_bad_timestamps=}")

The output was:

num_records=113902 num_missing_timestamp=4855 num_bad_timestamps=1718

Then in the next cell, plot the ids as a function of time:

%matplotlib inline
import matplotlib.pyplot as plt
plt.ylabel("ChEBI ID")
plt.plot(timestamps, chebi_nums, 'o', color='black')
# plt.savefig("ChEBI_ids_over_time.png")

The resulting scatter plot surprised me - I wonder what happened around 2008 and 2011 to have those spikes. But I'll leave it as an idle curiousity.

Atom and bond count distributions

If you know the chemistry is correct then it can be faster to process an SDF as text rather than use a chemically-aware toolkit. OEChem is the fastest chemistry toolkit I use; it takes only 17 seconds to parse the 113,902 ChEBI records.

However, suppose you want to get a distribution of the heavy atom and bond counts for molecules with at most 100 atoms or 100 bonds. That information is available (almost) from the counts line on the fourth line and can be extracted with the following in about 3.5 seconds:

from chemfp import text_toolkit as T

atom_counts = []
bond_counts = []

for record in T.read_sdf_records("ChEBI_complete.sdf.gz"):
    lines = record.splitlines()
    counts_line = lines[3]
    num_atoms = int(counts_line[:3])
    num_bonds = int(counts_line[3:6])
    num_explicit_hydrogens= sum(1 for line in lines[5:5+num_atoms] if line[31:34].strip() == b'H')
    num_atoms -= num_explicit_hydrogens
    num_bonds -= num_explicit_hydrogens
    if num_atoms > 100 or num_bonds > 100:

That num_explicit_hydrogens code handles the almost case where there are explicit hydrogens in the atom block, assuming all hydrogens have one and only one bond! It iterates through the atom block lines, extracts the element symbol (the [31:34]), removes any whitespace, and checks if the result is H for hydrogen. Since the record is a byte string, and I use byte string operations, I need to test the result against the byte string b'H'.

Histogram of atom and bond counts

I visualized the results in a Jupyter notebook using the following:

%matplotlib inline
import matplotlib.pyplot as plt

plt.suptitle("ChEBI records with at most\n100 heavy atoms or bonds")
plt.subplot(1, 2, 1)
plt.ylabel("number of records")
plt.xlabel("atom counts")
NA = max(atom_counts)
plt.hist(atom_counts, range=(0, NA), bins=NA)

plt.subplot(1, 2, 2)
plt.xlabel("bond counts")
NB = max(bond_counts)
plt.hist(bond_counts, range=(0, NB), bins=NB)


Giving the following plot:


Beware! This is getting perilously close to writing your own cheminformatics toolkit! Always ask yourself if the extra performance is worth things like accidentally and silently failing on V3000 records, or allowing invalid chemistry, or including salts, or the other things that a full-fledged toolkit can handle.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2020 Andrew Dalke Scientific AB