Handling the SDF record delimiter

[ previous | newer ] /home/writings/diary/archive/2020/09/18/handling_the_sdf_record_delimiter

Handling the SDF record delimiter

In this essay I'll point out a common difficulty people have when trying to identify the end of an SDFile record. In my previous essay I walked through the major parts of the following record:

CHEMBL504
     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0
M  CHG  2   1   1   2  -1
M  END
> <chembl_id>
CHEMBL504

$$$$

The $$$$ line at the end is the SDFile record delimiter.

Counting records with grep

How many records are in the ChEMBL 27 SDF distribution? One solution is to count the number of $$$$ lines in the file using the -c flag to fgrep:

% gzcat chembl_27.sdf.gz | fgrep -c '$$$$'
1941411

This is the correct number, which I verified through other means.

However, it's correct only because the ChEMBL release doesn't include some of the ways that can cause that simple fgrep to give the wrong answers. But before I get to those, I'll describe why I used single quotes in the above comment, and why I used fgrep instead of grep.

$ and shell variables

I enclosed the $$$$ in single quotes because my shell uses $ to indicate variable substitution. A $$ is replaced with the shell's process id:

% echo $$
536
% echo $$$$
536536

which means:

% gzcat chembl_27.sdf.gz | fgrep -c $$$$
2

actually counts the number of times the pattern 536536 exists in the SD file.

$ and (f)grep

I used fgrep instead of grep because $ has special meaning in grep. Quoting man re_format on my Mac, $ is an ordinary character except at the end of the RE or= the end of a parenthesized subexpression where it matches the null string at the end of a line.

(The = in that text marks decisions on these aspects that may not be fully portable to other IEEE Std 1003.2 (``POSIX.2'') implementations. FreeBSD's man re_format and GNU's man 7 regex say essentially the same thing as the Mac.)

What that means is that a pattern like 123$ finds lines that end with 123 but a pattern like 12$4 finds lines containing the string 12$4. I didn't know this before writing this essay!

What that means is that if you use the grep pattern $$$$ then what you've asked to do is find all lines ending with $$$ - which gives the right answer for ChEMBL:

% gzcat chembl_27.sdf.gz | grep -c '$$$$'
1941411

but isn't quite the same as searching for $$$$.

I use fgrep so I don't need to worry about special character meanings - a $ is a $.

Another option is to use grep -F because the -F flag tells grep to interpret the pattern as a fixed string.

Issue #1: $$$$ not at the line start

One obvious way to miscount the number of records is if a line includes a $$$$ which isn't at the start of the line. The following modification of the DMSO record is valid, for a molecule named ABC$$$$:

ABC$$$$
     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0
M  CHG  2   1   1   2  -1
M  END
> <abc_id>
ABC$$$$

$$$$

However, fgrep would find three matches, not the expected one.

The obvious fix is to use a search pattern which requires the $$$$ to be at the start of the line, using the ^ grep metacharacter. As you saw earlier, the $ metacharacter matches the end-of-line when at the end of a pattern, so it's best to simply escape it:

% gzcat chembl_27.sdf.gz | grep -c '^\$\$\$\$'
1941411

(Remember that the specification says that SDF record delimiter only needs to start with $$$$, not that it's the only characters in the line.)

Issue #2: a title line that looks like a SDF record delimiter

Can other lines in an SDF record be $$$$? Yes! Take a look at the following example, which has that delimiter on the title line and on a data line:

$$$$
     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0
M  CHG  2   1   1   2  -1
M  END
> <price>
$$$$

$$$$

Is this valid? Well, no. The specification says about the title line:

This line must not contain any of the reserved tags that identify any of the other CTAB file types such as $MDL (RGfile), $$$$ (SDfile record separator), $RXN (rxnfile), or $RDFILE (RDfile headers).

but it's not always enforced by software. Here I'll use the RDKit to create a molfile block with an invalid title:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("CS(=O)C")
>>> mol.SetProp("_Name", "$$$$")
>>> print(Chem.MolToMolBlock(mol))
$$$$
     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    0.7500    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990    2.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.5981   -0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  2  0
  2  4  1  0
M  END

Since the specification says that the title line may not contain text which can be confused with an SDfile record, it can be okay to ignore this possibility.

Issue #3: a data line that looks like a SDF record delimiter

However, here's a problem that I've heard about in real life. Suppose each record has a price data item with the strings $ for an inexpensive price, up to $$$$ for very expensive.

I gave an example of that in the data item of the previous record, which I'll show again here:

> <price>
$$$$

$$$$

The specification doesn't prohibit it, and software like the RDKit doesn't prohibit it, so it gets used.

However, in practice many software tools don't support it correctly. The person who told me about this real-world case said that when the price category $$$$ was finally used, it broke parts of their data processing pipeline. (I can't remember if they fixed the broken parts or changed how they encoded that information.)

The usual fix is to realize that the previous line must either be M END (if there are no data items) or be a blank line, because each data item must end with a blank line.

However, this is not so easy to specify as a command-line one-liner. One that I came up with (and stretching the definition of one-liner) is:

% gzcat chembl_27.sdf.gz | perl -ne 'BEGIN{$n=$flg=0} \
      if (/^(M  END)?$/) {$flg=1;next} \
      if ($flg && /^\$\$\$\$/) {$n++}; $flg=0; \
      END{print "$n\n"}'
1941411

(My first attempt used awk, but that took entirely too long.)

Issue #4: comment line

Even the above still doesn't work!

The documentation says a blank line can be substituted for line 2 of the molfile, and line 3 is a line for comments. That means the following is valid:

CHEMBL504

$$$$
  4  3  0  0  0  0  0  0  0  0999 V2000
    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0
M  CHG  2   1   1   2  -1
M  END
> <chembl_id>
CHEMBL504

$$$$

where the third line (the comment line) contains a $$$$. Indeed the RDKit will parse it without a problem!

But the perl one-liner I wrote reports two record.

It's of course possible to fix this, but it's taking more and more work.

Relax!

I've now pointed out many difficulties in counting the records in an SD file. Should you worry about them?

In nearly all cases the answer is no!. I mean, if you are writing a general purpose SDF record parser which is supposed to handle all of the edge cases, including odd-ball records that rules lawyers like me might give, then certainly yes.

But nearly every record you see (eg, from PubChem or ChEMBL) will be clean, in the sense that $$$$ will only occur at the end of the record, and be the only content on that line.

If you accept a molecule name/title or data items with user-entered values then you must ensure that your data processing pipeline can handle your users entering a $$$$.

Otherwise, use an off-the-shelf tool to handle record processing and (mostly) forget about it. Here's an example with chemfp's text toolkit API:

% python -c 'from chemfp import text_toolkit as T;print(sum(1 for _ in T.read_sdf_records("chembl_27.sdf.gz")))'
1941411

On the Haswell machine I use for timings, bash's time reports:

real	0m18.423s
user	0m17.811s
sys	0m0.612s

By comparision, if I use pigz, which I found was faster than gzip on that machine then time reports:

% cat pigz_grep.sh
pigz -dc chembl_27.sdf.gz | fgrep -c '$$$$'
% time bash pigz_grep.sh
1941411

real	0m8.920s
user	0m14.328s
sys	0m2.014s

This is about twice as fast, measured as elapsed/real time. However, see how the user time is 14 seconds? That's because the script runs pigz and fgrep as two different processes, and the Haswell machine has multiple cores, so there is some parallelization. The total user+sys time is 16.3 seconds, which is only a bit faster than chemfp.

Then again, chemfp can be configured to use pigz as a co-process:

% export CHEMFP_GZIP=pigz
% time python -c 'from chemfp import text_toolkit as T;print(sum(1 for _ in T.read_sdf_records("chembl_27.sdf.gz")))'
1941411

real	0m11.600s
user	0m19.800s
sys	0m2.118s

Still not as fast as grep, or as easy to write, but it is more robust to odd uses of $$$$.

SKP record

P.S. There's one more oddity which I'll cover in the next essay - a S SKP line in the properties block.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me