Handling the SDF record delimiter
In this essay I'll point out a common difficulty people have when trying to identify the end of an SDFile record. In my previous essay I walked through the major parts of the following record:
CHEMBL504 RDKit 2D 4 3 0 0 0 0 0 0 0 0999 V2000 4.4156 -2.9854 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0 4.4130 -2.1557 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 5.1355 -3.3980 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 3.6983 -3.4026 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 0 3 1 1 0 4 1 1 0 M CHG 2 1 1 2 -1 M END > <chembl_id> CHEMBL504 $$$$
The $$$$
line at the end is the SDFile record delimiter.
Counting records with grep
How many records are in the ChEMBL 27 SDF distribution? One solution
is to count the number of $$$$
lines in the file using the
-c flag to fgrep:
% gzcat chembl_27.sdf.gz | fgrep -c '$$$$' 1941411
This is the correct number, which I verified through other means.
However, it's correct only because the ChEMBL release doesn't include some of the ways that can cause that simple fgrep to give the wrong answers. But before I get to those, I'll describe why I used single quotes in the above comment, and why I used fgrep instead of grep.
$
and shell variables
I enclosed the $$$$
in single quotes because my shell uses
$
to indicate variable substitution. A $$
is replaced
with the shell's process id:
% echo $$ 536 % echo $$$$ 536536
which means:
% gzcat chembl_27.sdf.gz | fgrep -c $$$$ 2
actually counts the number of times the pattern 536536
exists
in the SD file.
$
and (f)grep
I used fgrep instead of grep because $
has
special meaning in grep. Quoting man re_format on my
Mac,
where it matches
$
is an ordinary character except at the end of the RE
or= the end of a parenthesized subexpressionthe null string at the end of a line.
(The =
in that text marks decisions on these aspects that
may not be fully portable to other IEEE Std 1003.2 (``POSIX.2'')
implementations.
FreeBSD's man re_format and GNU's
man 7 regex say essentially the same thing as the Mac.)
What that means is that a pattern like 123$
finds lines that
end with 123
but a pattern like 12$4
finds lines
containing the string 12$4
. I didn't know this before writing
this essay!
What that means is that if you use the grep pattern $$$$
then
what you've asked to do is find all lines ending with
$$$
- which gives the right answer for ChEMBL:
% gzcat chembl_27.sdf.gz | grep -c '$$$$' 1941411
but isn't quite the same as searching for $$$$
.
I use fgrep so I don't need to worry about special character
meanings - a $
is a $
.
Another option is to use grep -F
because the
-F flag tells grep to interpret the pattern as a fixed
string.
Issue #1: $$$$
not at the line start
One obvious way to miscount the number of records is if a line
includes a $$$$
which isn't at the start of the line. The
following modification of the DMSO record is valid, for a molecule
named ABC$$$$
:
ABC$$$$ RDKit 2D 4 3 0 0 0 0 0 0 0 0999 V2000 4.4156 -2.9854 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0 4.4130 -2.1557 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 5.1355 -3.3980 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 3.6983 -3.4026 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 0 3 1 1 0 4 1 1 0 M CHG 2 1 1 2 -1 M END > <abc_id> ABC$$$$ $$$$
However, fgrep would find three matches, not the expected one.
The obvious fix is to use a search pattern which requires the
$$$$
to be at the start of the line, using the ^
grep
metacharacter. As you saw earlier, the $
metacharacter matches
the end-of-line when at the end of a pattern, so it's best to simply
escape it:
% gzcat chembl_27.sdf.gz | grep -c '^\$\$\$\$' 1941411
(Remember that the specification says that SDF record delimiter only
needs to start with $$$$
, not that it's the only
characters in the line.)
Issue #2: a title line that looks like a SDF record delimiter
Can other lines in an SDF record be $$$$
? Yes! Take a look at
the following example, which has that delimiter on the title line and
on a data line:
$$$$ RDKit 2D 4 3 0 0 0 0 0 0 0 0999 V2000 4.4156 -2.9854 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0 4.4130 -2.1557 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 5.1355 -3.3980 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 3.6983 -3.4026 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 0 3 1 1 0 4 1 1 0 M CHG 2 1 1 2 -1 M END > <price> $$$$ $$$$
Is this valid? Well, no. The specification says about the title line:
This line must not contain any of the reserved tags that identify any of the other CTAB file types such as $MDL (RGfile), $$$$ (SDfile record separator), $RXN (rxnfile), or $RDFILE (RDfile headers).but it's not always enforced by software. Here I'll use the RDKit to create a molfile block with an invalid title:
>>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("CS(=O)C") >>> mol.SetProp("_Name", "$$$$") >>> print(Chem.MolToMolBlock(mol)) $$$$ RDKit 2D 4 3 0 0 0 0 0 0 0 0999 V2000 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 1.2990 0.7500 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0 1.2990 2.2500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 2.5981 -0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 2 3 2 0 2 4 1 0 M END
Since the specification says that the title line may not contain text which can be confused with an SDfile record, it can be okay to ignore this possibility.
Issue #3: a data line that looks like a SDF record delimiter
However, here's a problem that I've heard about in real life. Suppose
each record has a price
data item with the strings $
for
an inexpensive price, up to $$$$
for very expensive.
I gave an example of that in the data item of the previous record, which I'll show again here:
> <price> $$$$ $$$$
The specification doesn't prohibit it, and software like the RDKit doesn't prohibit it, so it gets used.
However, in practice many software tools don't support it correctly.
The person who told me about this real-world case said that when the
price category $$$$
was finally used, it broke parts of their
data processing pipeline. (I can't remember if they fixed the broken
parts or changed how they encoded that information.)
The usual fix is to realize that the previous line must either be
M END
(if there are no data items) or be a blank line,
because each data item must end with a blank line.
However, this is not so easy to specify as a command-line one-liner. One that I came up with (and stretching the definition of one-liner) is:
% gzcat chembl_27.sdf.gz | perl -ne 'BEGIN{$n=$flg=0} \ if (/^(M END)?$/) {$flg=1;next} \ if ($flg && /^\$\$\$\$/) {$n++}; $flg=0; \ END{print "$n\n"}' 1941411
(My first attempt used awk, but that took entirely too long.)
Issue #4: comment line
Even the above still doesn't work!
The documentation says a blank line can be substituted for line 2
of the molfile,
and line 3 is a line for comments
. That means the following is valid:
CHEMBL504 $$$$ 4 3 0 0 0 0 0 0 0 0999 V2000 4.4156 -2.9854 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0 4.4130 -2.1557 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 5.1355 -3.3980 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 3.6983 -3.4026 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 0 3 1 1 0 4 1 1 0 M CHG 2 1 1 2 -1 M END > <chembl_id> CHEMBL504 $$$$
where the third line (the comment line) contains a $$$$
. Indeed
the RDKit will parse it without a problem!
But the perl one-liner I wrote reports two record.
It's of course possible to fix this, but it's taking more and more work.
Relax!
I've now pointed out many difficulties in counting the records in an SD file. Should you worry about them?
In nearly all cases the answer is no!. I mean, if you are writing a general purpose SDF record parser which is supposed to handle all of the edge cases, including odd-ball records that rules lawyers like me might give, then certainly yes.
But nearly every record you see (eg, from PubChem or ChEMBL) will be
clean, in the sense that $$$$
will only occur at the end of the
record, and be the only content on that line.
If you accept a molecule name/title or data items with user-entered
values then you must ensure that your data processing pipeline can
handle your users entering a $$$$
.
Otherwise, use an off-the-shelf tool to handle record processing and (mostly) forget about it. Here's an example with chemfp's text toolkit API:
% python -c 'from chemfp import text_toolkit as T;print(sum(1 for _ in T.read_sdf_records("chembl_27.sdf.gz")))' 1941411
On the Haswell machine I use for timings, bash's time
reports:
real 0m18.423s user 0m17.811s sys 0m0.612s
By comparision, if I use pigz, which I found was faster
than gzip on that machine then time
reports:
% cat pigz_grep.sh pigz -dc chembl_27.sdf.gz | fgrep -c '$$$$' % time bash pigz_grep.sh 1941411 real 0m8.920s user 0m14.328s sys 0m2.014s
This is about twice as fast, measured as elapsed/real time. However,
see how the user time is 14 seconds? That's because the script runs
pigz and fgrep as two different processes, and the Haswell machine has
multiple cores, so there is some parallelization. The total
user+sys
time is 16.3 seconds, which is only a bit faster than
chemfp.
Then again, chemfp can be configured to use pigz as a co-process:
% export CHEMFP_GZIP=pigz % time python -c 'from chemfp import text_toolkit as T;print(sum(1 for _ in T.read_sdf_records("chembl_27.sdf.gz")))' 1941411 real 0m11.600s user 0m19.800s sys 0m2.118s
Still not as fast as grep, or as easy to write, but it is more robust to odd uses of $$$$
.
SKP record
P.S. There's one more oddity which I'll cover in the next essay - a S SKP
line in the properties block.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2020 Andrew Dalke Scientific AB