SDF record walkthrough
In this essay I'll walk through the major parts of a simple V2000 SDFile
record. (The next
essay describes some of the problems in finding the
record delimiter, and the one after that is about the
S SKP line.)
Richard Apodaca summarized the SDfile format a few months ago, with details I won't cover here. You should read it for more background.
Bear in mind that the variety of names for this format leads to
some confusion. It's often called an "SDF file", which technically
expands to "structure-data file file", in the same way that
"PIN number" technically means "personal identification number number".
I tend to write "SD file", but the term in the documentation is "SDFile".
CHEMBL504 / dimethyl sulfoxide / DMSO
Here's an example of ChEMBL record CHEMBL504 which I extracted from chembl_27.sdf.gz:
CHEMBL504
     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0
M  CHG  2   1   1   2  -1
M  END
> <chembl_id>
CHEMBL504

$$$$

It represents dimethyl sulfoxide, also known as DMSO.
The format is line-oriented, and many of the fields in many lines are
specified by column position, not (for example) by
whitespace-separated fields. This is quite in line with the other
FORTRAN-style formats common when the MDL formats were first developed
in the 1970s. The format has a shallow hierarchical structure: an SDF
record contains a molfile followed by data items followed by the
SDFile record delimiter
$$$$. A molfile contains a header block
followed by a connection table (
Ctab block). A Ctab block
contains a counts line followed by an atom block followed by a bond
block followed by a properties block followed by the molfile delimiter
M  END.

CHEMBL504
     RDKit          2D

The first line of the record is the molecule name, often referred to as the
title. In the DMSO example the name is
CHEMBL504, which combines the data source name and data source identifier. This line may be blank if no name is available.
The second line may contain the user's initials (not given here), the
program name used to create the record (RDKit in this case),
the date created (not given here), and the text 2D or 3D to
describe whether the coordinates are meant for 2D or 3D depiction (a
2D is interpreted as 3D if there are non-zero
z-coordinates). The documentation describes a few other fields which
are blank in this example.
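Because these fields are located by column position, a parser slices the line instead of splitting on whitespace. Here is a minimal sketch in Python. The function name `parse_header_line2` and the dictionary keys are my own, and the column ranges are my reading of the documentation (initials in columns 1-2, program name in 3-10, date in 11-20, dimensional code in 21-22), so treat it as an illustration and not a validated parser.

```python
def parse_header_line2(line):
    """Split the second header line into its leading fixed-width fields.

    Hypothetical helper for illustration; the remaining (rarely used)
    fields of the line are ignored.
    """
    line = line.rstrip("\n").ljust(22)  # pad so short or blank lines slice safely
    return {
        "initials": line[0:2].strip(),     # columns 1-2
        "program": line[2:10].strip(),     # columns 3-10
        "date": line[10:20].strip(),       # columns 11-20 (MMDDYYHHmm)
        "dimension": line[20:22].strip(),  # columns 21-22 ("2D" or "3D")
    }


# Rebuild the line from the DMSO example field by field, so the
# column widths are explicit: no initials, "RDKit", no date, "2D".
example_line = "  " + "RDKit".rjust(8) + " " * 10 + "2D"
fields = parse_header_line2(example_line)
print(fields)
```

A whitespace split would return only `RDKit` and `2D` here, with no way to tell which fields were left blank.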
The third line is
for comments. If no comment is available, a blank
line must be present. In this case there are no comments. (While
comments are rare, CHEBI:76065
contains the comment
Structure generated using tools available at
Molfile connection table - the Ctab block
  4  3  0  0  0  0  0  0  0  0999 V2000

The fourth line is the start of the
Ctab block. There are two Ctab block variants: V2000 and V3000. The V3000 format supports a broader range of chemistry than the V2000 format. Most tools which support both formats will generate a V2000 record unless V3000 features are needed, because that increases portability with tools which only understand the older V2000 format. (You can often configure them to always generate V3000.)
The example here is in V2000 format, which you can see from the
V2000 at the end of the line. (If neither V2000 nor
V3000 is present there then it's interpreted as V2000.) The
first three columns contain the atom count and the second three
columns the bond count. This limits the V2000 ctab block to 999 atoms
and 999 bonds - the V3000 format can handle larger records.
I won't go into the details of the other count fields; most are documented
The counts line in the example indicates four atoms and three bonds. The next sections of the Ctab block are the atom block, with four atom lines, followed by the bond block, with three bond lines.
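As a rough sketch of how the counts line can be read, the following Python slices the two three-column count fields and the version field. The name `parse_counts_line` is hypothetical, and I'm assuming the version string sits in columns 34-39, after eleven three-column fields, which matches the example line.

```python
def parse_counts_line(line):
    """Return (atom count, bond count, version) from a V2000-style counts line."""
    line = line.rstrip("\n")
    num_atoms = int(line[0:3])   # columns 1-3
    num_bonds = int(line[3:6])   # columns 4-6
    # Columns 34-39 hold the version; treat a missing version as V2000.
    version = line[33:39].strip() or "V2000"
    return num_atoms, num_bonds, version


# The counts line from the DMSO example, built so the field widths are explicit.
counts_line = "  4  3" + "  0" * 8 + "999" + " V2000"
print(parse_counts_line(counts_line))
```

The three-column width is what caps the V2000 format at 999 atoms and 999 bonds.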
    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0

Each atom line contains the X, Y, and (optional) Z coordinate for the atom, the atom symbol (typically the element symbol), and a number of properties. Each bond line lists the atom indices for the two ends of the bond, the bond type, and its stereochemistry. See the documentation for full details.
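To make the column layout concrete, here is a small Python sketch which pulls out the fields mentioned above. The function names are mine, and the column positions are my reading of the V2000 documentation: three 10-column coordinates, a space, then a 3-column atom symbol for atom lines, and four 3-column fields for bond lines. The remaining atom properties are ignored.

```python
def parse_atom_line(line):
    """Return (symbol, x, y, z) from a V2000 atom line."""
    x = float(line[0:10])         # columns 1-10
    y = float(line[10:20])        # columns 11-20
    z = float(line[20:30])        # columns 21-30
    symbol = line[31:34].strip()  # columns 32-34
    return symbol, x, y, z


def parse_bond_line(line):
    """Return (first atom index, second atom index, bond type, stereo)."""
    atom1 = int(line[0:3])                  # columns 1-3
    atom2 = int(line[3:6])                  # columns 4-6
    bond_type = int(line[6:9])              # columns 7-9
    stereo = int(line[9:12].strip() or 0)   # columns 10-12, blank treated as 0
    return atom1, atom2, bond_type, stereo


# The sulfur atom line from the DMSO example, built field by field.
atom_line = f"{4.4156:10.4f}{-2.9854:10.4f}{0.0:10.4f} {'S':<3}" + " 0" + "  0" * 11
bond_line = "  2  1  1  0"

print(parse_atom_line(atom_line))
print(parse_bond_line(bond_line))
```

Atom indices in the bond block start at 1, so the bond above connects the oxygen (atom 2) to the sulfur (atom 1).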
Next are the atom list block and the stext block, but the example doesn't have any and the documentation says they aren't in current use.
M  CHG  2   1   1   2  -1
M  END

The last part of the Ctab block is the
Properties Block. The DMSO example has one property,
CHG, which specifies the formal charge of one or more atoms. Columns 7-9 contain an atom count (2),
followed by that many pairs of atom index and charge; atom 1 has a charge of +1 and atom 2 has a charge of -1.
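A charge line like this one can be decoded with a few slices. The sketch below is only an illustration: `parse_chg_line` is a made-up name, and I'm assuming each atom index and each charge occupies four columns (a space plus a three-column value), which is how I read the documentation and how the example line is laid out.

```python
def parse_chg_line(line):
    """Return a {atom index: formal charge} dictionary from an 'M  CHG' line."""
    if not line.startswith("M  CHG"):
        raise ValueError("not a charge property line: %r" % (line,))
    count = int(line[6:9])            # columns 7-9: number of (atom, charge) pairs
    charges = {}
    for i in range(count):
        start = 9 + 8 * i             # each pair takes 8 columns
        atom_index = int(line[start:start + 4])
        charge = int(line[start + 4:start + 8])
        charges[atom_index] = charge
    return charges


# The charge line from the DMSO example, built field by field.
chg_line = "M  CHG" + "  2" + "   1" + "   1" + "   2" + "  -1"
print(parse_chg_line(chg_line))
```

For the DMSO record this gives a charge of +1 on atom 1 (the sulfur) and -1 on atom 2 (the oxygen).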
The Ctab block ends with the delimiter M  END; that's two spaces between the M and the END.

> <chembl_id>
CHEMBL504

After the Ctab block is the data block with zero or more data items. Each data item contains a data header line followed by data lines followed by (and this is important!) a blank line. The data header line starts with > and contains the field name enclosed in angle brackets, says my documentation.
In the above DMSO example there is a single data item. The field name is
chembl_id and the corresponding data is
CHEMBL504. You can think of these as key-value pairs. See Richard
Apodaca's essay for a more in-depth treatment.
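The key-value view can be sketched in a few lines of Python. This is a deliberately naive reader, with names of my own choosing: it takes the lines which follow the M  END line, treats any line starting with > as a data header, pulls the field name out of the first pair of angle brackets, and collects data lines up to the next blank line. It does not handle the complications a real-world reader must deal with.

```python
import re

# First "<...>" on the data header line; assumed to hold the field name.
FIELD_NAME = re.compile(r"<([^>]*)>")


def parse_data_items(lines):
    """Return a list of (field name, data) pairs from the data block lines."""
    items = []
    lines = iter(lines)
    for line in lines:
        if line.startswith("$$$$"):
            break                  # end of the record
        if not line.startswith(">"):
            continue               # skip anything which is not a data header
        match = FIELD_NAME.search(line)
        name = match.group(1) if match else None
        data = []
        for data_line in lines:
            if data_line.strip() == "":
                break              # the blank line ends this data item
            data.append(data_line)
        items.append((name, "\n".join(data)))
    return items


data_block = ["> <chembl_id>", "CHEMBL504", "", "$$$$"]
print(parse_data_items(data_block))
```

Without the blank line this reader would swallow the `$$$$` line as data, which is one reason the blank line matters.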
SDF record delimiter
$$$$

The SDF record ends with the delimiter line
$$$$. To be more accurate, the documentation says:

"A line beginning with four dollar signs ($$$$) terminates each complete data block describing a compound."

This means that in principle the terminating delimiter could be
$$$$ABCD, though I've not seen any record using additional characters there.
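Taking that wording literally, a reader should test whether a line begins with the four dollar signs, not whether the line is exactly equal to them. Here is a minimal sketch of a record splitter written that way. The name `iter_sdf_records` is hypothetical, the two sample "records" are stand-ins and not real molfiles, and the approach ignores the delimiter problems mentioned at the start of this essay.

```python
import io


def iter_sdf_records(infile):
    """Yield each record as one string, including its delimiter line.

    Naive sketch: any line starting with "$$$$" ends the current record.
    Trailing text with no delimiter is silently dropped.
    """
    record = []
    for line in infile:
        record.append(line)
        if line.startswith("$$$$"):
            yield "".join(record)
            record = []


# Two stand-in records; the second uses extra characters after the "$$$$".
sample = io.StringIO(
    "first\n\n\nM  END\n$$$$\n"
    "second\n\n\nM  END\n$$$$ABCD\n"
)
records = list(iter_sdf_records(sample))
print(len(records))
```

Both records are found, including the one which ends with `$$$$ABCD`.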
In the next
essay I'll cover some of the other problems people run into with
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB