/home/writings/diary/archive/2020/09/17/sdf_record_walkthrough

SDF record walkthrough

In this essay I'll walk through the major parts of a simple V2000 SDFile record. (The next essay describes some of the problems in finding the $$$$ record delimiter, and the one after that is about the S  SKP line.)

Richard Apodaca summarized the SDfile format a few months ago, with details I won't cover here. You should read it for more background.

Bear in mind that the variety of names for this format leads to some confusion. It's often called an SDF file, which technically means structure-data file file, in the same way that PIN number technically means personal identification number number. I tend to write SD file, but the term in the documentation is SDFile.

CHEMBL504 / dimethyl sulfoxide / DMSO

Here's an example of ChEMBL record CHEMBL504 which I extracted from chembl_27.sdf.gz:

CHEMBL504
     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0
M  CHG  2   1   1   2  -1
M  END
> <chembl_id>
CHEMBL504

$$$$
It represents dimethyl sulfoxide, also known as DMSO.

The format is line-oriented, and many of the fields in many lines are specified by column position, not (for example) by whitespace-separated fields. This is quite in line with the other FORTRAN-style formats common when the MDL formats were first developed in the 1970s. The format has a shallow hierarchical structure: an SDF record contains a molfile followed by data items followed by the SDFile record delimiter $$$$. A molfile contains a header block followed by a connection table (Ctab block). A Ctab block contains a counts line followed by an atom block followed by a bond block followed by a properties block followed by the molfile delimiter M  END.
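As a rough illustration of that hierarchy, here is a minimal Python sketch which splits one record into its header block, Ctab block, and data block. The helper name split_record is my own invention for this example, and the code assumes a well-formed V2000 record with a single "M  END" line and a final $$$$ line. It is not a robust parser.

```python
# Minimal sketch: split one well-formed V2000 SDF record into its parts.
# "split_record" is a hypothetical helper name, not part of any toolkit.

RECORD = """\
CHEMBL504
     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0
M  CHG  2   1   1   2  -1
M  END
> <chembl_id>
CHEMBL504

$$$$
"""


def split_record(text):
    """Return (header_lines, ctab_lines, data_lines) for one record."""
    lines = text.splitlines()
    # The header block is always the first three lines of the molfile.
    header = lines[:3]
    # Search from line 4 onward so a title of "M  END" cannot confuse us.
    end = lines.index("M  END", 3)
    ctab = lines[3:end + 1]
    rest = lines[end + 1:]
    if not rest or not rest[-1].startswith("$$$$"):
        raise ValueError("record does not end with a $$$$ line")
    # Everything between "M  END" and "$$$$" is the data block.
    data = rest[:-1]
    return header, ctab, data


header, ctab, data = split_record(RECORD)
print(header[0])   # the title line
print(len(ctab))   # counts line + atoms + bonds + properties + M  END
print(data)
```

The sketch only separates the layers. Interpreting the individual lines requires the column-based rules described in the following sections.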

molfile header

CHEMBL504
     RDKit          2D

The first line of the record is the molecule name, often referred to as the title. In the DMSO example the name is CHEMBL504, which combines the data source name and data source identifier. This line may be blank if no name is available.

The second line may contain the user's initials (not given here), the name of the program used to create the record (RDKit in this case), the creation date (not given here), and the text 2D or 3D to indicate whether the coordinates are meant for 2D or 3D depiction (a 2D is interpreted as 3D if there are non-zero z-coordinates). The documentation describes a few other fields which are blank in this example.
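Because these fields are located by column rather than by whitespace, extracting them is a matter of string slicing. The following is a minimal sketch. The column ranges (1-2 for the initials, 3-10 for the program name, 11-20 for the date and time, 21-22 for the dimension code) are my reading of the CTfile documentation and are not spelled out in this essay, and the function name is hypothetical.

```python
# Minimal sketch of column-based parsing for the second header line.
# Column ranges are an assumption based on my reading of the CTfile
# documentation; "parse_program_line" is a hypothetical name.

def parse_program_line(line):
    # Pad short lines so every slice is well-defined.
    line = line.ljust(22)
    return {
        "initials": line[0:2].strip(),    # columns 1-2
        "program": line[2:10].strip(),    # columns 3-10
        "date": line[10:20].strip(),      # columns 11-20 (MMDDYYHHmm)
        "dimension": line[20:22].strip(), # columns 21-22 ("2D" or "3D")
    }


info = parse_program_line("     RDKit          2D")
print(info["program"], info["dimension"])
```

A whitespace-based split() would happen to work on this particular line, but it would misassign the fields as soon as the initials or the date were present without separating spaces.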

The third line is for comments. If no comment is available, a blank line must be present. In this case there are no comments. (While comments are rare, CHEBI:76065 contains the comment Structure generated using tools available at www.lipidmaps.org.)

Molfile connection table - the Ctab block

  4  3  0  0  0  0  0  0  0  0999 V2000
The fourth line is the start of the Ctab block. There are two Ctab block variants: V2000 and V3000. The V3000 format supports a broader range of chemistry than the V2000 format. Most tools which support both formats will generate V2000 records unless V3000 features are needed, because that increases portability with tools which only understand the older V2000 format. (You can often configure them to always generate V3000.)

The example here is in V2000 format, which you can see from the V2000 at the end of the line. (If neither V2000 nor V3000 is present there then it's interpreted as V2000.) Columns 1-3 contain the atom count and columns 4-6 contain the bond count. This limits the V2000 Ctab block to 999 atoms and 999 bonds; the V3000 format can handle larger records.
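Here is a minimal sketch of reading those fields from the counts line. The atom and bond count columns come from the description above. The position of the version field (columns 34-39) is my reading of the CTfile documentation, and parse_counts_line is a hypothetical name.

```python
# Minimal sketch: read the atom count, bond count, and version from a
# V2000-style counts line. The version column range is an assumption
# taken from my reading of the CTfile documentation.

def parse_counts_line(line):
    num_atoms = int(line[0:3])   # columns 1-3
    num_bonds = int(line[3:6])   # columns 4-6
    # A missing version field is interpreted as V2000.
    version = line[33:39].strip() or "V2000"
    return num_atoms, num_bonds, version


print(parse_counts_line("  4  3  0  0  0  0  0  0  0  0999 V2000"))
```

Note that int() accepts leading spaces, so the right-justified fields can be passed to it directly. The sketch does no validation. For a V3000 record the counts are stored elsewhere in the Ctab block, so the two numbers read here should not be trusted in that case.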

I won't go into the details of the other count fields; most are documented as obsolete.

The counts line in the example indicates four atoms and three bonds. The next sections of the Ctab block are the atom block, with four atom lines, followed by the bond block, with three bond lines.

    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    4.4130   -2.1557    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.1355   -3.3980    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6983   -3.4026    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
  3  1  1  0
  4  1  1  0

Each atom line contains the X, Y, and (optional) Z coordinate for the atom, the atom symbol (typically the element symbol), and a number of properties. Each bond line lists the atom indices for the two ends of the bond, the bond type, and its stereochemistry. See the documentation for full details.
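The atom and bond lines are also column-based. The sketch below pulls out only the fields mentioned above. The column ranges (three 10-character coordinate fields, then the symbol in columns 32-34; four 3-character bond fields) are my reading of the CTfile documentation, the function names are hypothetical, and the code assumes all three coordinates are present.

```python
# Minimal sketch: extract the main fields of V2000 atom and bond lines.
# Column ranges are an assumption based on my reading of the CTfile
# documentation; the remaining property fields are ignored.

def parse_atom_line(line):
    x = float(line[0:10])           # columns 1-10
    y = float(line[10:20])          # columns 11-20
    z = float(line[20:30])          # columns 21-30
    symbol = line[31:34].strip()    # columns 32-34
    return symbol, (x, y, z)


def parse_bond_line(line):
    atom1 = int(line[0:3])          # index of the first atom (1-based)
    atom2 = int(line[3:6])          # index of the second atom (1-based)
    bond_type = int(line[6:9])      # 1 = single, 2 = double, ...
    stereo = int(line[9:12])        # bond stereochemistry flag
    return atom1, atom2, bond_type, stereo


atom = "    4.4156   -2.9854    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0"
print(parse_atom_line(atom))
print(parse_bond_line("  2  1  1  0"))
```

In the DMSO example each bond line names atom 1, the sulfur, as one end, so the three bonds connect the sulfur to the oxygen and to the two carbons.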

Next are the atom list block and the stext block, but the example doesn't have any and the documentation says they aren't in current use.

M  CHG  2   1   1   2  -1
M  END
The last part of the Ctab block is the properties block. The DMSO example has one property, CHG, which specifies the formal charge of one or more atoms. Columns 7-9 contain the number of entries (  2), followed by that many pairs of atom index and charge; atom 1 has a charge of +1 and atom 2 has a charge of -1.

The Ctab block ends with the delimiter M  END; that's two spaces between the M and the END.
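The CHG line can be decoded with the same slicing approach. This is a minimal sketch: the entry count in columns 7-9 comes from the description above, while the layout of each entry as two 4-character fields is my reading of the CTfile documentation. The function name parse_chg_line is hypothetical.

```python
# Minimal sketch: decode an "M  CHG" property line into {atom_index: charge}.
# The 4-character width of each atom index and charge field is an
# assumption based on my reading of the CTfile documentation.

def parse_chg_line(line):
    if not line.startswith("M  CHG"):
        raise ValueError("not a CHG property line")
    count = int(line[6:9])                  # columns 7-9: number of entries
    charges = {}
    for i in range(count):
        start = 9 + 8 * i                   # each entry is 8 characters wide
        atom_index = int(line[start:start + 4])
        charge = int(line[start + 4:start + 8])
        charges[atom_index] = charge
    return charges


print(parse_chg_line("M  CHG  2   1   1   2  -1"))
```

Atom indices are 1-based, so in the DMSO record atom 1 is the sulfur and atom 2 is the oxygen.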

Data block

> <chembl_id>
CHEMBL504

After the Ctab block is the data block with zero or more data items. Each data item contains a data header line followed by data lines followed by (and this is important!) a blank line. According to the documentation, the data header line starts with > and contains the field name enclosed in angle brackets.

In the above DMSO example there is a single data item. The field name is chembl_id and the corresponding data is CHEMBL504. You can think of these as key-value pairs. See Richard Apodaca's essay for a more in-depth treatment.
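The key-value view can be sketched as a function which turns the data block lines into a dictionary. This is a simplified illustration with a hypothetical function name. It assumes every data header contains a field name in angle brackets and that every data item ends with a blank line, and it does not handle the other header variants described in the documentation.

```python
# Minimal sketch: convert data block lines into a {field_name: value} dict.
# Assumes each header line looks like "> <name>" and each item is
# terminated by a blank line. "parse_data_items" is a hypothetical name.

def parse_data_items(lines):
    items = {}
    i = 0
    while i < len(lines):
        header = lines[i]
        if not header.startswith(">"):
            raise ValueError(f"expected a data header line, got {header!r}")
        # The field name sits between the first "<" and the next ">".
        start = header.index("<") + 1
        end = header.index(">", start)
        name = header[start:end]
        i += 1
        # Collect data lines up to the blank line which ends the item.
        values = []
        while i < len(lines) and lines[i] != "":
            values.append(lines[i])
            i += 1
        i += 1  # skip the blank line
        items[name] = "\n".join(values)
    return items


print(parse_data_items(["> <chembl_id>", "CHEMBL504", ""]))
```

Multi-line values are joined with newlines. If the same field name appears twice in a record, the later value replaces the earlier one in this sketch.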

SDF record delimiter $$$$

$$$$
The SDF record ends with the delimiter line $$$$. To be more accurate, the documentation says: "A line beginning with four dollar signs ($$$$) terminates each complete data block describing a compound." This means that in principle the terminating delimiter could be $$$$ABCD, though I've not seen any record using additional characters there.
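Taken literally, that rule suggests a simple way to iterate over the records in a file: accumulate lines until one starts with $$$$. The following is a naive sketch with a hypothetical function name. It treats any line beginning with $$$$ as a delimiter, wherever it occurs, which is one reason it should not be mistaken for a robust reader.

```python
# Naive sketch: yield each SDF record as a string, using "a line that
# begins with $$$$" as the delimiter. "iter_records" is a hypothetical
# name. Lines after the final delimiter are silently dropped.
import io


def iter_records(infile):
    lines = []
    for line in infile:
        lines.append(line)
        if line.startswith("$$$$"):
            yield "".join(lines)
            lines = []


# Two tiny stand-in records; the second uses extra characters after $$$$.
text = (
    "first\n\n\nM  END\n$$$$\n"
    "second\n\n\nM  END\n$$$$ABCD\n"
)
records = list(iter_records(io.StringIO(text)))
print(len(records))
```

Using startswith() rather than an exact comparison is what lets the $$$$ABCD form through.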

In the next essay I'll cover some of the other problems people run into with the $$$$ delimiter.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2020 Andrew Dalke Scientific AB