SMILES tokens
Okay, so you're a hard-core geek and you want to know how to write your
own chemical informatics toolkit, so that you too will get fun, fame,
fortune, and fem----... err, MOTAS. As far as I know, there is no
documentation or textbook on this topic. The chemical informatics
books I've seen all talk about the science of how the fields work, not
the software, and of course almost no computer science book goes into
chemistry, other than some mention that graph theory can be applied to
the valence bond model of molecules. Let me be your trusty guide
through these uncharted electrons.
I'll start with SMILES since it's such a pretty, well-defined nomenclature. Eventually I'll describe SD and perhaps mol2 files, and a bit of the agony that is the PDB. I'll assume you know what SMILES is. If not, Daylight has plenty of documentation.
If you have a compound which falls into Daylight's model of chemistry (i.e., covalent bonds, in a ground state, etc.) then it can be represented as a SMILES string, or alternatively you can say that it's represented in SMILES. A SMILES string can be broken down into smaller terms: atom, bond, open branch, close branch, ring closure, and dot disconnect. To make the enumeration easier to understand, I'll separate the atom term into "element" and "atom", where the first is an atom in the organic subset and may be written without square brackets, and the second is any atom written inside square brackets.
- element: one of 'c', 'n', 'o', 's', 'p', 'B', 'C', 'N', 'O', 'F', 'P', 'S', 'I', 'Cl', or 'Br'.
- atom: of the form '[' mass? symbol chiral? hcount? charge? ']', where the '?' means the given component is optional and where
  - mass: a non-negative integer
  - symbol: one of '*', 'H', 'He', 'Li', 'Be', ... (the element symbols)
  - chiral: one of the chiral symbols (which I won't list right now)
  - hcount: 'H' optionally followed by a non-negative integer
  - charge: the atomic charge, written as '+' followed by either a non-negative integer or zero or more '+'s, or as '-' followed by either a non-negative integer or zero or more '-'s
- bond: one of '=', '#', '/', '\', ':', '~', or '-'.
- open branch: the character '('
- close branch: the character ')'
- ring closure: either a single digit (including 0, but don't use a 0 when generating a SMILES string) or '%' followed by two digits (and note that %09 and 9 represent the same ring closure).
- dot disconnect: the character '.'
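The enumeration above maps fairly directly onto a regular expression. What follows is only a minimal Python sketch, not code from any toolkit: the names `TOKEN_PATTERN`, `tokenize`, and `ring_closure_number` are made up for illustration, and a bracket atom is recognized only by its '[' ... ']' extent, without checking what is inside.

```python
import re

# One named group per term. Two-letter organic-subset symbols are tried
# first so that 'Cl' is not read as 'C' followed by a stray 'l'.
TOKEN_PATTERN = re.compile(r"""
    (?P<atom>\[[^\[\]]*\])                 # bracket atom, contents unchecked
  | (?P<element>Cl|Br|[cnospBCNOFPSI])     # organic subset, no brackets
  | (?P<bond>[=\#/\\:~-])
  | (?P<open_branch>\()
  | (?P<close_branch>\))
  | (?P<ring_closure>%[0-9]{2}|[0-9])
  | (?P<dot_disconnect>\.)
""", re.VERBOSE)


def tokenize(smiles):
    """Return a list of (kind, text) pairs, or raise ValueError."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        m = TOKEN_PATTERN.match(smiles, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {smiles[pos]!r} at position {pos}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def ring_closure_number(text):
    """Map a ring closure token to its number, so '%09' and '9' agree."""
    return int(text.lstrip("%"))
```

For instance, `tokenize("C%10")` gives an element token for 'C' followed by a ring closure token for '%10', and `ring_closure_number` returns 9 for both '%09' and '9'.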
What I've done here is break a SMILES string into its smallest parts. In linguistics these parts are called morphemes (or is that lexemes?). In computer science these are called tokens. Here are some SMILES strings and their tokens.
SMILES | tokens |
CCO | element: 'C' element: 'C' element: 'O' |
CC(=O)O | element: 'C' element: 'C' open branch: '(' bond: '=' element: 'O' close branch: ')' element: 'O' |
[Na+].[Cl-] | atom: '[' symbol: 'Na' charge: '+' ']' dot disconnect: '.' atom: '[' symbol: 'Cl' charge: '-' ']' |
[235U] | atom: '[' mass: '235' symbol: 'U' ']' |
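The bracket atoms in these examples can themselves be split into mass, symbol, chiral, hcount, and charge. Here is a rough Python sketch under several assumptions that are mine and not from the text above: the hcount is written as 'H' with optional digits (as in "[NH4+]"), only '@' and '@@' are accepted as chiral symbols, and the symbol is matched by shape (one capital plus an optional lowercase letter, '*', or an aromatic c, n, o, s, p) rather than checked against the periodic table. The names `ATOM_PATTERN`, `charge_value`, and `parse_atom` are hypothetical.

```python
import re

ATOM_PATTERN = re.compile(r"""
    \[
    (?P<mass>\d+)?
    (?P<symbol>\*|[A-Z][a-z]?|[cnosp])      # shape only, not a real element check
    (?P<chiral>@@?)?                        # simplification: '@' or '@@' only
    (?P<hcount>H\d*)?                       # assumed form: 'H' plus optional digits
    (?P<charge>\+(?:\d+|\+*)|-(?:\d+|-*))?
    \]
""", re.VERBOSE)


def charge_value(text):
    """Convert '+', '++', '+2', '-', '--', '-2' (or None) to an integer."""
    if not text:
        return 0
    sign = 1 if text[0] == "+" else -1
    rest = text[1:]
    if rest.isdigit():
        return sign * int(rest)   # '+2' style
    return sign * len(text)       # '+' or '++' style: count the signs


def parse_atom(text):
    """Split a bracket atom such as '[235U]' into its components."""
    m = ATOM_PATTERN.fullmatch(text)
    if m is None:
        raise ValueError(f"not a bracket atom this sketch understands: {text!r}")
    hcount = m.group("hcount")
    if hcount is None:
        hydrogens = 0
    elif hcount == "H":
        hydrogens = 1             # a bare 'H' means one hydrogen
    else:
        hydrogens = int(hcount[1:])
    return {
        "mass": int(m.group("mass")) if m.group("mass") else None,
        "symbol": m.group("symbol"),
        "chiral": m.group("chiral"),
        "hcount": hydrogens,
        "charge": charge_value(m.group("charge")),
    }
```

With this, `parse_atom("[235U]")` reports a mass of 235 and the symbol 'U', and `parse_atom("[Na+]")` reports the symbol 'Na' with a charge of 1.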
Tokens are not randomly placed in a SMILES string. There is a pattern to how these tokens are arranged; an atom can follow another atom but a bond cannot follow another bond, and a close branch cannot follow an open branch. With some work it's possible to build a table listing which terms can and cannot follow another.
(previous term) | atom | bond | open branch | close branch | ring closure | dot disconnect |
start | C | no | no (see below) | no | no | no (see below) |
atom | CC | C=C | C(=O)[O-] | C(=O)[O-] | c1ccccc1 | C.C |
bond | C=C | no | no | no | C=1CCC=1 | no |
open branch | C(C) | C(=C) | no | no | no | C(.C) |
close branch | C(C)C | C(C)=C | C(C)(C)C | C(C(C))C | C(C)1ONON1 | C(C).C |
ring closure | C1CCC1 | C1=CCC1 | C1(CC)CC1 | C1CC(C1)C | C12CC1C2 | c1ccccc1.C |
dot disconnect | C.C | no | no | no | no | no (see below) |
Notes:
- Daylight's old documentation allows (OC)C as a valid SMILES but their implementation did not handle it. The new documentation (changed in late 2003, I think) doesn't mention that case. OpenEye does handle it, except that when I tested it in Sept. 2003 it didn't seem to handle chirality correctly with cases like (C/F)=C/F. I reported it but haven't tested it with a newer version of OEChem. In any case, no program should ever generate a SMILES like that (since Daylight will call it an error) so I recommend not handling it.
- Daylight doesn't accept anything which starts with a dot disconnect (as ".C") nor anything with two disconnects in a row (as "C..C"). OpenEye does accept them under its permissive parser but not its strict one. For now I'll be strict and say that they aren't allowed.
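The table and the strict reading of these notes can be written down as a lookup from each term to the set of terms allowed to follow it. This is only a sketch in Python: `FOLLOWERS` and `first_bad_transition` are illustrative names, the input is assumed to be a list of term names (with "element" counted as "atom"), and the table says nothing about which term may end a SMILES, so that is not checked.

```python
ANY = {"atom", "bond", "open_branch", "close_branch",
       "ring_closure", "dot_disconnect"}

# Row of the table -> columns that are not marked "no".
FOLLOWERS = {
    "start":          {"atom"},                        # no "(OC)C", no ".C"
    "atom":           ANY,
    "bond":           {"atom", "ring_closure"},
    "open_branch":    {"atom", "bond", "dot_disconnect"},
    "close_branch":   ANY,
    "ring_closure":   ANY,
    "dot_disconnect": {"atom"},                        # no "C..C"
}


def first_bad_transition(kinds):
    """Return the index of the first term that may not follow its
    predecessor, or None if every adjacent pair is allowed."""
    previous = "start"
    for index, kind in enumerate(kinds):
        if kind == "element":
            kind = "atom"        # organic-subset atoms count as atoms
        if kind not in FOLLOWERS[previous]:
            return index
        previous = kind
    return None
```

For the terms of "CC(=O)O" this returns None, while for an atom followed by two bonds it returns 2, the position of the second bond.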
Even this isn't enough to describe all valid SMILES strings. For example, "C)O" is allowed by the given rules, even though it obviously makes no real sense, and "ccC", while also legal, makes no chemical sense because the aromatic carbons must be in a ring. Still, there's a lot that can be done with just simple tokenization. The next step is to parse the token stream.
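To see why pairwise rules can't catch something like "C)O", note that rejecting it means remembering how many branches are currently open, which no table of adjacent terms can express. A small Python sketch of that one check follows; `branches_balanced` is a hypothetical helper that takes a list of term names, and it is not a full parser.

```python
def branches_balanced(kinds):
    """True if every close branch matches an earlier open branch
    and no branch is left open at the end."""
    depth = 0
    for kind in kinds:
        if kind == "open_branch":
            depth += 1
        elif kind == "close_branch":
            depth -= 1
            if depth < 0:        # a ')' with nothing to close, as in "C)O"
                return False
    return depth == 0
```

The terms of "C)O" (atom, close branch, atom) fail this check even though each adjacent pair is allowed by the table.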
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.
Copyright © 2001-2020 Andrew Dalke Scientific AB