/home/writings/diary/archive/2004/01/05/tokens

SMILES tokens

Okay, so you're a hard-core geek and you want to know how to write your own chemical informatics toolkit, so that you too will get fun, fame, fortune, and fem----... err, MOTAS. As far as I know, there is no documentation or textbook on this topic. The chemical informatics books I've seen all talk about the science of how the fields work, not the software, and of course almost no computer science book goes into chemistry, other than some mention that graph theory can be applied to the valence bond model of molecules. Let me be your trusty guide through these uncharted electrons.

I'll start with SMILES since it's such a pretty, well-defined nomenclature. Eventually I'll describe SD and perhaps mol2 files, and a bit of the agony that is the PDB. I'll assume you know what SMILES is. If not, Daylight has plenty of documentation.

If you have a compound which falls into Daylight's model of chemistry (i.e., covalent bonds, in a ground state, etc.) then it can be represented as a SMILES string, or alternatively you can say that it's represented in SMILES. A SMILES string can be broken down into smaller terms: atom, bond, open branch, close branch, ring closure, and dot disconnect. To make the enumeration easier to understand, I'll separate the atom term into "element" and "atom", where the first is an atom in the organic subset, which may be written without square brackets, and the second is an atom written inside square brackets.

What I've done here is break a SMILES string into its smallest parts. In linguistics these parts are called morphemes (or is that lexemes?). In computer science these are called tokens. Here are some SMILES strings and their tokens.

CCO
  element: 'C'
  element: 'C'
  element: 'O'

CC(=O)O
  element: 'C'
  element: 'C'
  open branch: '('
  bond: '='
  element: 'O'
  close branch: ')'
  element: 'O'

[Na+].[Cl-]
  atom: '['
    symbol: 'Na'
    charge: '+'
  dot disconnect: '.'
  atom: '['
    symbol: 'Cl'
    charge: '-'

[235U]
  atom: '['
    mass: '235'
    symbol: 'U'
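
Here's a minimal sketch of what a tokenizer for these terms might look like in Python. It is an illustration under a few assumptions, not a complete SMILES lexer: the names TOKEN_RE and tokenize are made up for this example, the set of bond characters is simplified, and a bracket atom is kept as one token instead of being split further into mass, symbol, and charge as in the listing above.

```python
import re

# One named group per token type. Two-letter organic-subset symbols come
# first so that "Cl" is not read as "C" followed by a stray "l".
TOKEN_RE = re.compile(r"""
    (?P<atom>\[[^\]]+\])                      # bracket atom, e.g. [Na+] or [235U]
  | (?P<element>Cl|Br|[BCNOPSFI]|[bcnops])    # organic subset, aliphatic or aromatic
  | (?P<bond>[-=\#:/\\])                      # explicit bond (simplified set)
  | (?P<open_branch>\()
  | (?P<close_branch>\))
  | (?P<ring_closure>%[0-9]{2}|[0-9])
  | (?P<dot_disconnect>\.)
""", re.VERBOSE)


def tokenize(smiles):
    """Return a list of (token_name, text) pairs for a SMILES string."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        m = TOKEN_RE.match(smiles, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {smiles[pos]!r} at position {pos}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for name, text in tokenize("CC(=O)O"):
    print(f"{name}: {text!r}")
```

The loop only splits the string into pieces. It makes no attempt to decide whether the pieces are in a sensible order.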

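Tokens are not randomly placed in a SMILES string. There is a pattern to how these tokens are arranged: an atom can follow another atom, but a bond cannot follow another bond, and a close branch cannot follow an open branch. With some work it's possible to build a table listing which terms can and cannot follow another. In the table, each row is the preceding term and each column is the term that follows it. A cell shows an example SMILES containing that pair, or "no" if the pair is not allowed.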

                atom    bond    open branch     close branch  ring closure  dot disconnect
start           C       no      no (see below)  no            no            no (see below)
atom            CC      C=C     C(=O)[O-]       C(=O)[O-]     c1ccccc1      C.C
bond            C=C     no      no              no            C=1CCC=1      no
open branch     C(C)    C(=C)   no              no            no            C(.C)
close branch    C(C)C   C(C)=C  C(C)(C)C        C(C(C))C      C(C)1ONON1    C(C).C
ring closure    C1CCC1  C1=CCC1 C1(CC)CC1       C1C(CC)1      C12CC1C2      c1ccccc1.C
dot disconnect  C.C     no      no              no            no            no (see below)
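
The table translates fairly directly into a lookup structure. The following is a sketch in Python, assuming the token types have already been reduced to the six column names, with "atom" covering both elements and bracket atoms. The names ALLOWED and first_bad_pair are invented for this example, and the table says nothing about which terms may end a string, so that is not checked here.

```python
# ALLOWED[previous] is the set of token types that may come next,
# copied row by row from the table ("no" cells are left out).
ANYTHING = {"atom", "bond", "open branch", "close branch",
            "ring closure", "dot disconnect"}

ALLOWED = {
    "start":          {"atom"},
    "atom":           set(ANYTHING),
    "bond":           {"atom", "ring closure"},
    "open branch":    {"atom", "bond", "dot disconnect"},
    "close branch":   set(ANYTHING),
    "ring closure":   set(ANYTHING),
    "dot disconnect": {"atom"},
}


def first_bad_pair(token_types):
    """Return (index, previous, current) for the first pair the table
    forbids, or None if every adjacent pair is allowed."""
    prev = "start"
    for i, tok in enumerate(token_types):
        if tok not in ALLOWED[prev]:
            return (i, prev, tok)
        prev = tok
    return None


# "C==C": the second bond follows a bond, which the table forbids.
print(first_bad_pair(["atom", "bond", "bond", "atom"]))
# "C)O": every adjacent pair is allowed, so nothing is reported.
print(first_bad_pair(["atom", "close branch", "atom"]))
```

A check like this looks at only two tokens at a time, so it cannot notice problems that depend on anything further back in the string.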


Even this isn't enough to describe all valid SMILES strings. For example, "C)O" is allowed with the given rules, even though it obviously makes no real sense, and "ccC", while also legal, makes no chemical sense because the aromatic carbons must be in a ring. Still, there's a lot that can be done with just simple tokenization. The next step is to parse the token stream.
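
To make the "C)O" case concrete, here is a small Python sketch of one check that pairwise rules cannot express: every close branch needs an earlier, still unmatched open branch. The function name branches_balanced is made up for this example, and it covers only this one condition. It says nothing about the "ccC" problem, which needs knowledge of rings.

```python
def branches_balanced(smiles):
    """Return True if every ')' closes an earlier '(' and none are left open."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A close branch with no matching open branch.
                return False
    return depth == 0


print(branches_balanced("CC(=O)O"))
print(branches_balanced("C)O"))
```

The counter is a bit of state carried along the whole string, which is exactly what a table of adjacent pairs does not have.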

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2020 Andrew Dalke Scientific AB