Craig James of eMolecules is leading the OpenSMILES effort to describe more fully how to use SMILES strings. More specifically, the goal is to reach a consensus on how to interpret and generate SMILES strings as they are currently used. It will codify existing practices and suggest best practices, without doing anything egregious to break compatibility with Daylight or OpenEye.
SMILES is a well-known text notation for chemical compounds that can be described by the valence bond model. That is, for those chemicals which can be written as a graph with atoms represented as nodes and bonds as edges. There are molecules which cannot be represented this way, but I don't want to get into those details now.
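To make the "atoms as nodes, bonds as edges" idea concrete, here is a minimal sketch of ethanol (SMILES "CCO") written out by hand as a graph. The names `atoms`, `bonds`, and `neighbors` are my own for illustration and don't come from any toolkit, and there's no SMILES parsing here, only the data structure a parser would produce.

```python
# Ethanol, SMILES "CCO", written out by hand as a graph.
# Hydrogens are left implicit, as they usually are in SMILES.
atoms = ["C", "C", "O"]          # nodes, indexed 0..2
bonds = [(0, 1, 1), (1, 2, 1)]   # edges: (atom index, atom index, bond order)

def neighbors(atom_index):
    """Return the indices of the atoms bonded to the given atom."""
    result = []
    for a, b, _order in bonds:
        if a == atom_index:
            result.append(b)
        elif b == atom_index:
            result.append(a)
    return result

for i, symbol in enumerate(atoms):
    print(i, symbol, "bonded to", [atoms[j] for j in neighbors(i)])
```

The middle carbon has two neighbors and the end atoms have one each, which is all the connectivity "CCO" encodes.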
The original SMILES paper, authored by Dave Weininger, came out in 1988 and many people have implemented readers and writers for it. The authoritative implementation is in the Daylight toolkit, which is a commercial, closed-source product by the company co-founded by Dave. The algorithm has never been fully published, and some of the details on how to read and generate SMILES are ambiguous.
For example, the original paper has some limitations. Some are in the syntax: it supports neither chirality nor isotopes (both added later as "isomeric SMILES"), nor the more recent class notation. It also doesn't do a good job of defining how to interpret aromaticity. These are fully described on the SMILES page at Daylight.
One ambiguous part about interpretation is in how to handle aromatics. The Weininger paper uses the SSSR (smallest set of smallest rings) along with Hückel's rule. SSSR was popular in the 1980s, but not now. For an opposing viewpoint read OpenEye's "Smallest Set of Smallest Rings (SSSR) considered Harmful" essay, presently here. (That's an autogenerated URL which will change over time.)
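Hückel's rule itself is the easy part: a planar, fully conjugated ring is considered aromatic when its π electron count has the form 4n + 2. Here's a sketch of just that arithmetic check. The function name `satisfies_huckel` is hypothetical, and the electron count is taken as given. Choosing which rings to look at and deciding how many electrons each atom contributes is where implementations can differ, and none of that is shown here.

```python
def satisfies_huckel(pi_electrons):
    """True if the count can be written as 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0

# pi electron counts supplied by hand, not computed from a structure
for name, count in [("benzene", 6), ("cyclobutadiene", 4), ("cyclooctatetraene", 8)]:
    print(name, count, satisfies_huckel(count))
```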
If you look at the Daylight release notes you'll see some of the problems worked out. One was in not using a stable sort for one of the calculations, another was in interpreting non-aromatic ring bonds connecting two aromatic atoms (as in fluorene, specifically the base of the 5-membered ring joining the two 6-membered rings).
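To show why a stable sort matters for reproducible output, here's a small sketch. It is not Daylight's algorithm, which isn't published. The `(rank, label)` pairs are made up for illustration. The point is only that when two items tie on the sort key, a stable sort keeps them in input order, while an unstable sort is free to swap them, so the result can change from one run or platform to the next.

```python
# Hypothetical (rank, label) pairs; "b" and "c" tie on rank.
ranked_atoms = [(2, "a"), (1, "b"), (1, "c"), (0, "d")]

# Python's sorted() is guaranteed to be stable, so items that tie
# on the key keep their input order.
stable = sorted(ranked_atoms, key=lambda item: item[0])
stable_labels = [label for _rank, label in stable]

# An explicit tie-breaker makes the order independent of both the
# input order and the stability of the sort algorithm.
tie_broken = sorted(reversed(ranked_atoms), key=lambda item: (item[0], item[1]))
tie_broken_labels = [label for _rank, label in tie_broken]

print(stable_labels)
print(tie_broken_labels)
```

Both lists come out as d, b, c, a. Without the tie-breaker, an unstable sort could just as legitimately give d, c, b, a.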
Because of the ambiguities in SMILES, the different chemical informatics projects aren't fully interoperable. Some of the open source projects are Open Babel (C++), the Chemistry Development Kit (CDK) (Java), RDKit (C++), as well as FROWNS (Python). (I contributed the name and the SMILES and SMARTS parsers to Brian Kelley's FROWNS project.)
Brian's working for OpenEye now and there's no official support for FROWNS, but otherwise there are people from each of the projects involved in OpenSMILES. From the discussions on the mailing list, I think it's very likely to succeed, and many kudos to Craig for actually going through with this and doing the hard part of consensus building and spec writing.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2010 Dalke Scientific Software, LLC.