OpenSMILES

[ previous | next ] /home/writings/diary/archive/2007/11/30/opensmiles

OpenSMILES

(All you all reading this from a Python blog, this is what a detailed discussion in chemical informatics looks like :)

SMILES [in Wikipedia] has a long and venerable history in chemical informatics. It's a line notation, which means it's a way of representing some aspects of a molecular structure as a line of text. For details take a look in my archive and read the essays between "Drawing Molecules" and "Computers -- History of Chemical Nomenclature."

There are three primary documents regarding how to intepret SMILES: the original Weininger paper in JCICS, 1988; the Daylight web site; and the Weininger article in "The Chemoinformatics Handbook", edited by Gastieger. The original paper is somewhat obsolete as it does not include the relatively new field for atom class number. The web site should be the most authoritative of these documents. All are useful in providing insight into how to use SMILES.

What's wrong with the current SMILES definition(s)?

From an implementation viewpoint, the specification lacks some essential detail. The grammar is mostly defined in English rather than using a more formal langauge like BNF. Take for example the definition for an atom inside of square brackets, which is a BNF:

atom     		:	SYMBOL
			|	[ WEIGHT SYMBOL mods ]
			|	[ WEIGHT SYMBOL mods : CLASS ]
			;
mods                    :       mod mods
			;
mod                     :       HCOUNT | CHARGE | CHIRAL
 			;

CLASS = non-negative integer class value.
WEIGHT = atomic weight.
SYMBOL = atomic symbol.
HCOUNT = Atom hydrogen count specification.
CHARGE = Atom charge specification.
CHIRAL = Atom chirality specification.

Nowhere in the BNF nor the English does the specification say that the weight, the mods and the CLASS field are optional. That's only inferrable from the examples, and it would be nice if that was explicit. The grammar implies that the mods can be in arbitrary order, which Daylight accepts, but the grammar also implies that a mod can be present multiple times, which Daylight rejects. As it turns out, all SMILES I've ever seen are given in the order CHIRAL, HCOUNT, CHARGE and I would prefer that that order be preferred, even to the exclusion of allowing other forms. (Excepting perhaps some compatibility mode.)

There's ambiguity in other parts. Is the empty string a valid SMILES? According to the "Reaction SMILES Grammar":

reaction		:	reactant '>' agent '>' product
			|	reactant '>>' product
			;
reactant,
agent,
product			:	molecule
                        |       <null>
			;
molecule		:	SMILES
			;

This implies that SMILES is not null. I think it should be allowed. The Daylight toolkit does not, but PyDaylight wrapper I wrote on top of the Daylight toolkit does. Note: the SMILES file format does not handle empty SMILES. I find that limitation perfectly acceptable.

Allowing the empty string has other implications, especially related to dot disconnects. Is ".C" allowed? What about "C..C"? The Daylight spec specifically says that dot only applies to "adjacent atoms". However, it doesn't say if "C(=C)(.C)" is allowed or not. After all, in that case the "." is not between adjacent atoms. For that matter, the only way to find out what's allowed inside of the parenthesis is by looking at the examples in the spec and inferring. Nothing in spec mentions the "(.C)" case. But the Daylight toolkit as well as others allow it, and the result is the same as "C=C.C".

Roger Sayle some years back did a thorough analysis of how different toolkits interpret SMILES. OpenEye's toolkit is pretty promiscuous and that table shows that it supports strings which I don't consider to be valid SMILES strings, like "C1CC(1)" and "C(C)1CC1". If you try out the toolkit you'll see that even in strict mode it accepts "C-=C" and interprets it as "C=C" while swapping the bonds yields "CC" - last bond wins. I don't like promiscuity in strict mode for the same reason browser authors don't like having to deal with bad HTML meant to hack around bugs in Netscape.

OpenEye doesn't fully define what it means to be a valid SMILES string for them. They point to Daylight's page and list the extensions they support, but as I just showed they allow strings which most would say are not valid SMILES.

Other toolkits also have extensions to SMILES: Open Babel has a radical notation, and RDKit has special atoms for attachement points, though they would rather now use the atom class in the more recent Daylight SMILES definition. Perhaps the most common extension is to allow more "aromatic" element types, such as aromatic silicon.

I'm not saying that a toolkit shouldn't have some provision to do a best-guess at how to fix an invalid SMILES string, just like web browsers do some cleanup of invalid HTML. But if the ambiguities are never resolved then how does anyone know what is "correct" vs. "cleanup"? If Open Babel wants to implement the OpenEye SMILES, should it be compatible with all of OpenEye's cleanup code as well? Down that way lies madness. Someone who is the right sort of mad could develop a syntax-level tool that does to SMILES what HTML Tidy does to HTML; convert a somewhat invalid SMILES into a valid one, but that effort should not be required in everyone who wants to implement SMILES.

OpenSMILES

The OpenSMILES effort started as a way to resolve these ambiguities, with the hopes that the different open source projects, and perhaps closed ones as well, would switch to that definition. This would help make the open source projects more compatible, so that SMILES is interpreted the same across the different platforms. Craig James has taken the lead in writing the spec and organizing the feedback.

My hope too is that such a specification could even deprecate parts of SMILES, mostly support for charges as "[C--]" instead of "[C-2]." The spec should also provide guidance about how to generate SMILES output, most specifically that a single bond between two aromatic atoms must be written as an explicit "-".

I contributed a BNF grammar for my interpretation of SMILES. As far as I know, no one has reviewed it and implemented it themselves. I wouldn't be surprised as BNF grammars aren't part of what most computational chemists learn. That's why I write essays describing how it works. :) Still, before this specification is complete I would like to see one or two other people use the grammar to implement their own parser. Else I don't think it's been tested enough.

I also wrote a lot of commentary on the mailing list.

Others have posted to the list as well, with different viewpoints. Some want OpenSMILES to be the new de facto group in charge of SMILES. I'm happy to let Daylight keep the keep the lead in that. As I've stated, I'm more of a descriptivist than a proscriptivist.

Some want to extend SMILES to support new types of chemistry. There are well-known limitations in SMILES, many of which are due to its assumption of the valence bond model. General consensus though is to develop a specification for SMILES as it's used now.

Aromaticity

The most difficult problems have been with aromaticity. Aromaticity has no real definition. Different subfields have different definitions. Quoting from Weininger's Wiley article, which starts off a bit tounge-in-cheek:

What does aromatic mean anyway? "Aromatic" means "it smells nice." No kidding, that is the only defensible definition. There is no single rigorous definition of aromticity in chemistry. To a synthetic chemist, aromaticty implies something about reactivity, to a thermodynamicist about heat of formation, to a spectroscopist about NMR ring current, to a molecular modelist about geometrical planarity, to a cosmetic chemist it probably means "smells nice." The SMILES definition of aromaticity has nothing to do with the other definitions, except that we would all agree that benzene is "aromatic".

It then goes on to say that SMILES "was specifically designed to be 'canonicalizable'... This implies a fundamental requirement to express the symmetry of a molecule correctly." Sounds like a very nice goal, but how to do it?

To keep from using the highly loaded work "aromatic", I'll call the Daylight solution "Hückelicity". Find the smallest set of smallest rings (SSSR). For each ring check that all atoms are "sp² hybridized and the number of available 'excess' π-electrons ... satisfy Hückels 4N+2 criterion." Continuing to quote from Weininger's Wiley article:

The rules become slightly more complicated for heterocycles. Oxygen and sulfur can share a pair of π electrons. Nitrogen can also share a pair, if three-connected as in [nH]1cccc1 methylpyrrold, otherwise sp² nitrogen shares just one electron (as in n1ccccc1 pyridine). An exocyclic double bond to an eletronegative atom "consumes" one shared π-electron, as in O=c1ncccc1 2-pyridone or O=c1oc2ccccc2cc1 coumarin. But that's is about it. Add up the electrons in rings (and ring systems, such as azulene); if they meet the 4N+2 criterion, it is "aromatic".

This definition might make sense to a chemist, but I want to see a definition which is more algorithm based. Once I've found the SSSR, how do I figure out which are sp²? How do I count the number of excess π electrons in the ring? What are the electonegative atoms? How do I handle "*"? Is c1c*ccc1 considered valid and aromatic, as per the Daylight spec? (OpenEye does not support it.) What about *1*****1?

The goal of Daylight is to make canonical SMILES. They use it for database searching. Canonicalness is built into their toolkit such that there's no way to have a valid ("mod is off") molecule which disagrees with their Hückelicity model. All molecules that come in get reinterpreted to fit this definition. For example, if the input structure is c1ccc1, which is not aromatic by the Hückel definition, then it will get reinterpreted to C1=CC=C1.

OpenEye adds a new perspective into parsing SMILES strings. The have a molecular graph which expresses the valence bond model but they do not enforce a strong chemistry model. "aromatic" is little more than a boolean flag. Instead of doing the full Hückelicity perception> they do the less complicated and faster Kekulé perception. After converting the SMILES string to the molecular graph it verifies that the aromatic bonds have a self-consistent assignment to single or double bond. Multiple solutions may exist; the only criterion is that one exists.

(As an aside, the current implementation only verifies that bond assignment can be done. It doesn't check that each aromatic atom has an aromatic bond.

>>> mol = OEMol()
>>> OEParseSmiles(mol, "c-1-c-c-c-c-c1")
True
>>> OECreateCanSmiString(mol)
'c-1-c-c-c-c-c1'
>>>

This is probably a bug. Or at least another corner case that should be pinned down.)

With this approach, "aromaticity perception" is simple - it's what's specified in the SMILES, verified for self-consistency. Because different people have different ideas of what "aromatic" means, it's also best that a toolkit provide algorithms to reinterpret the aromaticity, and better yet, to provide multiple algorithms, which the default being the preferred one. OpenEye does exactly this in OEAssignAromaticFlags, which implements the Daylight, Tripos, MMFF and MDL definitions.

Open Babel and CDK follow the Daylight approach. They reperceive Hückelicity on all input structures. I've been advocating that reperception is not an essential part of the requirement for parsing SMILES strings, and this essay is part of my argument.

Several of the people have been against this, saying that without full reperception there's no way to do valid chemistry, and that all toolkit should reperceived in order to ensure that all compounds are interpreted the same way. I point out that OpenEye is a pretty big real-world existence proof to counter the first objection, and if doing a SMILES to MDL or Tripos format conversion then the perception should use the algorithms for those formats, and not for Daylight.

One of the participants has an very different objection. He wants a different interpretation of what lower case symbols means in a SMILES string. His very simple and easy to understand definition is "lower the implicit hydrogen count by one." This makes it easier to support radicals but is incompatible with how current SMILES systems work.

For examples: c1ccccc1 is currently the same as [cH]1[cH][cH][cH][cH][cH]1 but is not the same as [CH]1[CH][CH][CH][CH][CH]1. (The latter is not chemically realistic.) It's incompatible with SMARTS that use aromatic terms, except by doing the full reperception. And without guidance on how to generate the output, an input SMILES of [NH2] might easily be turned into the radical form n, which is not supported by any existing SMILES parser.

Without knowledge of how much and what kinds of SMILES will break, I just can't agree with an untested reinterpretation of SMILES.

Followup to Rich Apodaca

Rich Apodaca wrote an article titled "SMILES and Aromaticity: Broken?" It refers to this ongoing discussion in the OpenSMILES list. Here are my comments to that posting.

SMILES is not a relic of the 80s

In the section "A Little About SMILES", Rich correctly mentions the original reason for SMILES, which as I understand it was to simplify structure input for the work that Weininger was doing at the EPA. SMILES is much easier than WLN, could be inputted by non-specialists with only a few hours of training, and did not require a graphical entry system.

A mark of a useful system is if it's easily repurposed to other tasks. SMILES is useful for more than data entry. In addition to being simple to enter, it takes up less space. Storage space on corporate machines in the 1980s was more than "kilobytes". In 1988 I had a 30MB hard drive in my PC. Other chemistry systems used that disk space to store connection table formats, which are easier to parse than SMILES but not as compact. What made Daylight successful was the ability to store structures in-memory using SMILES so there was no need for storage. Everything could be done in-memory. Without disk seeks, searches become 100x or 1000x faster.

That's still an advantage to this day, and nothing to do with machines in the 1980s being "toys". As of 18 November 2007 Pubchem had 11 million compounds. Assuming 50 bytes per SMILES including a unique key means the entire data set can be held in memory in about 0.6 GB. 50 bytes is smaller than storing the parsed data structure, so if you can parse SMILES really fast then it's better to keep the SMILES in memory and rebuild the data structure when needed instead of doing the page swap or disk read needed to bring it in from disk. CPUs are fast, and doing more work can be a valid trade-off over doing a memory fetch, which is about 100 cycles. Doing a disk read is several orders of magnitude slower still. Daylight and OpenEye have really fast parsers, because those developers come from this mindset. Open Babel and CDK do not.

Roger Sayle, who did the comparison of how different parsers interpret SMILES, wrote that table when it worked at Metaphorics, which was a sister company to Daylight. He and Dave worked on a compression algorithm and to Dave's surprise, parsing from compressed SMILES was faster than from uncompressed SMILES, because the bandwidth to the disk was such that the differential savings in disk read more than made up for the extra work needed to decompress the SMILES.

SMILES remains popular amoung computational chemists because it is small. It can be used in a spreadsheet cell, added as a field in an SD file, and easily passed in on the command-line. A connection table format like an SD file cannot. It's easy to parse and generate by hand (which InChI is not) for those rare times when that's needed. It's flexible, so can be used for simple combinitorial library generation and stranger hacks.

BTW, the MDL file formats are also a well-published de facto standard. I have the multipage document describing the format somewhere on my hard drive, which I downloaded from the MDL site for free.

Interpreting `c1ccc1`; a two part scheme

Rich points out that Daylight supports "c1ccc1" as a valid input structure, and converts it to "C1=CC=C1" despite that the input structure does not obey Hückel's 4N+2 rule. He asks how that can be.

My belief is that it's a two part resolution. The first is to convert the structure into a Kekulé form. It doesn't matter which one (and this causes breakage later on). The second is to use the Kekulé structure as the base for aromaticity perception. This interpretation is compatible with the Daylight behaviour, which does both steps. Leaving out the second step gives the OpenEye behaviour. (As I mentioned earlier, OpenEye does not enforce a given aromaticity interpretation.)

Rich quotes Weininger 1988:

Entries of c1ccc1 and c1ccccccc1 will produce the correct antiaromatic structures for cyclobutadiene and cyclooctatetraene, C1=CC=C1 and C1=CC=CC=CC=C1, respectively

then goes on to assume there's some underlying but unclarified concept of "antiaromaticity" in the Daylight toolkit. I think Rich makes the assumption that Dave's comment is truly correct, when I think it's a surface coincidence. Yes, I think Dave was wrong when he wrote that about 20 years ago.

As a counterexample, tag each of the atoms with an isotopic number. [10cH]1[11cH][12cH][13cH]1. Better yet, tag it two different ways, as [10cH]1[11cH][12cH][13cH]1.[11cH]1[12cH][13cH][10cH]1. These structures are identical, right?, and should be displayed the same in any program which understand the original chemistry rather than reinterpreting it. Here's what Daylight's depict does to it:

You can see that the bond between ¹⁰C and ¹³C in the first is a double and in the second is a single. The Daylight toolkit breaks the symmetry in such a way that the intial atom ordering produces different canonical SMILES. This is because the intial SMILES uses a (probably incorrect) aromaticity model which assumes that cyclobutadiene is aromatic. Daylight's model doesn't, but because of underlying symmetries in the molecule you can't tell that there's a problem if you only look at the non-isotopic SMILES in the original paper.

For comparison, here's OpenEye's OEChem on the same topic:

>>> mol = OEMol()
>>> OEParseSmiles(mol, "[10cH]1[11cH][12cH][13cH]1")
True
>>> OECreateIsoSmiString(mol)
'[10cH]1[11cH][12cH][13cH]1'
>>> OEAssignAromaticFlags(mol)
>>> OECreateIsoSmiString(mol)
'[10CH]1=[13CH][12CH]=[11CH]1'
>>> 
>>> mol = OEMol()
>>> OEParseSmiles(mol, "[11cH]1[12cH][13cH][10cH]1")
True
>>> OECreateIsoSmiString(mol)
'[10cH]1[11cH][12cH][13cH]1'
>>> OEAssignAromaticFlags(mol)
>>> OECreateIsoSmiString(mol)
'[10CH]1=[11CH][12CH]=[13CH]1'
>>>

Notice how OpenEye does not by default assert its chemistry model on the system, so that parsing preserves the original aromatic assignment? Assuming the aromaticity was correct lets OEChem generate a correct canonical SMILES for that system. But since the structure does not have a Hückel form, the aromaticity assignment must break the symmetry, which is why the post-assignment canonical SMILES are different. And it's why ultimately the user must be responsible for deciding which aromaticity model to use.

I forced the problem by using isotopes, but here's an example without them:

c12c3c4c1.C2=O.S3.C4O
c14c3c2c1.C2=O.S3.C4O

I wrote this in a strange way (one of the nice hacks you can do with SMILES) to emphasize the similarities between them. The only differences are in the ring closure numbers in the ring term. The result, according to Daylight, is

Daylight breaks the original symmetry. The fact that Dave is correct for cyclobutadiene is mostly a coincidence. Daylight's acceptance of c1ccc1 is almost certainly because they convert the input to Kekulé form before doing aromaticity perception, which is then used for depiction. Therefore there's no contradiction with quinone represented as non-aromatic. By Daylight's rules, it's not.

BTW, if you give Daylight's depict O=c1ccc(=O)cc1 which asserts that quinone is aromatic, it will display the result in non-aromatic form. OpenEye accepts it as well. Then again, both toolkits accept c1cccCC1, which I wouldn't consider as valid, and Daylight depicts it as C1C=CC=CC1.

Followup to some questions of Egon

Egon Willighagen in a blog post pointed out the CDK definition of aromaticity, which he wrote. In the latter he makes the argument that lower case is the same as sp² hybridized and asks some questions:

Why use Hueckle detection if lower case does it already?
Because different systems and different people have different ideas of what it means to be aromatic. Daylight's philosophy is to reintepret all SMILES using their own definition. This is not always appropriate; eg, read SMILES but generate Tripos mol2 files, or read two SMILES each generated from a different chemistry program and verify that they implement the same aromaticity perception algorithm. (Can't do that comparison if you reinterpret all input.)
So lower case would mean 'aromatic OR anti-aromatic'
I think it means "make sure you can generate a Kekulé structure." I haven't figure out what's entailed by that, and it may be the same as "aromatic OR anti-aromatic." However, Daylight and OpenEye allow all of c1ccccCCC#CCCccc1, c1ccccCCC=CCCccc1, c1ccccCCC-CCCccc1 and c1ccccCCC.CCCccc1. So it's perhaps nothing to do with anti-aromatic, and more likely just bond type assignment.
Try 'cc' in Daylights Depict
Is this something you consider a valid SMILES? Or is this Daylight (and OpenEye) accepting an invalid SMILES? Or a SMILES being used to describe something non-physical? OpenEye accepts cc(c)(c)c but Daylight does not. Daylight accepts nn and interprets it as [nH]=[nH], [n][n] as [N]=[N] and [nH2][nH2] as [NH2]-[NH2]. Are the 'n's in each case sp²?

Maybe the definition is to say "they are sp²." I just want a more concrete, algorithm-oriented and written down way to figure out which SMILES strings are valid and how to do bond type assignment based on aromaticity in such a way that the result is closely compatible with OpenEye. Say, less than 99.9% of the compounds differ, up to isomorphism, when doing round-trip processing of entires from PubChem, which uses OEChem under the covers. Some specification that I or others can easily translate to code, and be able to compare the results to each other, without having to make a lot of guesses.

OpenSMILES

What's wrong with the current SMILES definition(s)?

OpenSMILES

Aromaticity

Followup to Rich Apodaca

SMILES is not a relic of the 80s

Interpreting c1ccc1; a two part scheme

Followup to some questions of Egon

Interpreting `c1ccc1`; a two part scheme