Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2007/11/11/gcc_poster

Poster text for the German Conference on Chemoinformatics

I will be presenting a poster at the 2007 German Computational Chemistry conference. titled Parsing SMILES and SMARTS. When I make a talk or a poster I often write down what I want to talk about. This helps me figure out how I want to structure things, clarify the details, and work on flow. What follows is my preparation text for the GCC poster, in its draft state. There's quite a bit of it, and while I'm publishing to an A0 poster the point is to have enough text and big enough that people can figure out if they want to talk with me about it.

Parsing SMILES and SMARTS

SMILES and SMARTS are two line notations developed by Daylight and implemented in a number of chemical informatics software tools. For the most part these are hand-written parsers and quite complicated and hard to read, modify, maintain, optimize or reuse. A traditional computer science approach would use a parsing system like lex/yacc but that has not made much inroad in computational chemistry. The tools have been difficult to use, especially for error reporting and recovery, and most of the developers have a chemistry background and don't know the language theory underlying this approach.

Modern programming languages and parser systems have made many of the difficulties disappear. I have been working with people from OpenSMILES and several of the existing open source toolkits (Open Babel, CDK and RDKit) to develop valid, useful grammars for SMILES and SMARTS. I have also been evaluating how to implement those grammars using parsing systems like ANTLR, PLY and ragel. My plan is to fold that work back into the different projects so there is a broader and more consistent support for these two important notations. I expect also that resulting code will be faster, more maintainable, and more flexible for trying new ideas. By documenting the different parts I hope the knowledge of how to use parser frameworks is disseminated into the computational chemistry development community and helps to develop the next generation of chemistry toolkits and line notations like MQL.

Existing SMILES parsers

What does a hand-written parser look like?

This comes from the CDK parser for the atoms in the "organic subset", which can be used outside of [square brackets]

mychar = smiles.charAt(position);
logger.debug("");
logger.debug("Processing: " + mychar);
if (lastNode != null) {
        logger.debug("Lastnode: ", lastNode.hashCode());
}
if ((mychar >= 'A' && mychar <= 'Z') || (mychar >= 'a' && mychar <= 'z') ||
                (mychar == '*')) {
    status = 1;
    logger.debug("Found a must-be 'organic subset' element");
    // only 'organic subset' elements allowed
    atom = null;
    if (mychar == '*') {
        currentSymbol = "*";
        atom = builder.newPseudoAtom("*");
    } else {
        currentSymbol = getSymbolForOrganicSubsetElement(smiles, position);
        if (currentSymbol != null) {
            if (currentSymbol.length() == 1) {
                if (!(currentSymbol.toUpperCase()).equals(currentSymbol)) {
                    currentSymbol = currentSymbol.toUpperCase();
                    atom = builder.newAtom(currentSymbol);
                    atom.setHybridization(CDKConstants.HYBRIDIZATION_SP2);
                } else {
                    atom = builder.newAtom(currentSymbol);
                }
            } else {
               atom = builder.newAtom(currentSymbol);
            }
            logger.debug("Made atom: ", atom);
        } else {
            throw new InvalidSmilesException(
        "Found element which is not a 'organic subset' element. You must " +
        "use [" + mychar + "].");
            ....

Here's the corresponding code from RDKit, which is in C++ and uses yacc and lex. The lexer matches the atoms:

B  | C  | N  | O  | P  | S  | F  | Cl | Br | I {
        yysmiles_lval.atom = new Atom( PeriodicTable::getTable()->getAtomicNumber( yytext ) );
        return ORGANIC_ATOM_TOKEN;
}
H  {  return H_TOKEN; }
c   { yysmiles_lval.atom = new Atom ( 6 );
      yysmiles_lval.atom->setIsAromatic(true);
      return AROMATIC_ATOM_TOKEN; 
    }
n   { yysmiles_lval.atom = new Atom( 7 );
      yysmiles_lval.atom->setIsAromatic(true);
      return AROMATIC_ATOM_TOKEN; 
    }
   ... and similar for 'o', 'p', and 's' ...
and the parser has
atomd:  simple_atom
  | ATOM_OPEN_TOKEN charge_element ATOM_CLOSE_TOKEN  {
    $$ = $2;
    $2->setNoImplicit(true);
};

simple_atom:      ORGANIC_ATOM_TOKEN | AROMATIC_ATOM_TOKEN ;

Disadvantages to either approach

Disadvantages to hand-written parsers:

Disadvantages in using parser systems:

Parsers

There are a huge number of parsing tools and different ways to parse. The traditional solution is the yacc/lex combination which is a bottom-up parser that tends to generate parsing code that humans can't read. Error handling with yacc/lex is very hard, especially since they generate output for C/C++, which are languages without garbage collection support.

PLY is a Python tool in the yacc family which I'll use for a few examples here. It has very good error reporting. It uses Python so there's no worries about leaking memory, but Python is not a fast language for general purpose parsing.

ANTLR is a more modern Java tool for parser generation. It generates top-down/recursive descent parsers which are easier to read but traditionally slower than bottom-up parsers. ANTLR is a much more complete system for parsing and includes an IDE and debugger. ANTLR's most notable feature for this project is its support for multiple output languages.

Lexers and parser state

SMILES and SMARTS are context free grammars because they require balanced parenthesis. Excepting that, the language can be represented as a regular expression and parsed using DFA tools along with some extra processing to match parenthesis. Most hand-written parsers use this approach.

SMILES and SMARTS are actually somewhat tricky to parse with the standard CS parser. The token "H" may mean "hydrogen atom" or "implicit hydrogen count" and "-" may mean "charge of -1" or "single bond". As RDKit shows, this can be solved by using parser states: in essence, use a different lexer for different parts of the string. FROWNS goes one step further and replaces the parser with a simple state machine, followed by post-processing.

Because this parsing problem is so heavily lexer dependent, I've been evaluating ragel as a parser generator. It's used in many high-speed parsing projects and produces parsers for C, C++, Objective C, D, Java and Ruby.

(Another possibility is to remove the lexer entirely, pass single character tokens to the parser and let it handle everything. I know of no one who has tried this.)

OpenSMILES

Weininger's 1988 JCICS paper defined most of the SMILES grammer and its interpretation. It contained some ambiguities, especially related to handling aromaticity, which were worked out over time. Various groups extended SMILES to support, for example, atom classes, quadruple bonds, R-groups and an extended aromatic model. Craig James of eMolecules started the OpenSMILES effort, under the aegies of the Blue Obelisk project, to develop an updated SMILES specification. The participants include people from several of the open source chemistry projects and pharmas. For information, see OpenSMILES.org.

A secondary goal is to develope a set of guidelines and recommendations for implementors. In my review of different SMILES parsers I see they handle corner cases and error cases differently. For example, is "%00" a valid ring closure, and does the parser properly identify "%1A" as an error? I plan to include a test set to help the OpenSMILES members identify ambiguous areas and so developers can validate their parsers.

The grammar here is based on the current OpenSMILES specification. It's written for PLY, a Python parser based on yacc, and should be easily to port to other yacc-style LALR(1) or SLR parsers.

tokens = ("ISOTOPE", "SYMBOL", "HCOUNT", "CHARGE", "CHIRAL",
          "END_ATOM", "BEGIN_ATOM", "BOND", "ORGANIC", "OPEN",
          "CLOSE", "DOT", "CLASS", "RNUM")

# Use three lexer states (the default plus two more) to simplify
# tokenization.  Eg, does "-" mean "charge -1" or "single bond"?
states = (
    ('pre', 'exclusive'),  # after parsing a '[', up to the atomic symbol
    ('post', 'exclusive'), # after parsing the atomic symbol, up to the ']'
)

def t_error(t):
    raise SyntaxError(t.value[0])
t_pre_error = t_post_error = t_error

#### The section parses the terms inside of []s

def t_BEGIN_ATOM(t):
    r"\["
    t.lexer.begin("pre")  # switch to a new lexer state
    return t

# The isotope number
t_pre_ISOTOPE = r"\d+"

# All of the atomic symbols
def t_pre_SYMBOL(t):
    (r"C[laroudsemf]?|Os?|N[eaibdpo]?|S[icernbmg]?|P[drmtboau]?|"
     r"H[eofgs]?|c|n|o|p|se?|as|A[lrsgutcm]|B[eraikh]?|D[ybs]|E[urs]|"
     r"F[erm]?|G[aed]|I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenafg]|"
     r"T[icebmalh]|U|V|W|Xe|Yb?|Z[nr]|\*")
    t.lexer.begin("post")
    return t

# Implicit hydrogen count
t_post_HCOUNT = r"H\d?"

# Charges
t_post_CHARGE = r"\+([0-9]|\+)?|-([0-9]|-)?"

# Chiral terms
t_post_CHIRAL = (r"@(@|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|"
                 r"OH(1[0-9]?|2[0-9]?|30?|[4-9]))?")

# Class membership
t_post_CLASS = r":\d+"

# End of the ']'; back to the main lexer
def t_post_END_ATOM(t):
    r"\]"
    t.lexer.begin("INITIAL")
    return t

### The lexer terms outside of []

# These are the atoms in the "organic" subset
t_ORGANIC = r"Cl?|Br?|[cnospNOFPSI]"

# open and close branch
t_OPEN = r"\("
t_CLOSE = r"\)"

# Includes the bond type as part of the token
t_RNUM = r"[/\\=#$~-]?(\d|%\d\d)"

# Match the bonds.  Because of the first-match-wins
# matching rule this must come after RNUM.
t_BOND = r"[/\\=#$~-]"

# Dot disconnect
t_DOT = r"\."

# Ignore everything after a whitespace (space or tab)
# This is the common practice, and is specified in the
# original SMILES paper
def t_COMMENT(t):
    r"[ \t](.|\n)*"
    pass

####### The grammar rules

# Allow "" as a valid SMILES string
def p_smiles(p):
    """smiles : chain
              | empty
              """
def p_empty(p):
    'empty :'

def p_branch(p):
    "branch : OPEN chain CLOSE"    # (C)

def p_branch_with_bond(p):
    "branch : OPEN BOND chain CLOSE"  # (=C)

def p_branch_with_dot(p):
    "branch : OPEN DOT chain CLOSE"  # (.C)

# A "chain" is a list of 1 or more "connected_atom"s.
#   The "connection" is a bond or a dot.
def p_chain_start(p):
    "chain : connected_atom"  # C

def p_chain_implicit_bond(p):
    "chain : chain connected_atom"  # CC

def p_chain_explicit_bond(p):
    "chain : chain BOND connected_atom" # C-C

def p_chain_dot_disconnect(p):
    "chain : chain DOT connected_atom"  # C.C

# An "atom" is an atom in [] or an ORGANIC
# A "ringed_atom" is an atom followed by 1 or more ring connections
# A "branched_atom" is an atom or ringed_atom followed by one or more branches
# A "connected_atom" is any of an atom, ringed_atom, or branched_atom
def p_connected_atom(p):
    """connected_atom : branched_atom   # C(C)
                      | ringed_atom     # C1
                      | atom     # C
                      """
def p_branched_atom(p):
    """branched_atom : atom branch   # 'C' '(C)'
                     | ringed_atom branch   # 'C1' '(C)'
                     | branched_atom branch  # 'C(C)' '(C)'
                     """
def p_ringed_atom(p):
    """ringed_atom : atom RNUM  # 'C' '2'
                   | ringed_atom RNUM  # 'C2' '3'
                   """
# The full combinitorics.  (PLY does not support '?' otherwise this would be
# atom: BEGIN_ATOM ISOTOPE? SYMBOL CHIRAL? HCOUNT? CHARGE? CLASS? END_ATOM
def p_atom(p):
    """atom : BEGIN_ATOM         SYMBOL                            END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL                            END_ATOM
            | BEGIN_ATOM         SYMBOL CHIRAL                     END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL CHIRAL                     END_ATOM
            | BEGIN_ATOM         SYMBOL        HCOUNT              END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL        HCOUNT              END_ATOM
            | BEGIN_ATOM         SYMBOL CHIRAL HCOUNT              END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL CHIRAL HCOUNT              END_ATOM
            | BEGIN_ATOM         SYMBOL               CHARGE       END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL               CHARGE       END_ATOM
            | BEGIN_ATOM         SYMBOL CHIRAL        CHARGE       END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL CHIRAL        CHARGE       END_ATOM
            | BEGIN_ATOM         SYMBOL        HCOUNT CHARGE       END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL        HCOUNT CHARGE       END_ATOM
            | BEGIN_ATOM         SYMBOL CHIRAL HCOUNT CHARGE       END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL CHIRAL HCOUNT CHARGE       END_ATOM
            | BEGIN_ATOM         SYMBOL                      CLASS END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL                      CLASS END_ATOM
            | BEGIN_ATOM         SYMBOL CHIRAL               CLASS END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL CHIRAL               CLASS END_ATOM
            | BEGIN_ATOM         SYMBOL        HCOUNT        CLASS END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL        HCOUNT        CLASS END_ATOM
            | BEGIN_ATOM         SYMBOL CHIRAL HCOUNT        CLASS END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL CHIRAL HCOUNT        CLASS END_ATOM
            | BEGIN_ATOM         SYMBOL               CHARGE CLASS END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL               CHARGE CLASS END_ATOM
            | BEGIN_ATOM         SYMBOL CHIRAL        CHARGE CLASS END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL CHIRAL        CHARGE CLASS END_ATOM
            | BEGIN_ATOM         SYMBOL        HCOUNT CHARGE CLASS END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL        HCOUNT CHARGE CLASS END_ATOM
            | BEGIN_ATOM         SYMBOL CHIRAL HCOUNT CHARGE CLASS END_ATOM
            | BEGIN_ATOM ISOTOPE SYMBOL CHIRAL HCOUNT CHARGE CLASS END_ATOM"""

def p_organic_atom(p):
    """atom : ORGANIC"""

def p_error(p):
    raise SyntaxError(p)

Parsing with DFAs

I've also developed a ragel grammar based on the FROWNS approach of using a DFA to transition between different parse states. In the "bond" example below, if in the bond state (after finding a bond token), then test to see if an "atom" or "ring closure" occurs next, which are the only valid possibilities after a bond.

table = {
    "start": ("atom",),

    # CC, C=C, C(C)C, C(C)C, C.C, C1CCC1
    "atom": ("atom", "bond", "close_branch", "open_branch", "dot", "closure"),

    # C=C, C=1CCC=1
    "bond": ("atom", "closure"),

    # C(C)C, C(C)=C, C(C).C, C(C(C))C, C(C)(C)C
    "close_branch": ("atom", "bond", "dot", "close_branch", "open_branch"),

    # C(C), C(=C), C(.C) (really!)
    "open_branch": ("atom", "bond", "dot"),

    # C.C
    "dot": ("atom",),

    # C1CCC1, C1=CCC1, C12CC1C2, C1C(CC)1, C1(CC)CC1, c1ccccc1.[NH4+]
    "closure": ("atom", "bond", "closure", "close_branch", "open_branch", "dot"),
}

This style is much closer to how existing hand-written parsers work. The advantages over the lex/yacc or ANTLR approach is that it's easier to understand and easier to add context specific error reporting. The disadvantage is the difficulty of verifing that the grammar is correct (ie, that it doesn't accept invalid SMILES) and the extra work needed to handle balanced parenthesis.

SMILES parsing performance

I extracted the parser code from OpenBabel as a timing reference. This code only verifies that the syntax of the SMILES is valid. I tested it against 126,002 compounds from the NCI data set (converted by OEChem)

0.10s   Ragel, with "really fast goto-driven FSM"
0.26s   OpenBabel, hand-written parser in C++

46 s FROWNS (regular expressions tokenizer, DFA parser)
90. s PLY grammar LALR(1) parser in Python

110 s OpenBabel, from SMILES to OBMol, including perception

Profiling verifies that 0.4% of the time to convert a SMILES to an OBMol is spent in the parser. 28% is spent in SMARTS matches, used for atom typing and finding the SSSR for aromaticity perception takes 15% of the time. About 25% is just in memory allocation, and another 20% is in data structure overhead.

Low-level parsing of a SMILES string is nowhere near the bottleneck for SMILES parsing, which suggests a way to speed up certain tasks. Consider extracting SMILES from a flat-file based on atomic and bond queries, such as "total charge is positive and contains two or more double bonds". This selection can be done purely on the syntax level, giving an very large speedup.

SMARTS parsers

SMARTS is a more complex language than SMILES. Few toolkits support it, and fewer still support the full grammar.

What's the problem I'm trying to solve?

Two of the top three open source chemical informatics projects do not support the full subset of SMARTS that the underlying graph matching algorithm could handle. (Eg, supporting chirality in the grammar doesn't help if the graph matcher doesn't.) In CDK it's because the new parser hasn't been fully tested. In RDKit it's because of problems implementing the grammar using lex/yacc.

In no toolkit is it easy to add extensions to the grammar. MQL (another line notation for substructure matching) supports matches against user-defined properties of an atom or bond, but experimenting with that idea using most SMARTS parsers is hard.

In OpenBabel the SMARTS matcher used for atom typing takes a large fraction of the runtime. What optimizations could the parser do at the parse tree level?

Making the SMARTS parse tree or graph available as part of the library would help those who develop filters based on SMARTS graphs, outlined earlier.

Implementation approaches

Performance of the SMARTS parser is not an issue so I started with ANTLR3. It supports parse tree rewrite rules, which would be very useful in developing a cross-platform parser with optimizations. Unfortunately ANTLR does not yet handle changing lexer states while parsing. The recommendation from the ANTLR list is to try using ANTLR2.

The SMARTS parser in FROWNS uses the same DFA-based style as its SMILES parser so a ragel implementation should be possible.

Progress and documentation

I write essays on Python, software development, and chemical informatics, available from http://dalkescientific.com/writings/ , which includes progress reports as I develop parsers in different languages.

I believe most chemical informaticians learn programming on the job but don't have the domain-specific training that would make them more effective, because it doesn't exist. Most of my essays are explanatory in nature and designed with that person in mind.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB