Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2007/10/30/parser_generators_and_smiles

Parser generators and SMILES

If you look at the SMILES parsers for the different chemical informatics libraries you'll see that each is hand-written. Here's the Open Babel code for parsing "C", "Cl", "N", or "O" when using the short-hand notation for an atom in the organic subset:

    if (isupper(*_ptr))
        case 'C':
          if (*_ptr == 'l')
              element = 17;
              symbol[0] = 'C';
              element = 6;

        case 'N':
          element = 7;
          symbol[0] = 'N';
        case 'O':
          element = 8;
          symbol[0] = 'O';

Very tedious, low-level code. The CDK and RDKit libraries have similar code, as do the commercial and proprietary Daylight and OpenEye libraries. (I know this based on knowing the people involved in both projects.)

Why hand-write code like this when parser generators exist? I think there are two reasons. Parser generators are difficult to use. There's a complex theory behind how they work, and even after taking a course in grammar theory it's not always easy to see how to apply that in practice. Most developers in chemical informatics come from a chemistry background, and don't even have that training. I speak from experience - I have a CS (and physics and math) background and took a course on language theory, but despite that I still found it hard. Especially since my first attempts used yacc; and to this day I don't know how to to do error handling and prevent memory leaks in a yacc-generated parser. I loath using yacc.

The second reason the different packages write the parsers by hand is for speed. A SMILES string isn't that complicated, so a hand-written parser should go pretty fast. Perhaps even faster than a machine-generated one. Some of the CS-trained people who come into this field are really into performance, so why use a parser generator when it's "probably going to be slow"?

Using a parser generator means the code is easier to write and maintain. The OpenBabel code contains some read overruns which are *not* serious because they can't trigger any problems, but are nevertheless annoying. Grammar information can't be shared between the different projects. (While each parser generator uses a different syntax, the underlying semantics are pretty consistent so it's relatively easy to change from one format to another.)

On top of that, and in theory, a parser generator can be faster than hand-written code. I did a rough test extracting just the parser from Open Babel (less the data structure building code) and a lexer using ragel. Here are my results:

 using a ragel definition with defaults - 0.585u 0.009s time
 stripped out OB parser -                 0.264u 0.008s
 ragel w/ "Faster flat table-driven FSM"  0.193u 0.010s
 ragel w/ "Really fast goto-driven FSM"   0.094u 0.008s
You see that the hand-written code is more than twice as slow as the best auto-generated lexer code, so long as the right options are enabled.

I'm interested in two things about parser generated SMILES parsers: 1) How fast are they, and can they beat hand-written parsers, and 2) can the parser generators with multiple language support help the OpenSMILES project?


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB