ANTLR rules
Previously I showed how to use ANTLR to build an AST from a molecular formula and then evaluate that AST to calculate the molecular weight. For complex grammars it's often useful to work with and transform parse trees, which I'll probably talk about when I get into developing a SMARTS grammar.
For molecular weight calculations, though, there's no reason to generate an intermediate AST. I can calculate the weight during parsing by using actions in the rules. Here's an example of using actions in lexer and parser rules to print something out.

    grammar MolecularFormulaWithPrint;

    options {
        language=Python;
    }

    parse_formula : species* EOF;

    species : ATOM DIGITS? {
      print "Species defined", $ATOM.text,
      # // My first use of Python's new (in 2.5) ternary operator
      print $DIGITS.text if $DIGITS else "default=1"
     } ;

    ATOM
        : 'H' { print "H = 1.00794" }
        | 'C' { print "C = 12.001" }
        // Added 'Cl' to see how that interacts with 'C'
        | 'Cl' { print "Cl = 35.453" }
        | 'O' { print "O = 15.999" }
        | 'S' { print "S = 32.06" }
        ;

    // I need a local variable name so the rule can refer to the match
    DIGITS : count='0' .. '9'+  {print " repeat", $count};

I generated the lexer and the parser as normal:

    java -cp /Users/dalke/Downloads/ANTLRWorks.app/Contents/Resources/Java/antlrworks.jar \
        org.antlr.Tool MolecularFormulaWithPrint.g

Some notes about this grammar. ANTLR does some parsing of the code inside of an action block so while you can use '#' for a Python comment, it interpreted the apostrophe in "Python's" as the start of a string. To work around that I added the leading '//' so ANTLR really thought it was a comment.
I added "Cl" as a possible atom type (it wasn't in the previous code)
because I wanted to see how the lexer handles terms with a common
prefix. You can see how in the syntax diagram:
and in the generated lexer:

    LA1 = self.input.LA(1)
    if LA1 == u'H':
        alt1 = 1
    elif LA1 == u'C':
        LA1_2 = self.input.LA(2)

        if (LA1_2 == u'l') :
            alt1 = 3
        else:
            alt1 = 2
    elif LA1 == u'O':
        alt1 = 4
    elif LA1 == u'S':
        alt1 = 5
    else:
        nvae = NoViableAltException("16:1: ATOM : ( 'H' | 'C' | 'Cl' | 'O' | 'S' );", 1, 0, self.input)

Man! That's going to be some slow code when I get around to doing timings.
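To make the two-character lookahead concrete, here's a minimal stand-alone sketch of the same decision in plain Python. It isn't ANTLR output: the function name predict_atom_alt and the choice to return None instead of raising NoViableAltException are my own assumptions for illustration.

```python
# Hypothetical helper mirroring the generated decision for ATOM.
# Alternative numbers follow the grammar: 1='H', 2='C', 3='Cl', 4='O', 5='S'.
def predict_atom_alt(text, pos=0):
    """Return the ATOM alternative chosen at text[pos], or None if none fits."""
    la1 = text[pos:pos + 1]            # first character of lookahead
    if la1 == "H":
        return 1
    if la1 == "C":
        la2 = text[pos + 1:pos + 2]    # only 'C' needs a second character
        return 3 if la2 == "l" else 2
    if la1 == "O":
        return 4
    if la1 == "S":
        return 5
    return None                        # the generated code builds an exception here


for sample in ("H", "C", "Cl", "CO", "X"):
    print("%s -> %r" % (sample, predict_atom_alt(sample)))
```

Only the 'C' branch looks at a second character, which is how 'Cl' and 'C' can share a prefix without the other atoms paying for it.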
I'm also showing off the new ternary operator in Python 2.5. For the record, I'm against it, but because it's present I need to learn when it's appropriate to use, and I think this is one such case.

    print $DIGITS.text if $DIGITS else "default=1"
    }

is the same as

    if $DIGITS:
        print $DIGITS.text
    else:
        print "default=1"

The DIGITS term is optional, and if it's not present then the associated variable in Python is None. The test prints the count if it's present and otherwise prints "default=1", because 1 is the default count when none is given explicitly.
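Outside of a grammar file the same equivalence can be checked directly. This is a small sketch in plain Python, where the parameter digits_text is my stand-in for $DIGITS.text (None when no DIGITS token matched) and both function names are made up for the example.

```python
def describe_with_expression(digits_text):
    # Conditional expression: value_if_true if condition else value_if_false
    return digits_text if digits_text else "default=1"


def describe_with_statement(digits_text):
    # The same logic written as an ordinary if/else statement
    if digits_text:
        return digits_text
    else:
        return "default=1"


for value in (None, "3", "12"):
    # Both spellings must agree for every input
    assert describe_with_expression(value) == describe_with_statement(value)
    print("%r -> %s" % (value, describe_with_expression(value)))
```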
Continuing on to using the new grammar, my driver code is pretty simple, because I'm not really doing anything except setup and requesting the parse:

    import sys
    import antlr3
    from MolecularFormulaWithPrintParser import MolecularFormulaWithPrintParser
    from MolecularFormulaWithPrintLexer import MolecularFormulaWithPrintLexer

    formula = "CH3COOH"
    if len(sys.argv) > 1:
        formula = sys.argv[1]

    char_stream = antlr3.ANTLRStringStream(formula)
    lexer = MolecularFormulaWithPrintLexer(char_stream)
    tokens = antlr3.CommonTokenStream(lexer)
    parser = MolecularFormulaWithPrintParser(tokens)
    parser.parse_formula()

which with the formula "H2SO4" gives:

    H = 1.00794
     repeat 3
    S = 32.06
    O = 15.999
     repeat 4
    Species defined H 3
    Species defined S default=1
    Species defined O 4

You can see that the lexer actions are executed, at least for this case, before the parser actions.
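One plausible reading of that ordering, assuming the token stream collects all of the tokens before the parser looks at the first one (I haven't checked this against the ANTLR runtime source), is a two-phase pipeline. Here's a toy model of that in plain Python. Nothing in it is ANTLR code; the regular expression and the name run_two_phase are made up for the sketch.

```python
import re

# 'Cl' is listed before 'C' so the longer symbol wins
TOKEN_RE = re.compile(r"(?P<ATOM>Cl|H|C|O|S)|(?P<DIGITS>[0-9]+)")


def run_two_phase(formula):
    """Return a log of the 'actions', with all lexing done before any parsing."""
    log = []

    # Phase 1: tokenize the whole input, logging each token as a lexer action would
    tokens = []
    pos = 0
    while pos < len(formula):
        m = TOKEN_RE.match(formula, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append((m.lastgroup, m.group()))
        log.append("lex %s %s" % (m.lastgroup, m.group()))
        pos = m.end()

    # Phase 2: walk the buffered tokens as  species : ATOM DIGITS?
    i = 0
    while i < len(tokens):
        kind, symbol = tokens[i]
        if kind != "ATOM":
            raise ValueError("expected an atom, got %r" % (symbol,))
        i += 1
        count = "default=1"
        if i < len(tokens) and tokens[i][0] == "DIGITS":
            count = tokens[i][1]
            i += 1
        log.append("parse species %s %s" % (symbol, count))
    return log


for line in run_two_phase("H2SO4"):
    print(line)
```

In this model every "lex" line comes out before the first "parse" line, which matches the shape of the output above.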
Parser rules can return something
A lexer rule always returns a Token. A parser rule by default returns a Tree but I can have it return something else. In this case I want the atom parser to return the atomic weight rather than the atomic symbol. (I don't need to do that. I could use a table lookup on the symbol to get the weight. But the parser already knows which atom it parsed so it feels needless to do that lookup again. As a consequence, the parser loses track of the token location, but there are ways to handle that if needed.)
I need to turn the "ATOM" lexer rule into an "atom" parser rule. In ANTLR, lexer rule names start with an uppercase letter and parser rule names start with a lowercase letter, so the conversion is pretty easy in this case: change the case of the name. It works here because the pattern in the rule is a string. In general that doesn't work. For example, I changed DIGITS to a parser rule and got these warning messages:

    warning(200): MWGrammar.g:10:9: Decision can match input such as "'C'" using multiple alternatives: 1, 2
    As a result, alternative(s) 2 were disabled for that input

I don't know what that means, but I decided not to worry much about it. My general rule will be to keep things in the lexer, because I understand lexers a lot better than grammars.
With the change in place, the grammar is

    grammar MWGrammar;

    options {
        language=Python;
    }

    parse_formula : species* EOF;

    species : atom DIGITS? {
      print "Species defined", $atom.weight,
      print $DIGITS.text if $DIGITS else "default=1"
     } ;

    atom returns [float weight]
        : 'H' { $weight = 1.00794 }
        | 'C' { $weight = 12.001 }
        | 'Cl' { $weight = 35.453 }
        | 'O' { $weight = 15.999 }
        | 'S' { $weight = 32.06 }
        ;

    DIGITS : count='0' .. '9'+ ;

I declared that the 'atom' rule sets a 'weight'. The 'float' is needed because ANTLR supports languages like Java and C++ which need to know the data type of the value returned. The 'weight' is how other rules, like 'species', can get the new value, in this case via $atom.weight. In general an ANTLR rule can declare that it returns multiple values.
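If it helps to think of a parser rule as a function, here's a rough hand-written analogue of 'atom' in plain Python. It's only an illustration of "a rule that returns named values" and not what ANTLR generates: match_atom, AtomResult and the extra 'symbol' field are all names I made up, and the weights are copied from the grammar above.

```python
from collections import namedtuple

# A named result, so callers can write result.weight much like $atom.weight
AtomResult = namedtuple("AtomResult", ["weight", "symbol"])

# 'Cl' comes before 'C' so the longer symbol is tried first
_ATOM_WEIGHTS = (
    ("Cl", 35.453),
    ("H", 1.00794),
    ("C", 12.001),
    ("O", 15.999),
    ("S", 32.06),
)


def match_atom(text, pos=0):
    """Return an AtomResult for the atom starting at text[pos], or None."""
    for symbol, weight in _ATOM_WEIGHTS:
        if text.startswith(symbol, pos):
            return AtomResult(weight=weight, symbol=symbol)
    return None


result = match_atom("Cl2")
print("%s weighs %r" % (result.symbol, result.weight))
```

Returning two fields here is the plain-Python counterpart of a rule that declares more than one return value.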
Using return values from a parser rule
Computing the total molecular weight for a species is very simple. The only difference in the following is the 'species' rule:

    grammar MWGrammar;

    options {
        language=Python;
    }

    parse_formula : species* EOF;

    species : atom DIGITS? {
      count = int($DIGITS.text) if $DIGITS else 1
      species_weight = $atom.weight * count
      print "Species weight", species_weight
     } ;

    atom returns [float weight]
        : 'H' { $weight = 1.00794 }
        | 'C' { $weight = 12.001 }
        | 'Cl' { $weight = 35.453 }
        | 'O' { $weight = 15.999 }
        | 'S' { $weight = 32.06 }
        ;

    DIGITS : count='0' .. '9'+ ;

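As a worked example of what the 'species' action computes, here's the same arithmetic for H2SO4 in plain Python. The function species_weight and its digits_text parameter are hypothetical stand-ins for the action and for $DIGITS.text. One difference to keep in mind: the grammar tests the token itself, while this sketch tests the text.

```python
def species_weight(atom_weight, digits_text=None):
    # No DIGITS token means a count of 1, as in the grammar action
    count = int(digits_text) if digits_text else 1
    return atom_weight * count


# H2SO4 as (atomic weight, count text) pairs, using the weights from the grammar
h2so4 = [(1.00794, "2"), (32.06, None), (15.999, "4")]

weights = [species_weight(weight, digits) for weight, digits in h2so4]
total = sum(weights)

for (weight, digits), result in zip(h2so4, weights):
    print("%r x %s = %r" % (weight, digits or "1", result))
print("total is about %.5f" % total)
```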
Using an @init action
Next I'll make "species" return a value, a float named "species_weight". But how do I access it inside of parse_formula? The definition is

    parse_formula : species* EOF;

so how do I get an action executed once for every time 'species' matches? The answer is very elegant. I can attach an action to part of the expression, like this:

    parse_formula : (species { print "species", $species.species_weight})* EOF;

This executes the action for each 'species' that matches. The action is inside the "*", so the match and action are done 0 or more times. The new grammar is:

    grammar MWGrammar;

    options {
        language=Python;
    }

    parse_formula : (species { print "species", $species.species_weight})* EOF;

    species returns [float species_weight]
        : atom DIGITS? {
          count = int($DIGITS.text) if $DIGITS else 1
          $species_weight = $atom.weight * count
        }
        ;

    atom returns [float weight]
        : 'H' { $weight = 1.00794 }
        | 'C' { $weight = 12.001 }
        | 'Cl' { $weight = 35.453 }
        | 'O' { $weight = 15.999 }
        | 'S' { $weight = 32.06 }
        ;

    DIGITS : count='0' .. '9'+ ;

The last step is to sum each of the species weights into a total molecular weight and return that sum. I'm going to rename "parse_formula" to "calculate_mw" and have it return a "mw", so the rule becomes

    calculate_mw returns [float mw]
        : (species { $mw += $species.species_weight})* EOF
        ;

Don't forget to change the driver code! My new driver ends:

    ...
    tokens = antlr3.CommonTokenStream(lexer)
    parser = MWGrammarParser(tokens)
    print "MW is", parser.calculate_mw()

Okay, does it work? Err, ummm, no.

    Traceback (most recent call last):
      File "compute_mw2.py", line 14, in <module>
        print "MW is", parser.calculate_mw()
      File "/Users/dalke/src/dayparsers/MWGrammarParser.py", line 65, in calculate_mw
        mw += species1
    TypeError: unsupported operand type(s) for +=: 'NoneType' and 'float'

Taking a look at MWGrammarParser:

    def calculate_mw(self, ):

        mw = None

        species1 = None

Ahh, the default value of 'mw' is None, and I want it to be 0.0. I want to set the value before any of the other actions run, which I can do with an "@init" action. That's a special directive to ANTLR. There's also '@after' for adding code after all of the rule code. With the @init in place, here's the code:

    grammar MWGrammar;

    options {
        language=Python;
    }

    calculate_mw returns [float mw]
    @init {
      $mw = 0.0
    }
        : (species { $mw += $species.species_weight})* EOF
        ;

    species returns [float species_weight]
        : atom DIGITS? {
          count = int($DIGITS.text) if $DIGITS else 1
          $species_weight = $atom.weight * count
        }
        ;

    atom returns [float weight]
        : 'H' { $weight = 1.00794 }
        | 'C' { $weight = 12.001 }
        | 'Cl' { $weight = 35.453 }
        | 'O' { $weight = 15.999 }
        | 'S' { $weight = 32.06 }
        ;

    DIGITS : count='0' .. '9'+ ;

and the driver code, which includes some self-tests. (I didn't quite feel like making it work under unittest or py.test or similar code.)

    import sys
    import antlr3
    from MWGrammarParser import MWGrammarParser
    from MWGrammarLexer import MWGrammarLexer

    formula = "H2SO4"
    if len(sys.argv) > 1:
        formula = sys.argv[1]

    def calculate_mw(formula):
        char_stream = antlr3.ANTLRStringStream(formula)
        lexer = MWGrammarLexer(char_stream)
        tokens = antlr3.CommonTokenStream(lexer)
        parser = MWGrammarParser(tokens)
        return parser.calculate_mw()

    print "MW is", calculate_mw(formula)

    print "Running self-tests"

    # Run random tests to validate the parser and results
    _mw_table = {
        'H': 1.00794,
        'C': 12.001,
        'Cl': 35.453,
        'O': 15.999,
        'S': 32.06,
        }

    # Generate a random molecular formula and calculate
    # its molecular weight. Yield the weight and formula
    def _generate_random_formulas():
        import random
        # Using semi-random values so I can check a wide space
        # Possible number of terms in the formula
        _possible_lengths = (1, 2, 3, 4, 5, 10, 53, 104)
        # Possible repeat count for each formula
        _possible_counts = tuple(range(12)) + (88, 91, 106, 107, 200, 1234)
        # The available element names
        _element_names = _mw_table.keys()
        for i in range(1000):
            terms = []
            total_mw = 0.0
            # Use a variety of lengths
            for j in range(random.choice(_possible_lengths)):
                symbol = random.choice(_element_names)
                terms.append(symbol)
                count = random.choice(_possible_counts)
                # Sometimes leave out an explicit count of 1
                if count == 1 and random.randint(0, 2) == 1:
                    pass
                else:
                    terms.append(str(count))
                total_mw += _mw_table[symbol] * count
            yield total_mw, "".join(terms)

    _selected_formulas = [
        (0.0, ""),
        (1.00794, "H"),
        (1.00794, "H1"),
        (32.06, "S"),
        (12.001+1.00794*4, "CH4"),
        ]

    for expected_mw, formula in (_selected_formulas +
                                 list(_generate_random_formulas())):
        got_mw = calculate_mw(formula)
        if expected_mw != got_mw:
            raise AssertionError("%r expected %r got %r" %
                                 (formula, expected_mw, got_mw))


    % python calculate_mw.py H2O
    MW is 18.01488
    Running self-tests

Whaddya know, it works!
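For comparison, here's roughly what the finished grammar amounts to if I write it by hand in plain Python with a regular expression instead of ANTLR. This is a sketch for cross-checking, not generated code: calculate_mw_by_hand and the regular expression are my own, and the weights are copied from the grammar. The two comments mark where @init and the action inside the "*" loop end up.

```python
import re

_MW_TABLE = {"H": 1.00794, "C": 12.001, "Cl": 35.453, "O": 15.999, "S": 32.06}

# One species: an atom symbol ('Cl' before 'C') and an optional count
_SPECIES_RE = re.compile(r"(Cl|H|C|O|S)([0-9]+)?")


def calculate_mw_by_hand(formula):
    mw = 0.0                        # the job done by @init { $mw = 0.0 }
    pos = 0
    while pos < len(formula):       # the ( species { ... } )* loop
        m = _SPECIES_RE.match(formula, pos)
        if m is None:
            raise ValueError("cannot parse %r at position %d" % (formula, pos))
        symbol, digits = m.groups()
        count = int(digits) if digits else 1
        mw += _MW_TABLE[symbol] * count   # the action inside the loop
        pos = m.end()
    return mw


for formula in ("", "H", "CH4", "H2O", "H2SO4"):
    print("%r -> %r" % (formula, calculate_mw_by_hand(formula)))
```

Starting mw at 0.0 also gives the empty formula a sensible weight of 0.0, which matches the first entry in the self-test table.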
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.

Copyright © 2001-2020 Andrew Dalke Scientific AB