Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2007/11/01/antlr_rules

ANTLR rules

Parts 1, 2, 3, and comments.

Previously I showed how to use ANTLR to build an AST from a molecular formula and then evaluate that AST to calculate the molecular weight. For complex grammars it's often useful to work with and transform parse trees, which I'll probably talk about when I get into developing a SMARTS grammar.

For doing molecular weight calculations though, there's no reason to generate an intermediate AST. I can calculate the weight during the parsing by using action rules. Here's an example of using actions in lexer and parser rules to print something out.

grammar MolecularFormulaWithPrint;

options {
	language=Python;
}

parse_formula : species* EOF;

species	
	: ATOM DIGITS? {
		print "Species defined", $ATOM.text,
		# // My first use of Python's new (in 2.5) ternary operator
		print $DIGITS.text if $DIGITS else "default=1" }
	;

ATOM
	: 'H' { print "H = 1.00794" }
	| 'C' { print "C = 12.001" }
	// Added 'Cl' to see how that interacts with 'C'
	| 'Cl' { print "Cl = 35.453" }
	| 'O' { print "O = 15.999" }
	| 'S' { print "S = 32.06" }
	;

// I need a local variable name so the rule can refer to the match
DIGITS	: count='0' .. '9'+ {print "  repeat", $count};
I generated the lexer and the parser as normal:
java -cp /Users/dalke/Downloads/ANTLRWorks.app/Contents/Resources/Java/antlrworks.jar \
  org.antlr.Tool MolecularFormulaWithPrint.g

Some notes about this grammar. ANTLR does some parsing of the code inside an action block, so while I can use '#' for a Python comment, ANTLR still interpreted the apostrophe in "Python's" as the start of a string. To work around that I added the leading '//' so ANTLR itself treats the rest of the line as a comment.

I added "Cl" as a possible atom type (it wasn't in the previous code) because I wanted to see how the lexer handles terms with a common prefix. You can see how in the syntax diagram (figure: syntax diagram for the ATOM rule) and in the generated lexer:

            LA1 = self.input.LA(1)
            if LA1 == u'H':
                alt1 = 1
            elif LA1 == u'C':
                LA1_2 = self.input.LA(2)

                if (LA1_2 == u'l') :
                    alt1 = 3
                else:
                    alt1 = 2
            elif LA1 == u'O':
                alt1 = 4
            elif LA1 == u'S':
                alt1 = 5
            else:
                nvae = NoViableAltException("16:1: ATOM : ( 'H' | 'C' | 'Cl' | 'O' | 'S' );", 1, 0, self.input)
Man! That's going to be some slow code when I get around to doing timings.
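
The decision in that generated fragment boils down to one character of lookahead, plus a second character only when the first is 'C'. Here is a minimal stand-alone sketch of the same idea in plain Python. The helper name `predict_atom_alt` is hypothetical and written only for illustration; it is not something ANTLR generates or provides.

```python
def predict_atom_alt(text, pos=0):
    """Return the 1-based ATOM alternative to take at text[pos], or None.

    Alternatives are numbered as in the grammar:
    1='H', 2='C', 3='Cl', 4='O', 5='S'.
    """
    la1 = text[pos:pos + 1]          # first lookahead character ('' at end of input)
    if la1 == "H":
        return 1
    if la1 == "C":
        la2 = text[pos + 1:pos + 2]  # second lookahead character, only needed for 'C'
        return 3 if la2 == "l" else 2
    if la1 == "O":
        return 4
    if la1 == "S":
        return 5
    return None                      # the generated lexer raises NoViableAltException here
```

With this sketch, `predict_atom_alt("CH4")` picks alternative 2 ('C') while `predict_atom_alt("Cl2")` picks alternative 3 ('Cl'), which is the same split the generated code makes with `LA(2)`.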

I'm also showing off the new ternary operator in Python 2.5. For the record, I'm against it, but because it's present I need to learn when it's appropriate to use, and I think this is one such case.

    print $DIGITS.text if $DIGITS else "default=1"
is the same as
    if $DIGITS:
        print $DIGITS.text
    else:
        print "default=1"
The DIGITS term is optional, and if it's not present then the associated variable in Python is None. What this test does is print the count if it's present and otherwise print "default=1", because 1 is the default count if not explicitly given.
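
Outside of a grammar the same pattern can be tried in plain Python. This is a small sketch under one assumption: the argument `digits_text` stands in for `$DIGITS.text` and is None when no digits were matched. Both function names are made up for this example.

```python
def describe_count(digits_text):
    # Conditional expression: <value if true> if <test> else <value if false>
    return digits_text if digits_text is not None else "default=1"


def count_value(digits_text):
    # Same shape, but producing the integer repeat count instead of a label
    return int(digits_text) if digits_text is not None else 1
```

Testing against None explicitly, rather than relying on truthiness, keeps the intent obvious: the fallback applies only when the optional term is missing.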

Continuing on to using the new grammar, my driver code is pretty simple, because I'm not really doing anything except setup and requesting the parse:

import sys
import antlr3
from MolecularFormulaWithPrintParser import MolecularFormulaWithPrintParser
from MolecularFormulaWithPrintLexer import MolecularFormulaWithPrintLexer

formula = "CH3COOH"
if len(sys.argv) > 1:
    formula = sys.argv[1]
    
char_stream = antlr3.ANTLRStringStream(formula)
lexer = MolecularFormulaWithPrintLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = MolecularFormulaWithPrintParser(tokens)
parser.parse_formula()
which with the formula "H3SO4" gives:
H = 1.00794
  repeat 3
S = 32.06
O = 15.999
  repeat 4
Species defined H 3
Species defined S default=1
Species defined O 4
You can see that the lexer actions are executed, at least for this case, before the parser actions.
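
That ordering is what you would expect if the token stream collects the lexer's output before the parser starts consuming it. That is an assumption about how ANTLR 3's CommonTokenStream behaves, not something verified here. The two-phase shape can be sketched in plain Python. Everything in this block (`TOKEN_RE`, `trace_parse`, the event strings) is hypothetical illustration code, with no error handling for malformed formulas.

```python
import re

# 'Cl' is listed before the single letters so the longer symbol wins
TOKEN_RE = re.compile(r"Cl|[HCOS]|[0-9]+")


def trace_parse(formula):
    """Return the order of lexer and parser events for a formula."""
    events = []

    # Phase 1: the "lexer" runs over the whole input and records each token
    tokens = []
    for match in TOKEN_RE.finditer(formula):
        events.append("lex " + match.group())
        tokens.append(match.group())

    # Phase 2: the "parser" walks the buffered tokens as: atom DIGITS?
    i = 0
    while i < len(tokens):
        atom = tokens[i]
        i += 1
        if i < len(tokens) and tokens[i].isdigit():
            count = tokens[i]
            i += 1
        else:
            count = "default=1"
        events.append("species %s %s" % (atom, count))
    return events
```

For "H2O" all of the "lex" events come first and the "species" events follow, mirroring the shape of the output above.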

Parser rules can return something

A lexer rule always returns a Token. A parser rule by default returns a Tree but I can have it return something else. In this case I want the atom parser rule to return the atomic weight rather than the atomic symbol. (I don't need to do that. I could use a table lookup on the symbol to get the weight. But the parser already knows which atom it parsed so it feels needless to do that lookup again. As a consequence, the parser loses track of the token location, but there are ways to handle that if needed.)

I need to turn the "ATOM" lexer rule into an "atom" parser rule. In ANTLR, lexer rule names start with an uppercase letter and parser rule names start with a lowercase letter, so the conversion is pretty easy in this case: change the case of the name. It works here because the pattern in the rule is a string. In general that doesn't work. For example, I changed DIGITS to a parser rule and got these warning messages:

warning(200): MWGrammar.g:10:9: Decision can match input such as "'C'"
using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
I don't know what that means, but I decided not to worry much about it. My general rule will be to keep things in the lexer, because I understand lexers a lot better than grammars.

With the change in place, the grammar is

grammar MWGrammar;

options {
	language=Python;
}

parse_formula : species* EOF;

species	
	: atom DIGITS? {
		print "Species defined", $atom.weight,
		print $DIGITS.text if $DIGITS else "default=1" }
	;

atom returns [float weight]
	: 'H' { $weight = 1.00794 }
	| 'C' { $weight = 12.001 }
	| 'Cl' { $weight = 35.453 }
	| 'O' { $weight = 15.999 }
	| 'S' { $weight = 32.06 }
	;

DIGITS	: count='0' .. '9'+ ;
I declared that the 'atom' rule sets a 'weight'. The 'float' is needed because ANTLR supports languages like Java and C++ which need to know the data type of the value returned. The 'weight' is how other rules, like 'species', can get the new value, in this case via $atom.weight. In general an ANTLR rule can declare that it returns multiple values.
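
To make the idea of a rule that returns a value concrete, here is a hand-written counterpart of the 'atom' rule in plain Python. It is only a sketch: the function name, its `(text, pos)` signature, and the choice to return the new position alongside the weight are all assumptions made for this example, and it is not what ANTLR generates.

```python
def atom(text, pos):
    """Match one atom at text[pos]; return (weight, new_pos)."""
    weight = None                     # nothing matched yet
    if text.startswith("Cl", pos):    # test 'Cl' before 'C' so the longer symbol wins
        weight, pos = 35.453, pos + 2
    elif text.startswith("H", pos):
        weight, pos = 1.00794, pos + 1
    elif text.startswith("C", pos):
        weight, pos = 12.001, pos + 1
    elif text.startswith("O", pos):
        weight, pos = 15.999, pos + 1
    elif text.startswith("S", pos):
        weight, pos = 32.06, pos + 1
    if weight is None:
        raise ValueError("no atom at position %d of %r" % (pos, text))
    return weight, pos
```

Each branch plays the role of one alternative with its `{ $weight = ... }` action, and the caller reads the result much like 'species' reads `$atom.weight`.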

Using return values from a parser rule

Computing the total molecular weight for a species is very simple. The only difference in the following is the 'species' rule:

grammar MWGrammar;

options {
	language=Python;
}

parse_formula : species* EOF;

species	
	: atom DIGITS? {
		count = int($DIGITS.text) if $DIGITS else 1
		species_weight = $atom.weight * count
		print "Species weight", species_weight
		}
	;

atom returns [float weight]
	: 'H' { $weight = 1.00794 }
	| 'C' { $weight = 12.001 }
	| 'Cl' { $weight = 35.453 }
	| 'O' { $weight = 15.999 }
	| 'S' { $weight = 32.06 }
	;

DIGITS	: count='0' .. '9'+ ;

Using an @init action

Next I'll make "species" return a value, a float named "species_weight". But how do I access it inside of parse_formula? The definition is

parse_formula : species* EOF;
so how do I get an action executed once for every time 'species' matches? The answer is very elegant. I can attach an action to part of the expression, like this:
parse_formula : (species { print "species", $species.species_weight})* EOF;
That executes the action for each 'species' that matches. The action is inside the "*" group so the match and action are done 0 or more times. The new grammar is:
grammar MWGrammar;

options {
	language=Python;
}

parse_formula : (species { print "species", $species.species_weight})* EOF;

species	returns [float species_weight]
	: atom DIGITS? {
		count = int($DIGITS.text) if $DIGITS else 1
		$species_weight = $atom.weight * count
		}
	;

atom returns [float weight]
	: 'H' { $weight = 1.00794 }
	| 'C' { $weight = 12.001 }
	| 'Cl' { $weight = 35.453 }
	| 'O' { $weight = 15.999 }
	| 'S' { $weight = 32.06 }
	;

DIGITS	: count='0' .. '9'+ ;
The last step is to sum the species weights into a total molecular weight and return that sum. I'm going to rename "parse_formula" to "calculate_mw" and have it return "mw", so the rule becomes
calculate_mw returns [float mw]
	: (species { $mw += $species.species_weight})* EOF
	;
Don't forget to change the driver code! My new driver ends:
 ...
tokens = antlr3.CommonTokenStream(lexer)
parser = MWGrammarParser(tokens)
print "MW is", parser.calculate_mw()

Okay, does it work? Err, ummm, no.

Traceback (most recent call last):
  File "compute_mw2.py", line 14, in <module>
    print "MW is", parser.calculate_mw()
  File "/Users/dalke/src/dayparsers/MWGrammarParser.py", line 65, in calculate_mw
    mw += species1
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'float'
Taking a look at MWGrammarParser:
    def calculate_mw(self, ):

        mw = None

        species1 = None
Ahh, the default value of 'mw' is None, and I want it to be 0.0. I want to set the value before any of the other actions run, which I can do with an "@init" action. That's a special directive to ANTLR. There's also '@after' for adding code after all of the rule code. With the @init in place, here's the code
grammar MWGrammar;

options {
	language=Python;
}

calculate_mw returns [float mw]
@init {
  $mw = 0.0
}
	: (species { $mw += $species.species_weight})* EOF
	;

species	returns [float species_weight]
	: atom DIGITS? {
		count = int($DIGITS.text) if $DIGITS else 1
		$species_weight = $atom.weight * count
		}
	;

atom returns [float weight]
	: 'H' { $weight = 1.00794 }
	| 'C' { $weight = 12.001 }
	| 'Cl' { $weight = 35.453 }
	| 'O' { $weight = 15.999 }
	| 'S' { $weight = 32.06 }
	;

DIGITS	: count='0' .. '9'+ ;
and the driver code, which includes some self-tests. (I didn't quite feel like making it work under unittest or py.test or similar code.)
import sys
import antlr3
from MWGrammarParser import MWGrammarParser
from MWGrammarLexer import MWGrammarLexer

formula = "H2SO4"
if len(sys.argv) > 1:
    formula = sys.argv[1]

def calculate_mw(formula):
    char_stream = antlr3.ANTLRStringStream(formula)
    lexer = MWGrammarLexer(char_stream)
    tokens = antlr3.CommonTokenStream(lexer)
    parser = MWGrammarParser(tokens)
    return parser.calculate_mw()

print "MW is", calculate_mw(formula)

print "Running self-tests"
# Run random tests to validate the parser and results
_mw_table = {
    'H': 1.00794,
    'C': 12.001,
    'Cl': 35.453,
    'O': 15.999,
    'S': 32.06,
}
# Generate a random molecular formula and calculate
# its molecular weight.  Yield the weight and formula
def _generate_random_formulas():
    import random
    # Using semi-random values so I can check a wide space
    # Possible number of terms in the formula
    _possible_lengths = (1, 2, 3, 4, 5, 10, 53, 104)
    # Possible repeat count for each term
    _possible_counts = tuple(range(12)) +  (88, 91, 106, 107, 200, 1234)
    # The available element names
    _element_names = _mw_table.keys()
    for i in range(1000):
        terms = []
        total_mw = 0.0
        # Use a variety of lengths
        for j in range(random.choice(_possible_lengths)):
            symbol = random.choice(_element_names)
            terms.append(symbol)
            count = random.choice(_possible_counts)
            if count == 1 and random.randint(0, 2) == 1:
                pass
            else:
                terms.append(str(count))

            total_mw += _mw_table[symbol] * count
        yield total_mw, "".join(terms)

_selected_formulas = [
    (0.0, ""),
    (1.00794, "H"),
    (1.00794, "H1"),
    (32.06, "S"),
    (12.001+1.00794*4, "CH4"),
    ]
for expected_mw, formula in (_selected_formulas +
                             list(_generate_random_formulas())):
    got_mw = calculate_mw(formula)
    if expected_mw != got_mw:
        raise AssertionError("%r expected %r got %r" %
                             (formula, expected_mw, got_mw))

% python calculate_mw.py H2O
MW is 18.01488
Running self-tests
Whaddya know, it works!
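
For comparison, the moving parts of the final grammar can be mirrored by hand in a few lines of plain Python. This is a sketch rather than a replacement for the grammar: it uses a regular expression and a table lookup, which the grammar deliberately avoids, and the names `_WEIGHTS`, `_SPECIES_RE`, and `calculate_mw_by_hand` are invented for the example.

```python
import re

_WEIGHTS = {"H": 1.00794, "C": 12.001, "Cl": 35.453, "O": 15.999, "S": 32.06}

# One species is an atom followed by optional digits; 'Cl' comes before 'C'
_SPECIES_RE = re.compile(r"(Cl|H|C|O|S)([0-9]+)?")


def calculate_mw_by_hand(formula):
    mw = 0.0                              # the job done by @init; starting at None fails on +=
    pos = 0
    while pos < len(formula):             # the ( species {...} )* loop
        match = _SPECIES_RE.match(formula, pos)
        if match is None:
            raise ValueError("cannot parse %r at position %d" % (formula, pos))
        symbol, digits = match.groups()
        count = int(digits) if digits else 1
        mw += _WEIGHTS[symbol] * count    # the embedded action, run once per species
        pos = match.end()
    return mw
```

Because it accumulates the terms in the same left-to-right order as the grammar's action, it can be checked against the same expected values, for example `calculate_mw_by_hand("CH4") == 12.001 + 1.00794*4`.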


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB