Dalke Scientific Software: More science. Less time. Products
[ previous | next ]     /home/writings/diary/archive/2007/11/02/antlr_java

Once again, in Java

Parts 1, 2, 3, and comments.

I've been writing a series of articles on parsing a molecular formula (like H2SO4) using ANTLR as a parser. See the first and second essays.

There are several reasons to use ANTLR over one of the Python parsers like PLY and PyParsing. The GUI interface is very nice, it has a sophisticated understanding of how to define a grammar, it's top-down approach means the generated code is easier to understand, tree-grammers simplify some otherwise tedious parts of writing parsers, and the one I'll talk about now - it supports multiple output languages.

I chose to generate Python code but ANTLR also supports generating parsers in C, C# and Java. Python is actually not quite as supported as those three languages, and there's some support for yet other languages.

Converting a grammar to a new language is straight-forward: rewrite the action code in the new language. To show you what I mean, here's the molecular weight evaluator in Python, from earlier:

grammar MWGrammar;

options {
	language=Python;
}

calculate_mw returns [float mw]
@init {
  $mw = 0.0
}
	: (species { $mw += $species.species_weight})* EOF
	;

species	returns [float species_weight]
	: atom DIGITS? {
		count = int($DIGITS.text) if $DIGITS else 1
		$species_weight = $atom.weight * count
		}
	;

atom returns [float weight]
	: 'H' { $weight = 1.00794 }
	| 'C' { $weight = 12.001 }
	| 'Cl' { $weight = 35.453 }
	| 'O' { $weight = 15.999 }
	| 'S' { $weight = 32.06 }
	;

DIGITS	: count='0' .. '9'+ ;
and here's it is rewritten for Java:
grammar JMWGrammar;

options {
	language=Java;
}

calculate_mw returns [double mw]
@init {
  $mw = 0.0;
}
	: (species { $mw += $species.species_weight;} )* EOF
	;

species	returns [double species_weight]
@init {
  int count = 0;
}
	: atom DIGITS? {
		if ($DIGITS == null) {
			count = 1;
		} else {
			count = Integer.parseInt($DIGITS.text);
		}
		$species_weight = $atom.weight * count;
		}
	;

atom returns [double weight]
	: 'H' { $weight = 1.00794; }
	| 'C' { $weight = 12.001; }
	| 'Cl' { $weight = 35.453; }
	| 'O' { $weight = 15.999; }
	| 'S' { $weight = 32.06; }
	;

DIGITS	: '0' .. '9'+ ;
I added a bunch of semicolons, changed a few function name lookups, and used Java's 'null' instead of Python's None. I also decided not to use Java's ternary operator and instead have an 'if' statement. Oh, and I changed everything to 'double' instead of 'float', and had to declare a type for the 'count' variable. I suppose I should go back to the Python grammar and change everything to 'double' there, but for the Python code it doesn't actually matter.

It's nice I can do that, but there's almost certainly a price to pay. The generated Python code looks slow. There are a lot of function calls, which are cheap in Java but expensive in Python. What I'll do for my next essay is compare times between the ANTLR generated Python parser and one using PLY.

Comments?


Copyright © 2001-2007 Dalke Scientific Software, LLC.