Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2006/08/29/parser_generators_and_decorators

Parser Generators and Decorators

Years ago John Aycock's SPARK introduced the idea of using docstring to store the pattern and rule definitions for a parser generator. Rather, that's where I first heard of it; someone else could have been first.

Here's part of the scanner definition from Biopython for parsing GenBank locations:

    def t_caret(self, input):
        r" \^ "
        self.rv.append(Token("caret"))
    def t_comma(self, input):
        r" \, "
        self.rv.append(Token("comma"))
    def t_integer(self, input):
        r" -?[0-9]+ "
        self.rv.append(Integer(int(input)))
Here's part of the parser
    def p_base_position(self, args):
        """
        base_position ::= integer
        base_position ::= low_base_bound
        base_position ::= high_base_bound
        base_position ::= two_base_bound
        """
        return args[0]

    def p_low_base_bound(self, args):
        """
        low_base_bound ::= greater_than integer
        """
        return LowBound(args[1])

The idea of using the docstring was rather cute but a bit of a hack. That was the easiest way to associate data with a function. Modern Python has decorators and function object attributes are writeable. Has anyone revisted the approach except using a decorator instead of a docstring?

The above two examples would look something like this:

    @exact_token("^")
    def caret(self, input):
        self.rv.append(Token("caret"))

    @exact_token(",")
    def comma(self, input):
        self.rv.append(Token("comma"))

    @token(r"-?[0-9]+")
    def integer(self, input):
        self.rv.append(Integer(int(input)))
where I made things a bit easier for authors by having a decorator for exact string matches and a decorator for re-based tokenization.

the parser code would look like

    @rule("base_position", "integer")
    @rule("base_position", "low_base_bound")
    @rule("base_position", "high_base_bound")
    @rule("base_position", "two_base_bound")
    def base_position(self, args):
        return args[0]

    @rule("low_base_bound", "greater_than integer")
    def low_base_bound(self, args):
        return LowBound(args[1])
or possibly allow a list for the rhs part of the rule, as in
    @rule("base_position", ["integer",
                            "low_base_bound,
                            "high_base_bound",
                            "two_base_bound"])
    def base_position(self, args):
        return args[0]

    @rule("low_base_bound", "greater_than integer")
    def low_base_bound(self, args):
        return LowBound(args[1])

I looked at SPARK for about 2 minutes to figure out if it was doable. It is, but would take a couple hours to do. SPARK is stable, meaning the code hasn't changed in years. John has updated the page recently but only to add a reference to another parsing project. I don't know if he's interested in this and will take diffs.

PLY has a similar structure to SPARK and decorators would work there. Perhaps that's a better choice.

Then again, I want to try out PyParsing, which looks like it uses ideas similar to that in my Martel project, which I took from Greg Ewing's Plex. I recall that ?!ng was also an early advocate of that style but would need to dig through archives for verification. (In other words, I was a late-comer to the game and make no suggestion that PyParsing based anything on Martel.)


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB