Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2006/08/29/parser_generators_and_decorators

Parser Generators and Decorators

Years ago John Aycock's SPARK introduced the idea of using docstring to store the pattern and rule definitions for a parser generator. Rather, that's where I first heard of it; someone else could have been first.

Here's part of the scanner definition from Biopython for parsing GenBank locations:

    def t_caret(self, input):
        r" \^ "
        self.rv.append(Token("caret"))
    def t_comma(self, input):
        r" \, "
        self.rv.append(Token("comma"))
    def t_integer(self, input):
        r" -?[0-9]+ "
        self.rv.append(Integer(int(input)))
Here's part of the parser
    def p_base_position(self, args):
        """
        base_position ::= integer
        base_position ::= low_base_bound
        base_position ::= high_base_bound
        base_position ::= two_base_bound
        """
        return args[0]

    def p_low_base_bound(self, args):
        """
        low_base_bound ::= greater_than integer
        """
        return LowBound(args[1])

The idea of using the docstring was rather cute but a bit of a hack. That was the easiest way to associate data with a function. Modern Python has decorators and function object attributes are writeable. Has anyone revisted the approach except using a decorator instead of a docstring?

The above two examples would look something like this:

    @exact_token("^")
    def caret(self, input):
        self.rv.append(Token("caret"))

    @exact_token(",")
    def comma(self, input):
        self.rv.append(Token("comma"))

    @token(r"-?[0-9]+")
    def integer(self, input):
        self.rv.append(Integer(int(input)))
where I made things a bit easier for authors by having a decorator for exact string matches and a decorator for re-based tokenization.

the parser code would look like

    @rule("base_position", "integer")
    @rule("base_position", "low_base_bound")
    @rule("base_position", "high_base_bound")
    @rule("base_position", "two_base_bound")
    def base_position(self, args):
        return args[0]

    @rule("low_base_bound", "greater_than integer")
    def low_base_bound(self, args):
        return LowBound(args[1])
or possibly allow a list for the rhs part of the rule, as in
    @rule("base_position", ["integer",
                            "low_base_bound,
                            "high_base_bound",
                            "two_base_bound"])
    def base_position(self, args):
        return args[0]

    @rule("low_base_bound", "greater_than integer")
    def low_base_bound(self, args):
        return LowBound(args[1])

I looked at SPARK for about 2 minutes to figure out if it was doable. It is, but would take a couple hours to do. SPARK is stable, meaning the code hasn't changed in years. John has updated the page recently but only to add a reference to another parsing project. I don't know if he's interested in this and will take diffs.

PLY has a similar structure to SPARK and decorators would work there. Perhaps that's a better choice.

Then again, I want to try out PyParsing, which looks like it uses ideas similar to that in my Martel project, which I took from Greg Ewing's Plex. I recall that ?!ng was also an early advocate of that style but would need to dig through archives for verification. (In other words, I was a late-comer to the game and make no suggestion that PyParsing based anything on Martel.)



Copyright © 2001-2008 Dalke Scientific Software, LLC.