Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2004/04/12/site_updated

Parsing - oh the humanity.

I got pretty frustrated working on this project for one of my clients. Not with them but with the code I wrote.

They have some commercial analyis code that generates a classification model and produces a SAS script as output. The script is used to make predictions based on the model.

They want to use this model on the server, which doesn't have SAS but does have Python (the server is written in Python :). Looking at it, the generated script produced code which is similar in style to C. There's a lot of gotos, a somewhat strange call syntax which reminds me of GOSUB, and a couple other minor differences, but nothing exotic.

I figured it should be relatively simple to translate into a C extension module for Python. It took me four attempts. But did I learn anything new? Here are the attempts.

Attempt 1: I've done a huge amount of parsing. It's the bread and butter of most of my jobs and contracts. (Sad, but true. One of the reasons I think people want XML is because after a decade of parsing yet another file format it starts to get boring.) Most of the time I use a bunch of regexps, with plenty of sanity checks to make sure the file formats don't stray too much from my expectations. In fact, usually I'm very stringent in the testing. This means my code breaks loudly when the format changes from underneath me, as compared to breaking silently.

This case is different. Most of the data files I read aren't designed for automatic context-free parser generators so trying to get the normal lexer/parser combo working with it is hard. There's lots of state to communicate between them. But in this case I was seduced by the data being in the SAS language, and designed for parsing.

I got a copy of SPARK (I know there are newer projects, but I've used SPARK before so know what to expect), wrote a Scanner and a Parser and after a few hours got the SAS code parsed, and another hour or so adding support so errors are reported as "line 34 position 56" instead of byte position. With only a bit more work I got a code generator to traverse the parse tree and emit C.

I didn't like the C output. The original SAS code contains comments which provide information to the reader about what was going on. For example, the top of the file had comments about the source of the data used to generate the model and the input parameters, and bits of the code itself were also commented. I think the people who wrote the code which generated the SAS output did a good job of making the result look nice.

By comparison, my C code was flat and unreadable. I lost the comments and the spaces between sections. I tried a workaround. It turns out that the comments were only used where a SAS statement could occur so I told the tokenizer to return a "comment" token and added that as a valid statement. I had to do the same with blank lines.

The result was still ugly. Sometimes the comments are on the same line and to the right of the original code. My hack didn't allow that; those comments ended up below the corresponding C code.

Attempt 2: So I tried another tack. I had the scanner capture the line numbers of each token and comment and return the token list and comment list as independent objects. I figured when I generated the C output from the tokens I could also tell it to add the comments which corresponded to that line number. I worked on this for a while, and fixed a BNF rule which was right-recursive instead of left-recursive (or vice versa) which really sped up the parsing.

The resulting code just felt like ooze. Part of the complication was support for code rearrangment. The SAS code used forward references for its "link" statements (those GOSUBS) which I didn't want in the C code because I think code is more graceful without prototypes. It's different for code meant to be used as a library. In this case though everything is in a single file and the only code exposed to the outside world is through Python's initmodule function.

Once I got it working I needed to make it deal with a bit of syntax rewriting. The SAS code uses a test against ".z" to tell if a variable is defined or not. if VAR > .z then ... means "if VAR is defined". My Python code supports the same idea by testing against None, so the equivalent would be if VAR is not None.

C doesn't support the concept of "undefined value". One way to fake it is to have a special bit pattern to indicated undefined, but I didn't want to do that. (Capt'n Crunch taught well the problems of in-band signalling.) I decided to have two variables, "VAR_defined" is 1 when VAR is defined, 0 otherwise, and "VAR_value" is the value of the Python variable when VAR is defined, otherwise a special bit pattern (just to be on the safe side).

Only some of the variables may be undefined. It depends on what parameters were used to make the model. For variables which are allowed to be undefined I want the C/Python interface to set the corresponding values, but for those which must be defined I want to raise a TypeError if the Python value is None. To figure out which variables are which I need to scan the parse tree (uggh, yes, it's just the visitor pattern but still uggh). And to generate the right code I need to special case test the BinaryOp node in case the left is a variable and the right is UNDEF. Or I could do both in one step with some tree rewriting to convert the BinaryOp subtree into a special "IsDefined('VAR')" node. I know there are tree rewrite languages but the only one I've tried was XSLT, which I found nasty, brutish, and long.

Also, some of the variables may be 'continuous' while others are 'categorical', which are handled as Python floats and ints and C doubles and longs. These are documented in the SAS code through structured comments.

After thinking about the amount of code I would need to add to support tree rewriting I decided to chuck it and do ...

Attempt 3: Remember I started by saying that I've done a huge amount of data parsing? I even wrote a parser generator called Martel, which uses a regular expressions to convert a flat-file into SAX 2 events for XML. The idea of Martel is that formats which aren't designed to be parsed by a scanner/parser combination are almost all regular grammers -- highly stateful, in that the same text likely means different things in different parts of the file -- but regular.

What if I used the same approach here? I started setting up a Martel grammer. One for the structured comments in the header, another for the main driver code, and one for the GOSUB units. I also wrote an adapter for the effbot's elementtree to turn those SAX events into a Python data stucture which also supports a subset of XPath.

Ahh, much nicer. The biggest initial problem (as usual) was getting the right regexp definitions. One of Martel's failings is that it doesn't do a good job of error reporting. What I would like is some GUI where I have my Martel definition in one pane, target text in another, and something to show where the parsing failed and why. ActiveState has something like that based on Perl regexps, but they aren't as powerful (for my needs) as Martel; I want to see the parsed XML tree as well.

The next problem was what to do with the code blocks once I got them. While SAS is a context-free language, the code I'm dealing with has at most two levels, caused by an if-statement. A regexp language definition could handle it easily, but the result would still require manipulations of a tree.

The last problem is my clients don't have Martel installed on their system. I would rather keep everything limited to a single file, an executable Python program. Leading us to ...

Attempt 4: This time I treated the whole program as one string. The code is machine generated and very consistent and well-structured so I can assume there won't be variations in the line structure. This means I can use a few regexps (with 'findall') to extract data from the structured comments and to find out which variables are allowed to be undefined.

Even better, it turns out that each line can be uniquely categorized solely by looking at the line and the category test can be a regular expression. And even better, the conversion to C code can almost always be done as an re.sub! So the following works

_conversions = (
    (re.compile(r"\s*$"), None),
    (re.compile(r"\s*/\*.*\*/$"), None),
    (re.compile(r"tnscore = 0\.0;"), None),
    (re.compile(r"TN(\d+):$"), r"=FUNCTION=TN\1="),

    # Leave all the labels
    (re.compile(r"[A-Za-z_][A-Za-z0-9_]*:$"), None),

    # The (\s+) is used to keep the indentation the same.
    # The test against '.z' is a check for undefined values
    (r"(\s+)if (%s) > \.z then (goto %s);" % (_varname, _varname),
      r"\1if (\2_defined) {\3;}"),

def convert_text(s, conversions):
    lines = s.split("\n")
    for lineno in range(len(lines)):
        line = lines[lineno]
        for pat, conv in conversions:
            if pat.match(line):
                if conv is not None:
                    lines[lineno] = pat.sub(conv, line)
            raise TypeError("What do I do with %r?" % (line,))
    return "\n".join(lines)

Granted, it's O(number of lines * number of patterns) but there are under 20 patterns. And it requires the input be in the right format, but I'm guaranteed that that's the case. I do have some sanity tests to catch and report the major errors, like passing in a totally wrong data, and of course I do verify that every line starts as expected.

In the above I place a special marker, '=FUNCTION=TN\1=' in the converted code. This is where the function definition should be, which is done later in the processing. Oh, and notice that my definitions stop at the ';' (end of statement delimiter in SAS) and don't go to the end of the line. The code I'm dealing with has at most one statement per line, so this ignores any comments which might be on the line. Both SAS and C use /* this comment style */ so that was a nice way to keep the comments unchanged. Though were I feeling more stringent I would also require the comments fit in the regexp for the line.

Some 700 more lines later and 430 line of regression tests and I'm done. This compares quite favorably to attempt 2 where my incomplete solution (didn't have the tests for undefined values, nor support for categorical variables, nor tests, nor command-line driver) was already at 930 lines.

Okay, so what's the take-home lesson, and why are you the reader still here after about 1,700 words of essay? I still haven't figured it out myself. Why did I choose the context-free approach? Because the source data is a context-free language and you're supposed to use parser generators to read them. And because SPARK and related tools are cool, so perhaps I was distracted by their power.

I wrote Martel in part because presentation information is important. Most parsers keep only the semantic data, tossing syntax. One of my driving use cases was to mark up a file for HTML, leaving the original format untouched, adding only hyperlinks and color. But in my case I wanted to convert the input and include the syntactic information about where comments and extra newlines are located. And in this SAS conversion case I also want to keep the non-semantic information about the commments and the newlines intact.

Maybe there's some standard way to do this for CF systems I just haven't learned about? It seems if everything (token, comment, newline) had a location, and if the translation system looked at a element in the token stream, got the location, and said "emit text at this location". But no, that doesn't work because of the code rearrangement, unless there was better system support for retargeting locations.

One lesson is that if you're dealing with machine generated code then take advantage of the guarantee that the code won't be that varied. Part of the reason the regexp approach works is because there are fewer than 20 line types.

*sigh* I don't think there is a good lesson here. Just another week in the school of experience.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB