I experimented with PyParsing but I couldn't figure out how to use it to parse an indentation-based language like Python. I gave up and tried PLY, which has an API very much like the SPARK library. After using it for a while now I prefer PLY over SPARK. Its error messages are better and its documentation was exactly right for me.
I looked around but could find no examples of how to use a lexer/parser pair to parse an indentation-based language. Python uses its own specialized tokenizer and parser designed for Python, and it didn't look easy to port to another parsing system. With some work I figured it out.
I ended up writing a filter (or rather three filters) between the Plex tokenizer and its parser. Plex sees all newlines and whitespace but knows to ignore them when inside of (parens) and to return only leading whitespace. My filters watch the tokenizer output stream and tweak a flag so the tokenizer can filter out non-leading whitespace.
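The whitespace rule can be sketched in a few lines of plain Python. This is not the GardenSnake code: the `Token` record and the token type names (`WS`, `NEWLINE`, `LPAR`, `RPAR`) are assumptions made for illustration, and unlike the flag-based design described above, this sketch does the dropping inside the filter itself so that it stays self-contained.

```python
from collections import namedtuple

# Hypothetical token record; a real lexer's tokens carry more fields.
Token = namedtuple("Token", "type value")

def drop_insignificant_whitespace(tokens):
    """Keep WS only at the start of a line, and drop both WS and
    NEWLINE while inside parentheses."""
    paren_depth = 0
    at_line_start = True
    for tok in tokens:
        if tok.type == "LPAR":
            paren_depth += 1
        elif tok.type == "RPAR":
            paren_depth -= 1

        if tok.type in ("WS", "NEWLINE") and paren_depth > 0:
            continue    # layout inside (parens) is never significant
        if tok.type == "WS" and not at_line_start:
            continue    # only leading whitespace matters
        yield tok
        at_line_start = (tok.type == "NEWLINE")

# The source text "  a = (1,\n   2)\n" as a hand-written token stream.
stream = [
    Token("WS", "  "), Token("NAME", "a"), Token("WS", " "),
    Token("EQ", "="), Token("WS", " "), Token("LPAR", "("),
    Token("NUMBER", "1"), Token("COMMA", ","), Token("NEWLINE", "\n"),
    Token("WS", "   "), Token("NUMBER", "2"), Token("RPAR", ")"),
    Token("NEWLINE", "\n"),
]
kept = [tok.type for tok in drop_insignificant_whitespace(stream)]
print(kept)
```

Only the leading `WS` and the final `NEWLINE` survive; the line break inside the parentheses disappears, which is what lets a multi-line expression reach the parser as a single logical line.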
What took the longest time was figuring out that there are three possible indentation states in Python: INDENT not allowed, INDENT may occur, INDENT required. I only had the first and last and without the middle one I couldn't come up with a set of conditions to make it work.
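Here is a minimal sketch of how the three states can drive INDENT/DEDENT generation. It is an illustration under stated assumptions, not the filter from the GardenSnake file: the `Token` record, the token type names and the function name `add_indentation_tokens` are hypothetical, and the input is assumed to contain `WS` tokens only at the start of a line.

```python
from collections import namedtuple

Token = namedtuple("Token", "type value")   # hypothetical token record

NO_INDENT, MAY_INDENT, MUST_INDENT = range(3)

def add_indentation_tokens(tokens):
    """Replace leading WS tokens with INDENT/DEDENT tokens."""
    levels = [0]            # stack of open indentation widths
    state = NO_INDENT
    at_line_start = True
    depth = 0               # width of the current line's leading WS
    for tok in tokens:
        if tok.type == "WS":
            depth = len(tok.value)
            continue                    # WS never reaches the parser
        if tok.type == "NEWLINE":
            if state == MAY_INDENT:
                state = MUST_INDENT     # COLON NEWLINE: a block must follow
            blank_line = at_line_start
            at_line_start = True
            depth = 0
            if not blank_line:
                yield tok
            continue
        if tok.type == "COLON":
            state = MAY_INDENT          # a block or a same-line statement
            at_line_start = False
            yield tok
            continue
        # Any other token is a "real" token.
        if state == MUST_INDENT:
            if depth <= levels[-1]:
                raise IndentationError("expected an indented block")
            levels.append(depth)
            yield Token("INDENT", "")
        elif at_line_start:
            if depth > levels[-1]:
                raise IndentationError("unexpected indent")
            if depth not in levels:
                raise IndentationError("inconsistent dedent")
            while levels[-1] > depth:
                levels.pop()
                yield Token("DEDENT", "")
        state = NO_INDENT
        at_line_start = False
        yield tok
    for _ in levels[1:]:                # close blocks still open at EOF
        yield Token("DEDENT", "")

# "if a:\n    b\nc\n" and "if a: b\n" as hand-written token streams.
block = [
    Token("NAME", "if"), Token("NAME", "a"), Token("COLON", ":"),
    Token("NEWLINE", "\n"), Token("WS", "    "), Token("NAME", "b"),
    Token("NEWLINE", "\n"), Token("NAME", "c"), Token("NEWLINE", "\n"),
]
one_liner = [
    Token("NAME", "if"), Token("NAME", "a"), Token("COLON", ":"),
    Token("NAME", "b"), Token("NEWLINE", "\n"),
]
print([t.type for t in add_indentation_tokens(block)])
print([t.type for t in add_indentation_tokens(one_liner)])
```

The middle state is what separates the two inputs: after the colon an INDENT may follow, but it only becomes required once a NEWLINE arrives before the next real token. With just "not allowed" and "required", the one-line form `if a: b` cannot be told apart from a missing block.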
I got the tokenizer mostly working using a trivial language. Python's grammar is a bit more complicated so I decided to implement a subset of Python which captures most of the indentation cases. What I eventually did was use the parser rules to create a Python AST for this new language. Let Python be my back-end. Doing that found several flaws in my logic, which I hope are all now fixed.
I used Python's woefully underdocumented "compiler" module for this. I know it just well enough to use it but not enough to help improve the documentation. There are parts of it which I just do because that's what other code does. (E.g., do I have to do syntax.check(tree)?)
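The general idea of handing a hand-built tree to Python as a back-end can be sketched with the standard `ast` module. This swaps in a different API from the one used above: the "compiler" package was removed in Python 3, so the sketch assumes Python 3.8 or later and does not show how the GardenSnake file itself builds its tree. The filename string `<gardensnake>` and the variable names are made up for the example.

```python
import ast

# Hand-built tree for the statement:  result = a + 2 * 3
# A parser rule would normally construct these nodes from its operands.
tree = ast.Module(
    body=[
        ast.Assign(
            targets=[ast.Name(id="result", ctx=ast.Store())],
            value=ast.BinOp(
                left=ast.Name(id="a", ctx=ast.Load()),
                op=ast.Add(),
                right=ast.BinOp(
                    left=ast.Constant(value=2),
                    op=ast.Mult(),
                    right=ast.Constant(value=3),
                ),
            ),
        )
    ],
    type_ignores=[],
)

# Nodes built by hand have no line numbers; compile() needs them.
ast.fix_missing_locations(tree)

code = compile(tree, "<gardensnake>", "exec")
namespace = {"a": 2}
exec(code, namespace)
print(namespace["result"])
```

Once the tree compiles, everything downstream (bytecode generation, execution, tracebacks) comes from Python for free, which is the appeal of treating it as the back-end.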
I decided to call the new language GardenSnake. It's a small snake you can play with. Here's the GardenSnake code with tokenizer, filters, parser, code generator and demo all in a single 695-line file.
Here are some bullet points about GardenSnake, from the comments at the top of the file:
- only 'def', 'return' and 'if' statements
- 'if' only has 'then' clause (no elif nor else)
- single-quoted strings only, content in raw format and encoded as "swapcase"
- numbers are decimal.Decimal instances (not integers or floats)
- no print statement; use the built-in 'print' function
- only < > == + - / * implemented (and unary + -)
- assignment and tuple assignment work
- no generators of any sort
- no ... well, no quite a lot
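Two of the bullets above change what ordinary-looking code means, and the effect can be reproduced in plain Python. This is only a sketch of the semantics: the helper names `gs_number` and `gs_string` are hypothetical and do not come from the GardenSnake file.

```python
from decimal import Decimal

def gs_number(text):
    """Hypothetical helper: every numeric literal becomes a Decimal."""
    return Decimal(text)

def gs_string(text):
    """Hypothetical helper: string contents are stored with case swapped."""
    return text.swapcase()

n = gs_number

# 4+1/3*2+6*(9-5+1), evaluated with Decimal arithmetic throughout
t = n("4") + n("1") / n("3") * n("2") + n("6") * (n("9") - n("5") + n("1"))
same = (t == n("34") + n("2") / n("3"))
fifth = n("1") / n("5")
label = gs_string("called with")

print(t, same)
print(fifth)
print(label)
```

Because division is Decimal division, `1/5` gives an exact `0.2` rather than a truncated integer or a binary float, and a string written in lower case comes out in upper case.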
Here's the demo program at the end of the file:
    print('LET\'S TRY THIS \\OUT')

    #Comment here
    def x(a):
        print('called with',a)
        if a == 1:
            return 2
        if a*2 > 10: return 999 / 4
            # Another comment here

        return a+2*3

    ints = (1, 2,
       3, 4,
    5)
    print('mutiline-expression', ints)

    t = 4+1/3*2+6*(9-5+1)
    print('predence test; should be 34+2/3:', t, t==(34+2/3))

    print('numbers', 1,2,3,4,5)
    if 1:
     8
     a=9
     print(x(a))

    print(x(1))
    print(x(2))
    print(x(8),'3')
    print('this is decimal', 1/5)
    print('BIG DECIMAL', 1.234567891234567e12345)

and with the runtime for 'print' support the output is
    --> let's try this \out
    --> MUTILINE-EXPRESSION (Decimal("1"), Decimal("2"), Decimal("3"), Decimal("4"), Decimal("5"))
    --> PREDENCE TEST; SHOULD BE 34+2/3: 34.66666666666666666666666667 True
    --> NUMBERS 1 2 3 4 5
    --> CALLED WITH 9
    --> 249.75
    --> CALLED WITH 1
    --> 2
    --> CALLED WITH 2
    --> 8
    --> CALLED WITH 8
    --> 249.75 3
    --> THIS IS DECIMAL 0.2
    --> big decimal 1.234567891234567E+12345
    Done
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.
Copyright © 2001-2013 Andrew Dalke Scientific AB