Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2008/03/09/python4ply_tutorial_1

python4ply tutorial, part 1

The following is an excerpt from the python4ply tutorial. python4ply is a parser for the Python language which uses PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

What is python4ply?

python4ply is a Python parser for the Python language. The grammar definition uses PLY, a parser system for Python modelled on yacc/lex. The parser rules use the "compiler" module from the standard library to build a Python AST and to generate byte code for a .pyc file.

You might use python4ply to experiment with variations in the Python language. The PLY-based lexer and parser are much easier to change than the C implementation Python itself uses or even the ones written in Python which are part of the standard library. This tutorial walks through examples of how to make changes in different levels of the system.

If you only want access to Python's normal AST, which includes line numbers and byte positions for the code fragments, you should use the _ast module.
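As a quick illustration of what the stdlib route already gives you, here is a sketch using the `ast` module (a thin wrapper over `_ast`, added in Python 2.6) to show the position information attached to each node:

```python
# The stdlib parser already reports line numbers and column offsets,
# via the `ast` module (a thin wrapper over `_ast`).
import ast

tree = ast.parse("amount = 10000000\n")
assign = tree.body[0]          # the Assign node for "amount = ..."
print(assign.lineno, assign.col_offset)   # → 1 0
```

If all you need is position-annotated parse results, this is far less work than a custom parser.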

Reminiscing, fabrications, and warnings

A long time ago I had a class assignment to develop a GUI using only drawpoint and drawtext primitives. Everything - buttons, text displays, even the mouse pointer itself - was built on those primitives. It gave me the strange feeling of knowing that GUIs are completely and utterly fake. There's no there there, and it's only through a lot of effort that it feels real. Those who aren't as old and grizzled as I am might get the same feeling with modern web GUIs. Those fancy sliders and cool UI effects are built on divs and spans and CSS and a lot of hard work. They aren't really there.

This package gives you the same feeling about Python. It contains a Python grammar definition for the PLY parser. The file python_lex.py is the tokenizer, along with some code to synthesize the INDENT, DEDENT and ENDMARKER tags. The file python_yacc.py is the parser. The result is an AST compatible with that from the compiler module, which you can use to generate Python byte code (".pyc" files).

There's also a python_grammar.py file which makes a nearly useless concrete syntax tree. This parser was created by grammar_to_ply.py, which converts the Python "Grammar" definition into a form that PLY can more easily understand. I keep it around to make sure that the rules in python_yacc.py stay correct. You might also find it useful if you want to port the grammar directly to yacc or some similar parser system.

What this means is that this package gives you, with some work, the ability to create a Python variant that runs on the Python VM, or, with a lot of work (like the Jython, PyPy, and IronPython developers put in), a first step toward making your own Python implementation.

If you think this sounds like a great idea, you're probably wrong. Down this path lies madness. Making a new language isn't just a matter of adding a new feature. The parts go together in subtle ways, and if you tweak the language and someone else tweaks the language a different way, then you quickly stop being able to talk to each other.

Lisp programmers are probably thinking now that this is just a half-formed macro system for Python. They are right. Once you have an AST you can manipulate it in all sorts of ways. But many experienced Lisp programmers will caution against the siren call of macros. Don't make a new language unless you know what dangerous waters you can get into.

On the other hand, it's a lot of fun. Someone has to make the cool new language of the future, so you've got to practice somewhere. And there are a few times when changing things at the AST or code generation level makes good sense.

Steve Yegge was right when he wrote "When you write a compiler, you lose your innocence."

Getting started

I'll start with something simple, to make sure everything works. Create the file "owe_me.py" with the following:

# owe_me.py
amount = 10000000
print "You owe me", amount, "dollars"

To byte-compile it, use the provided "compile.py" file. This is similar to "py_compile.py" from the standard library.
% python compile.py owe_me.py
Compiling 'owe_me.py'
% ls -l owe_me.pyc 
-rw-r--r--   1 dalke  staff  165 Feb 17 19:21 owe_me.pyc
%
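For comparison, here is the standard-library counterpart that compile.py resembles, a minimal sketch using the py_compile module (this compiles stock Python only, with none of the python4ply extensions; the print() call form is used so the sample also parses on Python 3):

```python
# Byte-compile a file with the stdlib py_compile module -- the
# standard-library counterpart of python4ply's compile.py.
import os
import py_compile

# Write a small module to compile (print() form so it parses on Python 3 too).
with open("owe_me.py", "w") as f:
    f.write('amount = 10000000\nprint("You owe me", amount, "dollars")\n')

pyc_path = py_compile.compile("owe_me.py")
print(os.path.exists(pyc_path))   # the .pyc now exists on disk
```

On Python 3 the .pyc lands in a `__pycache__` directory rather than next to the source.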
Running this is a bit tricky because the .pyc file is only used when the file is imported as a module. The easiest way around that is to import the module via a command-line call.
% python -c 'import owe_me'
You owe me 10000000 dollars
%
(I thought it would be best to use the '-m' option but that seems to import the .py file before the .pyc file. Hmm, I should check into that some more.)

If you want to prove that it's using the .pyc generated by this "compile.py", try renaming the file:

% rm owe_me.pyc
% python compile.py owe_me.py
Compiling 'owe_me.py'
% mv owe_me.pyc you_owe_me.pyc
% python -c 'import you_owe_me'
You owe me 10000000 dollars
%
The compile.py script also supports a '-e' option, which executes the file after byte-compiling it instead of saving the byte-compiled form to a file.
% python compile.py -e owe_me.py
You owe me 10000000 dollars
%

Numbers like 1_000_000 - changing the lexer

Reading "10000000" is tricky, at least for humans. Is that 1 million or 10 million? You might be envious of Perl, which supports using "_" as a separator in a number:

% perl
$amount = 10_000_000;
print "You owe me $amount\n";
^D
You owe me 10000000
%

You can change the python4ply grammar to support that. The tokenization pattern for base-10 numbers is in python_lex.py in the function "t_DEC_NUMBER":

def t_DEC_NUMBER(t):
    r'[1-9][0-9]*[lL]?'
    t.type = "NUMBER"
    value = t.value
    if value[-1] in "lL":
        value = value[:-1]
        f = long
    else:
        f = int
    t.value = (f(value, 10), t.value)
    return t

Why do I return the 2-tuple of (integer value, original string) in t.value? The python_yacc.py code contains commented out code where I'm experimenting with keeping track of the start and end character positions for each token and expression. PLY by default only tracks the start position, so I use the string length to get the end position. I'm also theorizing that it will prove useful for those doing round-trip conversions and want to keep the number in its original presentation.
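To illustrate one use of that 2-tuple, here is a hypothetical helper (token_span and the stand-in Token class are my own illustration, not part of python4ply) that recovers the end position from the start position PLY records plus the original string's length:

```python
from collections import namedtuple

# Stand-in for a PLY LexToken; PLY itself records only the start position
# (lexpos), so the original string's length supplies the end position.
Token = namedtuple("Token", "type value lexpos")

def token_span(t):
    """Return (start, end) character offsets for a token whose value is
    the (converted value, original string) 2-tuple."""
    converted, original = t.value
    return (t.lexpos, t.lexpos + len(original))

t = Token("NUMBER", (10000000, "10000000"), 9)   # token starting at offset 9
print(token_span(t))   # → (9, 17)
```

Keeping the original string around also means a round-trip tool can re-emit the number exactly as the author wrote it.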

Okay, so change the pattern to allow "_" as a character after the first digit, like this:

    r'[1-9][0-9_]*[lL]?'
then modify the action to remove the underscore character. The new definition is:
def t_DEC_NUMBER(t):
    r"[1-9][0-9_]*[lL]?"
    t.type = "NUMBER"
    value = t.value.replace("_", "")
    if value[-1] in "lL":
        value = value[:-1]
        f = long
    else:
        f = int
    t.value = (f(value, 10), t.value)
    return t
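You can sanity-check the modified pattern and action outside the lexer. This is a standalone sketch (not python4ply code), using Python 3's int so there is no separate long branch:

```python
import re

# the modified token pattern, with "_" allowed after the first digit
DEC_NUMBER = re.compile(r'[1-9][0-9_]*[lL]?')

def convert(text):
    value = DEC_NUMBER.match(text).group(0)
    stripped = value.replace("_", "")      # same cleanup as the lexer action
    if stripped[-1] in "lL":               # drop a trailing long suffix
        stripped = stripped[:-1]
    return int(stripped, 10)

print(convert("20_000_000"))   # → 20000000
```

Note the regular expression still forbids a leading underscore, so "_1000" is not mistaken for a number.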

To see if it worked, I changed owe_me.py to use underscores, and I changed the value to prove that I'm using the new file instead of some copy of the old:

# owe_me.py
amount = 20_000_000
print "You owe me", amount, "dollars"
% python compile.py -e owe_me.py
You owe me 20000000 dollars
%

Questions or comments?


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB