/home/writings/diary/archive/2008/03/09/python4ply_tutorial_2

## python4ply tutorial, part 2

The following is an excerpt from the python4ply tutorial. python4ply is a Python parser for the Python language. It uses PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

## Syntax support for decimal numbers

How about something more complicated? Python's "decimal" module provides a base 10 (decimal floating point) numeric type, which is especially useful for those dealing with money. Here's an obvious limitation of doing base 10 calculations in base 2. I stole it from the decimal documentation.

>>> 1.0 % 0.1
0.09999999999999995
>>> import decimal
>>> d = decimal.Decimal("1.0")
>>> d
Decimal("1.0")
>>> d / decimal.Decimal("0.1")
Decimal("10")
>>>
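The session above puts a float remainder next to a decimal division. For a like-for-like comparison, here is a minimal sketch that computes the same remainder both ways. The variable names are mine, and the print calls are written so they behave the same on Python 2 and Python 3.

```python
from decimal import Decimal

# Binary floating point: 0.1 has no exact base 2 representation,
# so the remainder comes out just under 0.1 instead of 0.
float_remainder = 1.0 % 0.1

# Decimal arithmetic works in base 10, so 0.1 divides 1.0 exactly.
decimal_remainder = Decimal("1.0") % Decimal("0.1")

print("float   %r" % float_remainder)
print("decimal %s" % decimal_remainder)
```

The decimal result is an exact zero, which is the behavior the new literal syntax is meant to make easy to reach.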
The normal way to create a decimal number is to "import decimal" then use "decimal.Decimal". I'm going to add grammar-level support so that "0d12.3" is the same as decimal.Decimal("12.3"). There are a few complications so I'll walk you through how to do this.

I need a new DECIMAL token type that matches "0[dD][0-9]+(\.[0-9]+)?". This allows "0d1.23" and "0D1" and "0d0.89" but not "0d.2" nor "0d6." Feel free to change that if you want. Bear in mind possible ambiguities: does "0d1.x" mean the valid "Decimal('1').x" or the syntax error "Decimal('1.') x"? What about "0d1..sqrt()"?

Designing a new programming language really means having to pay attention to nuances like this.
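One way to explore these nuances is to try the pattern on the example strings directly. This is a minimal sketch using the standard "re" module. It assumes the lexer tries the pattern at the current position the way re.match does, which only approximates how PLY applies its combined rules. The helper name decimal_prefix is mine.

```python
import re

# The DECIMAL pattern from the text, unchanged.
DECIMAL_PATTERN = re.compile(r"0[dD][0-9]+(\.[0-9]+)?")

def decimal_prefix(text):
    """Return the leading part of text that the pattern consumes, or None."""
    match = DECIMAL_PATTERN.match(text)
    if match is None:
        return None
    return match.group(0)

samples = ["0d1.23", "0D1", "0d0.89", "0d.2", "0d6.", "0d1.x", "0d1..sqrt()"]
for text in samples:
    print("%-12s -> %r" % (text, decimal_prefix(text)))
```

Under this assumption "0d.2" does not match at all, while "0d6." and "0d1.x" both match only the digits before the dot. The trailing "." is left for the next token, so whether the whole input is accepted is then up to the parser.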

The DECIMAL rule is simple, in part because limitations on what can be saved in the byte code mean the creation of the decimal object must be deferred until later. Just like with the t_BIN_NUMBER rule, this new t_DECIMAL rule must go before t_OCT_NUMBER so there's no confusion.

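def t_DECIMAL(t):
    r"0[dD][0-9]+(\.[0-9]+)?"
    # Drop the leading "0d" / "0D"; keep the digits as a string for later.
    t.value = t.value[2:]
    return t

def t_OCT_NUMBER(t):
    r"0[0-7]*[lL]?"
    t.type = "NUMBER"
    # (the rest of the existing rule is not shown here)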
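Why the ordering matters can be sketched without PLY. The snippet below assumes rules are tried in definition order and the first one that matches wins, which is how I understand PLY treats function-based rules. The rule lists and the first_token helper are hypothetical stand-ins, not part of python_lex.py.

```python
import re

OCT_NUMBER = ("OCT_NUMBER", r"0[0-7]*[lL]?")
DECIMAL = ("DECIMAL", r"0[dD][0-9]+(\.[0-9]+)?")

def first_token(rules, text):
    """Return (rule name, matched text) for the first rule that matches."""
    for name, pattern in rules:
        match = re.match(pattern, text)
        if match:
            return name, match.group(0)
    return None

# Octal rule first: it grabs the lone "0" and leaves "d1.0" behind.
print(first_token([OCT_NUMBER, DECIMAL], "0d1.0"))

# Decimal rule first: the whole literal is consumed as one token.
print(first_token([DECIMAL, OCT_NUMBER], "0d1.0"))
```

The sketch is only an illustration; the change to python_lex.py is just the t_DECIMAL rule shown above.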

If you save this and try it out on the following program

# div.py
print "float", 1.0 % 0.1
print "decimal", 0d1.0 % 0d0.1
you'll see
% python compile.py -e div.py
Traceback (most recent call last):
File "compile.py", line 76, in <module>
execfile(args[0])
File "compile.py", line 43, in execfile
tree = python_yacc.parse(text, source_filename)
File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2607, in parse
parse_tree = parser.parse(source, lexer=lexer)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/yacc.py", line 237, in parse
lookahead = get_token()     # Get the next token
File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 657, in token
x = self.token_stream.next()
File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 609, in add_endmarker
for tok in token_stream:
File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 534, in synthesize_indentation_tokens
for token in token_stream:
File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 493, in annotate_indentation_state
for token in token_stream:
File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 435, in create_strings
for tok in token_stream:
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/lex.py", line 305, in token
func.__name__, newtok.type),lexdata[lexpos:])
ply.lex.LexError: /Users/dalke/src/python4ply-1.0/python_lex.py:203: Rule 't_DECIMAL' returned an unknown token type 'DECIMAL'

The list of known token type names is given in the 'tokens' variable, defined at the top of python_lex.py. I'll add "DECIMAL" to the list:

tokens = tuple(python_tokens.tokens) + (
    "NEWLINE",

    "NUMBER",
    "NAME",
    "WS",
    "DECIMAL",

    "STRING_START_TRIPLE",
    "STRING_START_SINGLE",
    ....

With that change I get a new error message. Whoopie for me!

% python compile.py -e div.py
Traceback (most recent call last):
File "compile.py", line 76, in <module>
execfile(args[0])
File "compile.py", line 43, in execfile
tree = python_yacc.parse(text, source_filename)
File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2607, in parse
parse_tree = parser.parse(source, lexer=lexer)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/yacc.py", line 346, in parse
tok = self.errorfunc(errtoken)
File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2488, in p_error
python_lex.raise_syntax_error("invalid syntax", t)
File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 27, in raise_syntax_error
_raise_error(message, t, SyntaxError)
File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 24, in _raise_error
raise klass(message, (filename, lineno, offset+1, text))
File "div.py", line 3
print "decimal", 0d1.0 % 0d0.1
^
SyntaxError: invalid syntax

That's because the parser doesn't know what to do with a DECIMAL. What do you think it should do? The ast.Const node only takes a string or a built-in numeric value. It doesn't take general Python objects because those can't be marshalled into bytecode.
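The marshalling limit is easy to check with the standard "marshal" module, which is what serializes constants into compiled code. This is a small sketch; the can_marshal helper is my own name.

```python
import marshal
from decimal import Decimal

def can_marshal(value):
    """Return True if marshal can serialize value, False otherwise."""
    try:
        marshal.dumps(value)
        return True
    except ValueError:
        # marshal raises ValueError for types it does not support.
        return False

for value in ["12.3", 12.3, 12, Decimal("12.3")]:
    print("%r -> %s" % (value, can_marshal(value)))
```

Strings and built-in numbers pass, and the Decimal instance does not, so a Decimal cannot simply be stored as a constant.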

I'll wait a moment for you to think about it.

Thought enough? No? Okay, just a moment more.

This new token should correspond to making a new Decimal object at that point. You might think you could be more clever than that and create the decimals during module imports, like I will do for the regular expression definitions coming later on in this tutorial. That would make the object creation occur only once, instead of once for each function call or for every time through a loop. But a decimal object depends on a global/thread-local context, and if I move the decimal creation then I might create it in the wrong context.
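How much the context matters depends on the operation. As far as I can tell from the decimal documentation, constructing a Decimal from a string keeps every digit, while arithmetic is rounded to the active context. The sketch below shows that difference. It uses decimal.localcontext, so it assumes Python 2.6 or later (or Python 3), and the variable names are mine.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 3                        # a deliberately tiny precision
    constructed = Decimal("12.3456")    # the constructor keeps all digits
    rounded = +Decimal("12.3456")       # unary plus applies the context

# Outside the block the default context (28 digits) is active again.
outside = +Decimal("12.3456")

print("constructed %s" % constructed)
print("rounded     %s" % rounded)
print("outside     %s" % outside)
```

Creating each Decimal where the literal appears keeps the behavior identical to writing decimal.Decimal("12.3") by hand at that spot, whatever the surrounding code does with the context.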

To make my life easier, I'm going to import the Decimal class as the super seekret module variable "_$Decimal". This is a variable name that can't occur in normal Python (because of the "$") and which is hidden from "... import *" statements (because of the leading "_"). That way the object creation is mostly a matter of calling "_$Decimal(s)" in the right place, which I can only do by constructing the AST myself.
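Both properties of the hidden name can be checked with a small sketch. The module name "compiled_demo" and the attribute "price" are hypothetical; the module object stands in for a file compiled by the modified parser.

```python
import sys
import types
from decimal import Decimal

# A stand-in for a module whose code was compiled with the hidden import.
mod = types.ModuleType("compiled_demo")
mod.__dict__["_$Decimal"] = Decimal   # a module dict accepts any string key
mod.price = Decimal("12.3")
sys.modules["compiled_demo"] = mod

def is_valid_source(text):
    """Return True if text compiles as an ordinary Python expression."""
    try:
        compile(text, "<demo>", "eval")
        return True
    except SyntaxError:
        return False

# "import *" skips names that start with an underscore.
importer = {}
exec("from compiled_demo import *", importer)

print(is_valid_source("_$Decimal('12.3')"))   # the name cannot be typed in source
print("_$Decimal" in importer)                # and it is not star-imported
print("price" in importer)                    # ordinary names still are
```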

What will that look like? I'll use the compiler package to show what that AST should look like:

>>> import compiler
>>> compiler.parse("from decimal import Decimal as D")
Module(None, Stmt([From('decimal', [('Decimal', 'D')], 0)]))
>>> compiler.parse("Decimal('12.345')")
Module(None, Stmt([Discard(CallFunc(Name('Decimal'),
[Const('12.345')], None, None))]))
>>>

The new DECIMAL token can go anywhere a NUMBER or a NAME can go. That's an "atom" in the Python grammar.

atom: ('(' [yield_expr|testlist_gexp] ')' |
'[' [listmaker] ']' |
'{' [dictmaker] '}' |
'`' testlist1 '`' |
NAME | NUMBER | STRING+)
The last three of these are defined in python_yacc.py as:
def p_atom_9(p):
    'atom : NAME'
    p[0] = ast.Name(p[1])
    locate(p[0], p.lineno(1))#, text_bounds(p, 1))

def p_atom_10(p):
    'atom : NUMBER'
    value, orig_text = p[1]
    p[0] = ast.Const(value)
    locate(p[0], p.lineno(1))#, (p.lexpos(1), p.lexpos(1) + len(orig_text)))

def p_atom_11(p):
    'atom : atom_plus'
    # get the STRING (atom_plus does the string concatenation)
    s, lineno, span = p[1]
    p[0] = ast.Const(s)
    locate(p[0], lineno)#, span)

They are simple because the AST nodes are designed for Python. Nearly every token type and statement type maps directly to an AST node. The "locate" function assigns a line number to each created node, and in the commented-out arguments you can see some of my experimental work that also assigns start and end byte locations.

Here's the new definition for DECIMAL, which is a bit more complex because I need to call _$Decimal. Remember that I can't simply use an ast.Const containing a decimal.Decimal because the byte code generation only supports strings and numbers.

def p_atom_12(p):
    "atom : DECIMAL"
    decimal_string = p[1]
    p[0] = ast.CallFunc(ast.Name("_$Decimal"),
                        [ast.Const(decimal_string)], None, None)
    locate(p[0], p.lineno(1))
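The "compiler" package does not exist in Python 3, so here is a rough analogue of the same idea using the modern "ast" module. It is a sketch under assumptions, not python4ply code: it assumes Python 3.8 or later, and the helper decimal_call is my own name. It builds the call node by hand, shows the failure when the hidden name is unbound, and then supplies the binding manually.

```python
import ast
from decimal import Decimal

def decimal_call(decimal_string):
    """Build the tree for a call of the hidden name on decimal_string."""
    return ast.Call(
        func=ast.Name(id="_$Decimal", ctx=ast.Load()),
        args=[ast.Constant(value=decimal_string)],
        keywords=[],
    )

# Tree equivalent of the expression 0d1.0 % 0d0.1
expr = ast.BinOp(left=decimal_call("1.0"), op=ast.Mod(),
                 right=decimal_call("0.1"))
tree = ast.fix_missing_locations(ast.Expression(body=expr))
code = compile(tree, "<decimal-sketch>", "eval")

# With no binding for the hidden name, evaluation raises NameError.
try:
    eval(code, {})
    missing_name_error = None
except NameError as err:
    missing_name_error = str(err)

# Supplying the binding by hand stands in for a module-level import.
result = eval(code, {"_$Decimal": Decimal})

print(missing_name_error)
print(result)
```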

At this point running the code should fail because _$Decimal doesn't exist.

% python compile.py -e div.py
yacc: Warning. Token 'WS' defined, but not used.
yacc: Warning. Token 'STRING_START_SINGLE' defined, but not used.
yacc: Warning. Token 'STRING_START_TRIPLE' defined, but not used.
yacc: Warning. Token 'STRING_CONTINUE' defined, but not used.
yacc: Warning. Token 'STRING_END' defined, but not used.
/Users/dalke/src/python4ply-1.0/python_yacc.py:2473: Warning. Rule 'encoding_decl' defined, but not used.
yacc: Warning. There are 5 unused tokens.
yacc: Warning. There is 1 unused rule.
yacc: Symbol 'encoding_decl' is unreachable.
yacc: Generating LALR parsing table...
float 0.1
decimal
Traceback (most recent call last):
File "compile.py", line 76, in <module>
execfile(args[0])
File "compile.py", line 48, in execfile
exec code in mod.__dict__
File "div.py", line 3, in <module>
print "decimal", 0d1.0 % 0d0.1
NameError: name '_$Decimal' is not defined

Why are the 'yacc:' messages there? PLY uses a cached parsing table for better performance. When it notices a change in the grammar it invalidates the cache and rebuilds the table based on the new grammar. What you're seeing here are the messages from the rebuild.

Why is the exception there? Because the function call uses _$Decimal but that name doesn't exist. Why does it report line 3 even though I only assigned a line number to the ast.CallFunc and not the ast.Name, which is what actually failed? Because the AST generation code in the compiler module doesn't always assign line numbers, so the byte code generation step assumes it's the same as the line number for the previously generated instruction.

For extra credit, why does the following report the error on line 3 instead of line 1?

def p_atom_12(p):
    "atom : DECIMAL"
    decimal_string = p[1]
    p[0] = ast.CallFunc(ast.Name("_$Decimal"),
                        [ast.Const(decimal_string)], None, None)
    locate(p[0], 1)  # Why doesn't this report the error on line 1?

The last bit of magic is to import the Decimal constructor correctly. The root term in the Python grammar is "file_input". (There's another root if you're doing an 'eval'.) It has two cases: one for an empty file and the other for a file that contains statements. The code as distributed looks like this:

def p_file_input_1(p):
    "file_input : ENDMARKER"
    # Empty file
    stmt = ast.Stmt([])
    locate(stmt, 1)#, (None, None))
    p[0] = ast.Module(None, stmt)
    locate(p[0], 1)#, (None, None))

def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))

By definition the empty file can't have any decimal literals in it, so I'll only worry about p_file_input_2. But I won't worry much. For instance, for now I won't worry that the file can contain __future__ statements. These must go before any statement other than the doc string. (If you really want to worry about that then feel free to worry. And also worry that in older Pythons "as" and "with" were not reserved words.)

I'll insert the new import statement as the first statement in the created module.

def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    # Equivalent to: from decimal import Decimal as _$Decimal
    stmt.nodes.insert(0, ast.From("decimal", [("Decimal", "_$Decimal")], 0))
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))

That's it.

That was it?

Yes, that was it. Want to see it work?

% cat div.py
# div.py
print "float", 1.0 % 0.1
print "decimal", 0d1.0 % 0d0.1
% python compile.py -e div.py
float 0.1
decimal 0.0
%
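For readers who do want to worry about __future__ statements, here is one way the import injection might look with Python 3's "ast" module. This is a sketch under assumptions rather than the tutorial's code: it assumes Python 3.8 or later, the function name inject_decimal_import is mine, and it places the hidden import after the docstring and after any leading __future__ imports.

```python
import ast
from decimal import Decimal

def inject_decimal_import(tree):
    """Insert the hidden Decimal import into a parsed module tree."""
    hidden_import = ast.ImportFrom(
        module="decimal",
        names=[ast.alias(name="Decimal", asname="_$Decimal")],
        level=0,
    )
    body = tree.body
    position = 0
    # Skip the docstring, if the module has one.
    if ast.get_docstring(tree, clean=False) is not None:
        position = 1
    # Skip any __future__ imports, which must stay at the top.
    while (position < len(body)
           and isinstance(body[position], ast.ImportFrom)
           and body[position].module == "__future__"):
        position += 1
    body.insert(position, hidden_import)
    return ast.fix_missing_locations(tree)

source = '"""doc"""\nfrom __future__ import division\nx = 1\n'
tree = inject_decimal_import(ast.parse(source))

namespace = {}
exec(compile(tree, "<inject-sketch>", "exec"), namespace)
print(namespace["_$Decimal"] is Decimal)
```

The same placement logic could be applied in p_file_input_2 by scanning stmt.nodes for leading __future__ imports before inserting, though I have not tested that against python4ply itself.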

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB