Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2008/03/10/python4ply_tutorial_3

python4ply tutorial, part 3

The following is an except from the python4ply tutorial. python4ply is a Python parser for the Python language using PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

Creating regular expression pattern objects

Regular expressions are fun. The first contact I had with them was through DOS globbing, where "*.*" matched all files with an extension. Then I started using Unix, and started using Archie, which supported regular expressions. Hmm, that was in 1990. I read the documentation for regexps but I didn't understand them. Instead I mentally translated the glob "?" to "." and the glob "*" to ".*".

Luckily for me I was in college and I took a theory of automata course. I loved that course. It taught me a lot about how to think about computers as what they are - glorified state machines.

Other programmers also really like regular expressions, and languages like Perl, Ruby, and Javascript consider them so important that are given syntax level support. Python is not one of those languages, and someone coming from Ruby, where you can do

# lines.rb
File.open("python_yacc.py").each do |line|
  if line =~ /def (\w+)/
    puts "#{$1}\n"
  end  
end
will probably find the corresponding Python both tedious and (because of the separation between the pattern definition and use) harder to read:
# lines.py
import re

pattern = re.compile(r"def (\w+)")

for line in open("python_yacc.py"):
    m = pattern.match(line)
    if m is not None:
        print m.group(1)
This code is generally considered the best practice for Python. It could be made a bit shorter by using re.match instead of the precompiled pattern, but at the cost of some performance.

I'll give Perl5 regular expressions (as implemented by the 're' module) first-class syntax support for creating patterns. That will shorten the code by getting rid of the "import re" and the "re.compile()" call. Here's how I want the pattern creation to look like

pattern = m/def (\w+)/
This new syntax is vaguely modelled after Perl's. It must start with a "m/" and end with a "/" on the same line. Note that my new syntax might break existing code because
m=12
a=3
i=2
print m/a/i
is already valid.

The new token definition goes before the t_NAME definition, to prevent the NAME from matching first. This token returns a 2-ple of the regular expression pattern as a string, and the flags to pass to re.compile. I need to pass it back as basic types and not a pattern object because the bytecode generation only understands the basic Python types.

import re
_re_flags = {
    "i": re.IGNORECASE,
    "l": re.LOCALE,
    "m": re.MULTILINE,
    "s": re.DOTALL,
    #"x": re.VERBOSE, # not useful in this context
    "u": re.UNICODE,
}
def t_PATTERN(t):
    r"m/[^/]*/[a-z]*"
    m, pattern, opts = t.value.split("/")
    
    flags = 0
    for c in opts:
        flag = _re_flags.get(c, None)
        if flag is None:
            # I could pin this down to the specific character position
            raise_syntax_error(
                "unsupported pattern modifier %r" % (c,), t)
        flags |= flag
    # test compile to make sure that it's a valid pattern
    try:
        re.compile(pattern, flags)
    except re.error, err:
        # Sadly, re.error doesn't include the error position
        raise_syntax_error(err.message, t)
    t.value = (pattern, flags)
    return t


# This goes after the strings otherwise r"" is seen as the NAME("r")
def t_NAME(t):
    r"[a-zA-Z_][a-zA-Z0-9_]*"
    t.type = RESERVED.get(t.value, "NAME")
    return t

This PATTERN will be a new "atom" at the grammar level, which will correspond to a call to re.compile("pattern", options).

def p_atom_13(p):
    'atom : PATTERN'
    pattern, flags = p[1]
    p[0] = ast.CallFunc(ast.Name("_$re_compile"), [ast.Const(pattern),
                                                   ast.Const(flags)])
    locate(p[0], p.lineno(1))

See how I'm using the impossible variable name '_$re_compile'? That's going to be "re.compile" and I'll use the same trick I did with the DECIMAL support and insert the AST corresponding to

from re import compile as _$compile
at the start of the module definition,
def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    stmt.nodes.insert(0, ast.From("re", [("compile", "_$re_compile")], 0))
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))

I'll test this with a simple program

# pattern_test.py
data = "name: Andrew Dalke   country:  Kingdom of Sweden "
pattern = m/Name: *(\w.*?) *Country: *(\w.*?) *$/i
m = pattern.match(data)
if m:
    print repr(m.group(1)), "lives in", repr(m.group(2))
else:
    print "unknown"
% python compile.py -e pattern_test.py 
'Andrew Dalke' lives in 'Kingdom of Sweden'
%
and to see that it generates byte code
% python compile.py  pattern_test.py
Compiling 'pattern_test.py'
% rm pattern_test.py
% python -c 'import pattern_test'
'Andrew Dalke' lives in 'Kingdom of Sweden'
%

Adding a match operator

These changes make it easier to define a pattern, but not to use it. As another example of (fake?) Perl envy. I'm going to support its "=~" match syntax so that the following is valid:

# count_atoms.py
import time

# Count the number of atoms in a PDB file
# Lines to match looks like:
# ATOM   6312  CB  ALA 3 235      24.681  54.463 137.827  1.00 51.30
# HETATM 6333  CA  MYR 4   1       6.722  54.417  88.584  1.00 50.79
count = 0
t1 = time.time()
for line in open("nucleosome.pdb"):
  if line =~ m/(ATOM  |HETATM)/:
      count += 1
print count, "atoms in", time.time()-t1, "seconds"

This turned out to be very simple. I need a new token for "=~". Most of the simple tokens are defined in "python_tokens.py". I added "EQUALMATCH" in the list of tokens in the place shown here

 ...
PERCENTEQUAL %=
AMPEREQUAL &=
CIRCUMFLEXEQUAL ^=
EQUALMATCH =~

COLON :
COMMA ,
 ...

Note that this will break legal existing code, like

>>> a=~2
>>> a
-3
>>> 
The lexer doesn't need anything else because I've already defined a PATTERN token.

I need to decide the precedence level of =~. Is it as strong as "**" or as weak as "or", or some place in between? I decided to make it as weak as "or", which is defined by the "test" definition. Here's my new "p_test_4" function:

def p_test_4(p):
    'test : or_test EQUALMATCH PATTERN'
    # pattern.search(or_test)
    sym = gensym("_$re-")
    pattern, flags = p[3]
    p.parser.patterns.append((sym, pattern, flags))
    p[0] = ast.Compare(
        ast.CallFunc(ast.Getattr(ast.Name(sym), 'search'), [p[1]], None, None),
        [("is not", ast.Name("None"))])
    locate(p[0], p.lineno(2))

I got the AST definition by looking at

>>> from compiler import parse
>>> parse("pat.search(line) is not None")
Module(None, Stmt([Discard(Compare(CallFunc(Getattr(Name('pat'), 'search'),
[Name('line')], None, None), [('is not', Name('None'))]))]))
>>> 

And that's it! Well, I could add an optimization in this case and move the ".search" outside the loop, but that's an exercise left for the student.

Now I'll put a toe into evil, just to see how cold it is. I'm going to add support for

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (\w+)/:
        print repr($1)
That is, if the =~ matches then $1, $2, ... will match group 1, 2. Oh, and while I'm at it, if there's a named group then $name will retrieve it. And '$' will mean to get the match object itself.

To make it work I need some way to do assignment in the expression. Python doesn't really support that except through a hack I don't want to use, so I'll use another hack and change the bytecode generation stage.

I created a new AST node called "AssignExpr" which is like an "Assign" node except that it can be used in an expression. The compiler module doesn't know about it and it's hard to change the code through subclassing, so I patch the compiler and its bytecode generation code so it understands the new node type. These changes are in "compiler_patches.py" and the patches are done when the module is imported. Take a look at the module if you want to see what it does.

It doesn't escape my notice that with AssignExpr there's only a handful of lines needed for support assignment in an expression, like

if line = readline():
    print repr(line)
Before you do that yourself, read the Python FAQ for why Python doesn't support this.

To support the new pattern match syntax I need to make two changes to python_yacc.py. The first is to import the monkeypatch module:

import compiler_patches
then make the changes to the p_test_4 function to save the match object to the variable "$".
def p_test_4(p):
    'test : or_test EQUALMATCH PATTERN'
    # pattern.search(or_test)
    sym = gensym("_$re-")
    pattern, flags = p[3]
    p.parser.patterns.append((sym, pattern, flags))
    p[0] = ast.Compare(
        ast.AssignExpr([ast.AssName("$", "OP_ASSIGN")],
                       ast.CallFunc(ast.Getattr(ast.Name(sym), 'search'),
                                    [p[1]], None, None)),
        [("is not", ast.Name("None"))])
    locate(p[0], p.lineno(2))

Does it work? Try this program, which is based on the Ruby code I started with at the start of this tutorial section, oh so long ago.

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (\w+)/:
        # I don't yet have syntax support to get to the special '$'
        # variable so I have to get it from the globals dictionary.
        print repr(globals()["$"].group(1))
% python compile.py -e get_function_names.py
'gensym'
'raise_syntax_error'
'locate'
'bounds'
'text_bounds'
'extract_docstring'
'__init__'
'__init__'
'add_arg'
'add_star_arg'
'p_file_input_1'
'p_file_input_2'
'p_file_input_star_1'
'p_file_input_star_2'
'p_file_input_star_3'
    ...

Sweet!

With a bit more work (described in detail in the tutorial), I changed the parser to allow this Perl/Python fusion syntax.

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (?P<name>\w+) *(?P<args>\(.*\)) *:/:
        print repr($1), repr($args)


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB