/home/writings/diary/archive/2006/08/28/levels_of_abstraction

Levels of abstraction

I decided to compare Twisted and Allegra for a simple task using asynchronous I/O. As a start for that I wanted to download a web page and display the results to stdout.

Fetch a web page with Twisted

I did some Twisted programming in the 1.x days but haven't done async I/O since then. Here's the Twisted code. The interface I'm using passes the response body into the callback as a single string.

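from twisted.internet import reactor
from twisted.web import client

def handleCallback(response):
    # Called with the complete response body as a single string.
    print response
    reactor.stop()

def handleErrback(err):
    # Called with a Failure describing what went wrong.
    print "Error:", err
    reactor.stop()

get_page = client.getPage("http://localhost:8081/")
get_page.addCallbacks(handleCallback, handleErrback)

# Nothing is fetched until the reactor's event loop is running.
reactor.run()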

It wasn't that hard to figure out, mostly because once I found the "getPage" method I used Google to find working code by Richard Townsend. It did take a while to find that method. Here's my tale. I went to the API page and looked for "http". Nothing. Going down the list the best fit by name was web but that's labeled "Twisted Web: a Twisted Web Server" and I want client. I checked out "internet" -- nope, that's where the reactors are and interfaces to TCP, threads, the serial port and quite a few non-internet things. I assume this is historical with TCP support first and the others only added later.

Looking under "protocols" I see the following for "http":

This module is DEPRECATED. It has been split off into a third party package, Twisted Web. Please see http://twistedmatrix.com/projects/web.

The URL goes to a bz2 file dated 2005-03-22 with version number 0.5.0. Assuming it's wrong I took the name "web" as a hint to look in the "twisted.web" package which I skipped earlier because it's "a Twisted Web Server." That title's wrong and it does contain client code. Oh, and I looked at "web2" but that's apparently incomplete. The "log" module description says "This is still in flux (even moreso than the rest of web2)" and quite a few modules, including "client", say "Undocumented".

The documentation irks me with its use of "I". Consider these from twisted.web:

client    HTTP client.
demo      I am a simple test resource.
distrib   Distributed web servers.
error     I am the Twisted.Web error resources and exceptions.
google    "I'm Feeling Lucky" with Google.
html      I hold HTML generation helpers.
server    This is a web-server which integrates with the twisted.internet
soap      SOAP support for twisted.web.
static    I deal with static resources.
sux       *S*mall, *U*ncomplicated *X*ML.
tap       I am the support module for creating web servers with 'mktap'
trp       I contain ResourceUnpickler, which will unpickle any python object
twcgi     I hold resource classes and helper classes that deal with CGI scripts.
vhost     I am a virtual hosts implementation.
widgets   A twisted web component framework.

I can't recall any other library documented in first person from the view of the code. There might be one but it's rare. I don't like it but suppose it's because of my lack of experience with it. If it's useful then it should be consistent. I see no consistency here. Why isn't "client" documented as "I contain HTTP client functionality" or something like that? The description for "error" is grammatically incorrect. It looks like someone liked first-person and prefixed "I am the" to the front of "Twisted.Web error resources and exceptions". The latter alone would be grammatical, shorter, easier to read and more consistent with existing practices.

In English there are differences between "I am", "I hold" and "I contain". Is it important here? I don't think so. The descriptions would be no less useful as

tap       Support module for creating web servers with 'mktap'
trp       Contains ResourceUnpickler, which unpickles any python object
twcgi     Resource classes and helper classes that deal with CGI scripts.

Or better, IMO, as

tap       Support web server creation with 'mktap'
twcgi     CGI script resource and helper classes

I left out "trp" because the description makes no sense (Python objects aren't unpicklable nor are they named files) and the function is otherwise undocumented.

To make my point clear, using first person singular like this in the documentation adds nothing but noise and its inconsistent usage makes it all the more jarring. Luckily, it seems mostly limited to the twisted.web code.

For those keeping track at home:

Fetch a web page with Allegra

I've not used Allegra before. Its author, Laurent Szyster, started with Sam Rushing's old Medusa code, which was incorporated into Python's standard library as asyncore and asynchat. Quoting the author:

Twisted and Allegra are two very different things. Twisted is a large set of complex libraries with support for a vast number of protocols and systems. Allegra is a small set of simple modules that supports only a minimal collection of web protocols and focuses on a single application.

Allegra's core delivers marginal but practical improvements over the original library, in all directions. So, even stripped off its applications, it still fully deserves its own name. Precisely because it is as simple as its predecessor.

There have been various vocal back-and-forths in blogspace between the Allegra developer and some of the Twisted people. The details are easy enough to dig up so I'm not going to bother with additional links.

What got me interested in Allegra is its support for HTTP/1.1 pipelining. In another project we have a search service which returns document hits. I wanted it to return a list of URLs, one per record, and have the client fetch the URLs it needs. Others pointed out that pipelining support isn't common enough for our goals so we decided the default would return all records combined into a single response.

They were right too. Twisted doesn't support HTTP/1.1 pipelining and neither does urllib. According to the comments the twisted.web2 code will support 1.1 but I don't know the schedule nor if that includes pipeline support. In the debate Laurent makes the strong claim that full HTTP/1.1 support in Twisted is hard. I am not competent enough to evaluate those claims. I just want to try out HTTP/1.1.

I also want to try out chunked encoding. In implementing my server (in TurboGears) I found that while I know the number of records to return I don't directly know the total byte size of the response. I'm going to precompute the size of each record but I would like to return, say, 100 records at a time. That puts an upper limit on memory use no matter the total number of records in the search results, and it means I don't have a possible mismatch between the precomputed size and the actual size.
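To make the idea concrete, here is a minimal sketch of chunked transfer encoding for batches of records. It is not the TurboGears server code; the names chunked_body and encode_chunk and the default batch size of 100 are assumptions made for illustration. It only relies on the HTTP/1.1 framing rule that each chunk is its size in hexadecimal, CRLF, the data, CRLF, with a zero-size chunk marking the end.

```python
def encode_chunk(data):
    # One chunk: size in hex, CRLF, the bytes themselves, CRLF.
    return ("%x" % len(data)).encode("ascii") + b"\r\n" + data + b"\r\n"

def chunked_body(records, batch_size=100):
    """Yield a chunked-encoded body, one chunk per batch of records.

    `records` is any iterable of byte strings. Only one batch is held
    in memory at a time, and no total Content-Length is needed.
    """
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            data = b"".join(batch)
            batch = []
            if data:  # a zero-size chunk would end the body early
                yield encode_chunk(data)
    data = b"".join(batch)
    if data:
        yield encode_chunk(data)
    # The terminating zero-size chunk (with no trailers).
    yield b"0\r\n\r\n"

records = [("rec%d\n" % i).encode("ascii") for i in range(250)]
chunks = list(chunked_body(records))
```

With 250 records this gives three data chunks (100, 100 and 50 records) followed by the terminator, and the sender never needs to know the total size up front.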

Here's the Allegra code for fetching a page. It uses a freshly checked out version from Subversion.

import sys
from allegra import (http_client, async_loop, finalization)

class CopyToFile(object):
    collector_is_simple = True
    def __init__(self, file=sys.stdout):
        self.file = file

    def collect_incoming_data(self, text):
        # Copy each block of the response body to the output file.
        self.file.write(text)

    def found_terminator(self):
        return True

dispatcher = http_client.connect("", 8081)
req = http_client.GET(dispatcher, "/")

I used my own collector class "CopyToFile" instead of the standard allegra.collector.File because the latter closes the file in found_terminator and I want to continue using sys.stdout after I've received the file.

Allegra's documentation pretty much does not exist. There are some hints in Laurent's blog but as I've not used asyncore I'm missing the basic understanding of how to put things together. There are no examples of using the HTTP client library, not even tests.

I am not a test-driven developer. I've tried to write the tests before writing the code. I almost invariably hate it. My understanding of how the code is supposed to be implemented changes while I write it. I end up spending more time rewriting the tests than I like and I find no benefit to that approach. I wait until the code has started to firm up before putting those tests in.

Perhaps that's the case here as Allegra is very new code. I don't think so as the API looks pretty stable. As a hint to the author, tests can also make for a good demo of how to use the library. (Though those are more like functional tests than unit tests.)

I still don't know the purpose of the collector object or how to compose collectors.

When I run the above code I get a wait of a few seconds before the program ends. I assume it's waiting for the pipeline timeout but I've not looked into that. I can't find something which says "I'm done; stop everything and shut down" which might be used by the proverbial "big red button" in a server's administrative interface.

Allegra enables debug messages so I got to see things like

  http-client-pipeline id="72a50"

in the output. Those messages weren't helpful to me. The author says:

# The Loginfo interface and implementation provide a simpler, yet more
# powerfull and practical logging facility than the one currently integrated
# with Python.

Why is it that everyone thinks they have a simpler, more powerful and more practical logging system? Then again I don't like logging systems. The ones I've seen are usually configured to dump just about everything, ending up in a spew of data which people end up ignoring. Like warnings from lint.

I don't like two things about Allegra's code style. I don't like the space after the function name in the def statement and I don't like the 8 character indentation. Here's an example of both along with the uncommon use of "== None" instead of "is None".

def GET (pipeline, url, headers=None):
        if headers == None:
                headers = {}
        return Request (pipeline, url, headers, 'GET', None)
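Why "is None" rather than "== None"? A small sketch, using a hypothetical AlwaysEqual class that is not part of Allegra: "==" asks the object on the left to decide, so a class with an unusual __eq__ can claim to equal None, while "is" checks identity and cannot be overridden. The get_headers function is only a stand-in for the default-argument pattern in GET above.

```python
class AlwaysEqual(object):
    """Hypothetical class whose __eq__ claims equality with everything."""
    def __eq__(self, other):
        return True
    def __ne__(self, other):
        return False

def get_headers(headers=None):
    # "is None" tests identity, so no user-defined __eq__ can fool it.
    if headers is None:
        headers = {}
    return headers

obj = AlwaysEqual()
equal_to_none = (obj == None)   # decided by AlwaysEqual.__eq__
identical_to_none = (obj is None)  # decided by object identity
print(equal_to_none)
print(identical_to_none)
```

For a plain dict the two tests give the same answer, which is why the Allegra code works. The "is None" form is simply the one that cannot be surprised.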

For those keeping track at home:

The lack of documentation really is that big of a problem. It took me some guessing and experimentation to figure out how things work.

Error handling in Allegra

Here's an example of Allegra logging an exception. In this case I used the wrong port number. (Newlines added to prevent overly wide text.)

  http-client-pipeline id="72a50"
    (61, 'Connection refused')
site-packages/allegra/async_chat.py | handle_read | 152
site-packages/allegra/async_core.py | recv | 160
After a decade of working with Python's normal traceback text I can parse it by eye very quickly. This new format is more terse but harder to read because it's new, and I hope that errors are rare enough that I won't get all that much practice reading a new format string.

Emacs and other IDEs can parse Python's normal traceback message and bring up the correct file to see the error location. They can't do that with this new format, at least not without someone writing new parser code.
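The new parser code need not be large. Here is a rough sketch that rewrites an Allegra-style frame line into the shape of a standard Python traceback line. The field order (path | function | line number) is my assumption based only on the two lines shown above, and the names ALLEGRA_FRAME and to_python_style are made up for this example.

```python
import re

# Matches lines like:
#   site-packages/allegra/async_chat.py | handle_read | 152
# The "path | function | line" layout is inferred from the sample output.
ALLEGRA_FRAME = re.compile(
    r"^\s*(?P<path>\S+)\s*\|\s*(?P<func>\w+)\s*\|\s*(?P<line>\d+)\s*$")

def to_python_style(line):
    """Return `line` in Python's traceback style, or unchanged if it
    does not look like an Allegra frame line."""
    m = ALLEGRA_FRAME.match(line)
    if m is None:
        return line
    return '  File "%s", line %s, in %s' % (
        m.group("path"), m.group("line"), m.group("func"))

converted = to_python_style(
    "site-packages/allegra/async_chat.py | handle_read | 152")
print(converted)
```

Piping the log through a filter like this would produce lines in the format existing tools already look for, though the relative path would still need to be resolved against the right directory.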

Here's the context so you can see what the code looks like.

        def send (self, data):
                "try to send data through a stream socket"
                try:
                        result = self.socket.send (data)
                        return result
                except socket.error, why:
                        if why[0] == EWOULDBLOCK:
                                return 0
                        return 0

This code makes sense to me.

Error handling in Twisted

By comparison, here's what the same error produces in Twisted. (I added a couple of newlines to make it easier to see on this page.)

Error: [Failure instance: Traceback (failure with no frames):
twisted.internet.error.ConnectionRefusedError: Connection was refused
by other side: 22: Invalid argument.
The "Error: " prefix was from me; the rest is from Twisted.

That's all it shows me. There are "no frames" so I don't even know which part of the code gave the problem. That's probably why the error object is so richly decorated. Each low-level error maps to a high-level error class which has some information about the error, though not the location of the error.

I noticed something strange about the exception. It reports error code 22 "Invalid argument" and not error code 61 "Connection refused" even though the error class is correct. That's strange.

I tried to track down why through code inspection. About an hour later I gave up. There's too much abstraction going on for my straight and narrow brain. In addition to being decomposable in every which way some of the protocols are restartable, like twisted.internet.tcp.BaseClient which has

        # If I have reached this point without raising or returning, that means
        # that the socket is connected.
        del self.doWrite
        del self.doRead
        # we first stop and then start, to reset any references to the old doRead

Why does a newly created connection need to stop reading and writing? And that's an ugly trick, setting instance variables doRead and doWrite to the bound method doConnect at the start, shadowing the class methods of the same names. I assume that's so the connection starts automatically on the first doRead or doWrite. Why not just use an "I've initialized" flag?
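To show what I mean, here is a stripped-down sketch of the two approaches. Neither class is Twisted code. ShadowClient and FlagClient are hypothetical stand-ins which only record what gets called.

```python
class ShadowClient(object):
    """Hypothetical stand-in for the shadowing trick (not Twisted's class)."""
    def __init__(self):
        self.log = []
        # Instance attributes hide the class methods of the same name, so
        # the first read or write event is routed to doConnect instead.
        self.doRead = self.doConnect
        self.doWrite = self.doConnect

    def doConnect(self):
        self.log.append("connect")
        # Deleting the instance attributes uncovers the class methods again.
        del self.doRead
        del self.doWrite

    def doRead(self):
        self.log.append("read")

    def doWrite(self):
        self.log.append("write")

class FlagClient(object):
    """The same behavior for doRead using an explicit flag."""
    def __init__(self):
        self.log = []
        self.connected = False

    def doConnect(self):
        self.log.append("connect")
        self.connected = True

    def doRead(self):
        if not self.connected:
            return self.doConnect()
        self.log.append("read")

shadow = ShadowClient()
shadow.doRead()    # actually runs doConnect
shadow.doRead()    # now the real doRead
shadow.doWrite()   # and the real doWrite

flag = FlagClient()
flag.doRead()      # connects first
flag.doRead()      # then reads
```

The shadowing version saves one attribute test per call once connected. The flag version costs that test but can be read top to bottom without knowing which doRead is live at the moment.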

It can't be for the performance. If I read this code correctly then Twisted HTTP requests have a lot of overhead. There are dozens of constructors and method calls. Doing a connectTCP creates a tcp.Connector which is-a abstract.FileDescriptor and _SocketCloser. A FileDescriptor is-a log.Logger and styles.Ephemeral.

Anyway, I gave up and inserted some code in the ConnectionRefusedError class to print the traceback. Here's the traceback (with extra newlines).

  File "async_blast.py", line 55, in ?
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/internet/posixbase.py", line 218, in run
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/internet/posixbase.py", line 229, in mainLoop
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/internet/selectreactor.py", line 133, in doSelect
    _logrun(selectable, _drdw, selectable, method, dict)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/python/log.py", line 53, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/python/log.py", line 38, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/internet/selectreactor.py", line 139, in _doReadOrWrite
    why = getattr(selectable, method)()
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/internet/tcp.py", line 535, in doConnect
    self.failIfNotConnected(error.getConnectError((connectResult, os.strerror(connectResult))))
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/internet/error.py", line 160, in getConnectError
    return klass(number, string)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/
site-packages/twisted/internet/error.py", line 105, in __init__
This is stacked too deep for my tastes but I'm getting used to that with TurboGears. The raised error comes from
        try:
            connectResult = self.socket.connect_ex(self.realAddress)
        except socket.error, se:
            connectResult = se.args[0]
        if connectResult:
            if connectResult == EISCONN:
                pass
            # on Windows EINVAL means sometimes that we should keep trying:
            # http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/connect_2.asp
            elif ((connectResult in (EWOULDBLOCK, EINPROGRESS, EALREADY)) or
                  (connectResult == EINVAL and platformType == "win32")):
                self.startReading()
                self.startWriting()
                return
            else:
                self.failIfNotConnected(error.getConnectError((connectResult, os.strerror(connectResult))))
                return

When I displayed the value for connectResult I was surprised:
connectResult 36 realAddress ('', 8082)
connectResult 22 realAddress ('', 8082)

I got two connectResult attempts, and neither is ECONNREFUSED(61); they are EINPROGRESS(36) and EINVAL(22). Let's see - non-blocking so the EINPROGRESS says to come back a bit later. That makes sense. But why EINVAL? My best guess comes from the possible error conditions mentioned on this man page:
The AF_INET socket is of type SOCK_STREAM, and a previous connect() has already completed unsuccessfully. Only one connection attempt is allowed on a connection-oriented socket.

My intuition suggests next examining whether the shutdown/restart of a newly opened Connection ends up doing duplicate connect_ex calls on the file handle. However, at the level of abstraction in Twisted it feels like I'm looking at the world through distantly separated tiny windows and I'm having a hard time figuring out what's going on. I don't want to figure it out and only got this far through stubbornness.

At least I was able to figure out why I get a ConnectionRefusedError class even when the errno is EINVAL:

    errnoMapping = {
        errno.ENETUNREACH: NoRouteError,
        errno.ECONNREFUSED: ConnectionRefusedError,
        errno.ETIMEDOUT: TCPTimedOutError,
        # for FreeBSD - might make other unices in certain cases
        # return wrong exception, alas
        errno.EINVAL: ConnectionRefusedError,
I'm on a MacOS X 10.3.9 box, which is a BSD derivative.
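The effect of that table can be sketched in a few lines. This is not Twisted's code. The class names (ConnectError, NoRoute, Refused, TimedOut) and the function get_connect_error are hypothetical, and the fallback to a generic class for unmapped codes is my assumption. The point is only that two different errno values lead to the same exception class while the original number is kept in the arguments.

```python
import errno
import os

class ConnectError(Exception):
    """Hypothetical base class for connection failures."""

class NoRoute(ConnectError):
    pass

class Refused(ConnectError):
    pass

class TimedOut(ConnectError):
    pass

ERRNO_MAPPING = {
    errno.ENETUNREACH: NoRoute,
    errno.ECONNREFUSED: Refused,
    errno.ETIMEDOUT: TimedOut,
    # The BSD workaround: EINVAL is reported as a refused connection.
    errno.EINVAL: Refused,
}

def get_connect_error(number):
    # Unmapped codes fall back to the generic class (an assumption here).
    klass = ERRNO_MAPPING.get(number, ConnectError)
    return klass(number, os.strerror(number))

err = get_connect_error(errno.EINVAL)
print(type(err).__name__)
```

So the class says "refused" while the arguments still carry EINVAL and its message, which matches the mismatch between the class name and "22: Invalid argument" in the output I saw.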

Twisted's Deferreds are based on the except/else model of Python exceptions. That's good for control flow but it doesn't capture the execution stack, which is useful for debugging. Twisted is such a maze of twisty little functions, all different, that the lack of a traceback makes it hard for me to debug or even understand the source of errors. Allegra's shallower stack and lack of framework generality made it much easier for me to see what actually does occur vs. seeing all of the alternatives which are not precluded from occurring.
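The workaround I used, printing the stack from inside the error class, generalizes to recording the stack when the error object is constructed. Here is a minimal sketch using only the standard traceback module. TracedError, low_level_connect and event_loop are hypothetical names, and nothing here is Twisted's API.

```python
import traceback

class TracedError(Exception):
    """Hypothetical error class that remembers where it was constructed."""
    def __init__(self, *args):
        Exception.__init__(self, *args)
        # format_stack() returns the current call stack as a list of
        # strings; the last entry is this __init__ frame, so drop it.
        self.creation_stack = traceback.format_stack()[:-1]

def low_level_connect():
    # Built and returned rather than raised, the way an error object is
    # handed to an errback, so it carries no ordinary traceback.
    return TracedError(61, "Connection refused")

def event_loop():
    return low_level_connect()

err = event_loop()
print("".join(err.creation_stack))
```

The recorded stack shows event_loop calling low_level_connect even though nothing was ever raised. The cost is a stack walk on every construction, which may be why a framework would not want it on by default.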

Allegra and the GPL

I don't like the GPL. My clients include pharmaceutical companies doing drug research. I work for computational chemists. Some develop methods to model how a chemical compound works in people and (hopefully) identify ones which might be good drug leads. These models may take weeks and months to develop.

My clients do not (usually) sell software. They are consumers of software. But they do buy and sell companies. Here's an example based somewhat on a group I worked with years ago.

Biotech X developed some interesting new technology so Pharma A decided to buy X. The employees of the intermingled company started working together, and some of the people from A started using software developed by people from X.

Suppose one of those is a web server which combines a GPL'ed Python library and a set of chemical prediction models. Section 2 of the GPL (version 2) says that the prediction models need not be put under the GPL because they are "reasonably considered independent and separate works in themselves." Does the purchase of X by A count as a distribution of that web server under the GPL? Why or why not?

Suppose now Pharma A sells part of what was Biotech X to Pharma B. There are now people at A and B using the software. Assume that the agreement of sale says that both sides can use the chemistry software. Does the inclusion of a GPL'ed Python library in the web server affect things? Will Pharma B's use of the chemistry software be under the GPL?

Some research code circulates for decades. I have used code written in the 1960s. Rarely is the provenance of such code well tracked. I've seen academic software include parts of source-available commercial software without attribution. (It was a hand translation from Fortran to C and the variable names and ordering of operations were identical.) There is the potential that a company could, through mergers and acquisitions, discover that most of its research software has become GPL'ed and that such a discovery may prevent the company from being bought in the future.

I don't know if the above is a valid legal consequence. I do know that it's something I am obligated to mention to my clients should I ever want to use a GPL licensed library. I have used GPL-based systems but these are through binary executable interfaces or web services where the GPL v2 does not apply. The GPL code and the chemistry code never commingle in the same process space.

I bring this up because Laurent said:

When people start to dismiss a library because of its licence, it's a sure sign that they don't have much else to say about its sources.

Anyway, let's make that licencing issue clear.

There are three ways to go with the GNU Public Licence:

  1. If you want to write free software for a greater good using Allegra sources, the GPL will suite your needs perfectly. That's the GNU way.
  2. If you want to make a buck installing or distributing Allegra's applications, you're free to do so as long as you comply with the GPL. That's the Linux distro way.
  3. If you want to use Allegra sources to write commercial applications to make a profit, buy a commercial licence. That's the MySQL way.

He's right. My clients won't care much about the technical superiority of Allegra for a given task over Twisted. They'll rightly do risk management and probably say 1) it's too expensive to consult our lawyers over this and/or 2) the likely benefits aren't worth the uncertain costs.

#1 is right out. Pharmas are developing drugs "for the greater good" and there's nothing other than faith which says that the GPL leads to at least as good results as existing practices. There's nothing the other way either, but it's a wager few want to take. (As companies pharmas have some nasty practices. The people doing drug development want to cure diseases, understand how organisms work, do good science and get paid well for their hard and honest work.)

#2 is not going to happen. There's not enough software to make a distro worthwhile. With perhaps 10,000 purchasers world-wide that's a sale price of about $300 to keep someone employed at a decent salary for several years. There's only a small body of existing free software upon which to base such a distribution and it's unlikely that such a distribution will have an incremental advantage worth the cost.

#3? This is the most likely. It worked for Sam Rushing, and the Bobo folks (now Zope) finally paid him to donate the code to Python. What's the price? There's no sales link on the Allegra page, no documentation, no estimate of the prices. The latter being just like MySQL - "Call our sales teams and we'll work with you to figure out how much you can pay .. err, figure out the best deal for your unique requirements."

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB