Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2006/11/06/iterparse_filter

IterParseFilter

IterParseFilter

XPath-like filtering of ElementTree's iterparse event stream
  1. Background
  2. SAX processing
  3. Using Stackless to convert callbacks into an event stream
  4. ElementTree's "iterparse" hybrid event stream parser
  5. Using XPath selectors
  6. The query syntax
  7. Callback-oriented XPath-filtering parsing
  8. Listening to events
  9. Implementation
  10. The Code!
  11. Bonus - tag structure of an XML document

Background

Here's a nice little XML document.

<document>
 <title>People working for me</title>
 <person xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <name>Jim</name>
   <title>Porch Light Switcher</title>
   <foaf:homepage rdf:resource="http://example.com/his_page" />
 </person><person>
   <name>Joe</name>
   <title>Bottle Washer, 3rd class</title>
   <nick xmlns="http://xmlns.com/foaf/0.1/">Joe-Joe</nick>
 </person>
</document>
I'll make it be a StringIO object so I can handle it as an in-memory file:
from cStringIO import StringIO

f = StringIO('''<document>
 <title>People working for me</title>
 <person xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <name>Jim</name>
   <title>Porch Light Switcher</title>
   <foaf:homepage rdf:resource="http://example.com/his_page" />
 </person><person>
   <name>Joe</name>
   <title>Bottle Washer, 3rd class</title>
   <nick xmlns="http://xmlns.com/foaf/0.1/">Joe-Joe</nick>
 </person>
</document>
''')

I want to get the titles from the document. There are two types of titles; the title for the document and the job title. It would be nice if the two elements had different names but such is life. The easiest standard way to do this in Python 2.5 is with the built-in ElementTree module:

>>> import xml.etree.ElementTree 
>>> from xml.etree import ElementTree as ET
>>> tree = ET.parse(f)
>>> tree.findall("person/title")
[<Element title at 1094b20>, <Element title at 1094c10>]
>>> for title in tree.findall("person/title"):
...   print repr(title.text)
... 
'Porch Light Switcher'
'Bottle Washer, 3rd class'
>>> 

This is fine for small data. What if I want to parse a huge document? In this case, suppose I had 100,000 people working for me, each with a set of standardized fields of 10K each. That's not realistic. A better example is where I had to parse bibliographic records from a large (140MB) PubMed file. I expect most people reading this have needed to parse large data-oriented XML files. (By "data-oriented" I mean files containing many records with very similar internal structures.)

SAX processing

Traditionally there are two ways to process XML: SAX-style, with callbacks for each event, and DOM-style, building a single data structure representing the entire XML file. These are not the only ways and I'll discuss some hybrid approaches in a bit. ElementTree is DOM-style and reads everything into memory. If there's a large file then the traditional answer is to be SAX oriented. Here's the above code rewritten for SAX.

>>> from xml.sax import handler  
>>> class ShowPeopleTitle(handler.ContentHandler):
...   def startDocument(self):
...     self.where = []
...     self.in_person_title = False
...   def startElement(self, tag, attrs):
...     self.where.append(tag)
...     if self.where[-2:] == ["person", "title"]:
...       self.in_person_title = True
...       self.text = ""  
...   def characters(self, s):
...       if self.in_person_title:
...         self.text = self.text + s
...   def endElement(self, tag):       
...     if self.where[-2:] == ["person", "title"]:
...       print repr(self.text)
...       del self.text
...       self.in_person_title = False
...     del self.where[-1]
... 
>>> import xml.sax
>>> show = ShowPeopleTitle()
>>> f.seek(0)
>>> xml.sax.parse(f, show)
u'Porch Light Switcher'
u'Bottle Washer, 3rd class'
>>> 
A scary thing is I wrote the ShowPeopleTitle without rewrite. I got it right on the first attempt. I have a parser-generator named Martel which generated SAX events. I spent a lot of time writing callback handlers for it.

Most people don't like callbacks. They are harder to think about and it feels like your software is no longer in control of things. I think that's one reason why people feel uncomfortable with Twisted. There are some use problems with how SAX does callbacks. You can have only one handler attached to the processor, which makes it tricker to have specialized for different parts of the document. (You have to implement an intermediate dispatcher.) If you have multiple handlers then each has to listen to and ignore essentially all of the events, causing a rather large performance hit.

Using Stackless to convert callbacks into an event stream

SAX is a special case of "event programming". Each XML construct generates an event. "<person>" generates a "startElement" event with arguments ("person", {}), where the {} is the dictionary of element attributes, the close tag "</person>" generates an "endElement" event with arguments ("person",), etc.

These events can be stream oriented instead. By "stream" I mean that it works like a Python iterator; you can do a for-loop over it. There is a trivial way to convert any callback-based system into an event stream. In standard Python you have to use threads, which are a bit clumsy and don't scale well if you have more than a hundred or so. Instead, I'll show what it's like with Stackless Python, which add some very cool features to Python. The general concepts are the same in this case between a thread solution and Stackless'.

#!/usr/local/bin/spython
# This uses Stackless Python
import stackless
import sys

import xml.sax
from xml.sax import handler
from cStringIO import StringIO

# Parse using the normal SAX parser.
# Catch any errors and forward those to the event stream
def parse(f, handler, event_channel):
    try:
        xml.sax.parse(f, handler)
    except:
        event_channel.send( ('error', sys.exc_info()) )

# Forward parse events through the event channel
class StacklessHandler(handler.ContentHandler):
    def __init__(self, event_channel):
        self.event_channel = event_channel
    def startDocument(self):
        self.event_channel.send( ('startDocument', None) )
    def endDocument(self):
        self.event_channel.send( ('endDocument', None) )
    def startElement(self, tag, attrs):
        self.event_channel.send( ('start', (tag, attrs)) )
    def endElement(self, tag):
        self.event_channel.send( ('end', tag) )
    def characters(self, text):
        self.event_channel.send( ('characters', text) )
    

def itersax(f):
    # Create the handler along with a channel used to return events
    event_channel = stackless.channel()
    stackless_handler = StacklessHandler(event_channel)
    
    # Start the parsing in another tasklet
    stackless.tasklet(parse)(f, stackless_handler, event_channel)
    while 1:
        x = event_channel.receive()
        if x[0] == "error":
            raise x[1][0], x[1][1], x[1][2]
        yield x
        if x[0] == "endDocument":
            break


f = StringIO('''<document>
 <title>People working for me</title>
 <person xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <name>Jim</name>
   <title>Porch Light Switcher</title>
   <foaf:homepage rdf:resource="http://example.com/his_page" />
 </person><person>
   <name>Joe</name>
   <title>Bottle Washer, 3rd class</title>
   <nick xmlns="http://xmlns.com/foaf/0.1/">Joe-Joe</nick>
 </person>
</document>
''')

def main():
    where = []
    in_person_title = False

    # loop over the SAX event stream
    for (event, args) in itersax(f):
        if event == "start":
            where.append(args[0])
            if where[-2:] == ["person", "title"]:
                in_person_title = True
                text = ""
        elif event == "characters":
            if in_person_title:
                text = text + args
        elif event == "end":
            if where[-2:] == ["person", "title"]:
                print repr(text)
                text = None
                in_person_title = False
            del where[-1]

if __name__ == "__main__":
    stackless.schedule(main)()
    stackless.run()
As you can see, the handler code is essentially the same. It still acts on events but in this case the user code controls the main loop, or appears to do so. There's no need to store state through class variables because you can use local variables. Incidentally, local variables are much faster to access in Python than instance variables.

ElementTree's "iterparse" hybrid event stream parser

There's are hybrid approaches, and ElementTree includes a nice one. It uses a feed-oriented SAX parser to create the ElementTree. While it's creating the tree it can pass parts of the tree back to the caller, as an event stream.

This is more easily seen than explained. In the following I've asked iterparse to generate "start" and "end" elements. The default only generates "end" elements:

>>> from xml.etree import ElementTree as ET
>>> f.seek(0)
>>> for (event, ele) in ET.iterparse(f, ("start", "end")):
...   print repr(event), repr(ele.tag), repr(ele.text)
... 
'start' 'document' '\n '
'start' 'title' 'People working for me'
'end' 'title' 'People working for me'
'start' 'person' '\n   '
'start' 'name' 'Jim'
'end' 'name' 'Jim'
'start' 'title' 'Porch Light Switcher'
'end' 'title' 'Porch Light Switcher'
'start' '{http://xmlns.com/foaf/0.1/}homepage' None
'end' '{http://xmlns.com/foaf/0.1/}homepage' None
'end' 'person' '\n   '
'start' 'person' '\n   '
'start' 'name' 'Joe'
'end' 'name' 'Joe'
'start' 'title' 'Bottle Washer, 3rd class'
'end' 'title' 'Bottle Washer, 3rd class'
'start' '{http://xmlns.com/foaf/0.1/}nick' 'Joe-Joe'
'end' '{http://xmlns.com/foaf/0.1/}nick' 'Joe-Joe'
'end' 'person' '\n   '
'end' 'document' '\n '
>>> 
It's roughly the same as my Stackless-based parser, except that it returns ElementTree nodes instead of SAX event data. I was surprised to see that the 'start' element already contains the text after the start tag. This means it's already process the following character events up to the next element's start.

Here's how to get the text for the each person's title using iterparse

>>> f.seek(0)
>>> where = []
>>> for (event, ele) in ET.iterparse(f, ("start", "end")):
...   if event == "start":
...     where.append(ele.tag)
...   elif event == "end":
...     if where[-2:] == ["person", "title"]:
...       print repr(ele.text)
...     del where[-1]
... 
'Porch Light Switcher'
'Bottle Washer, 3rd class'
>>> 
This code is simpler than that previous cases because ElementTree is iterating over the tree while the tree is being built. The 'end' element will contain all the information about the tree underneath it. For example,
>>> f.seek(0)
>>> for (event, ele) in ET.iterparse(f):
...   # only asked for 'end' events, and there is only one type of 'person' tag
...   if ele.tag == 'person':
...     print repr(ele.find("name").text), "is a", repr(ele.find("title").text)
...     ele.clear()
... 
'Jim' is a 'Porch Light Switcher'
'Joe' is a 'Bottle Washer, 3rd class'
>>> 
See the ele.clear() in this example? That's the ElementTree method to remove all children from the element; to prune the tree while processing it. It's very handy when parsing record-oriented documents because in most cases once you've processed the record you don't need it any more and likely want to discard it so it doesn't take up memory.

By the way, the event-based methods (both stream and callback) are also nice because processing can occur while fetching data. For example, you can start processing the response from a URL request as the data comes in rather than waiting until the response is fully received and processed.

Using XPath selectors

None of the event-based schemes were as simple as using the ElementTree's "findall" method. In most examples I had to track the element stack myself to know that the "title" was under "person" and not under "document".

Years ago I heard about a streaming XML protocol which allowed XPath-based filters, just like "findall" takes an XPath. I decided to implement something like that myself. I've given the module the ever-so-graceful name 'iterparse_filter'. Use an IterParseFilter to define which fields you are interested in (via a very limited subset of XPath). In this case "person/title"

>>> f.seek(0)
>>> import iterparse_filter
>>> filter = iterparse_filter.IterParseFilter()
>>> filter.iter_end("person/title")
>>> for (event, ele) in filter.iterparse(f):
...   print repr(ele.text)
... 
'Porch Light Switcher'
'Bottle Washer, 3rd class'
>>> 
If I wanted all foaf elements I can filter on that namespace:
>>> f.seek(0)
>>> filter = iterparse_filter.IterParseFilter(
...    namespaces = {"foaf": "http://xmlns.com/foaf/0.1/"} )
>>> filter.iter_end("foaf:*")
>>> for (event, ele) in filter.iterparse(f):
...   print repr(ele.tag)
... 
'{http://xmlns.com/foaf/0.1/}homepage'
'{http://xmlns.com/foaf/0.1/}nick'
>>> 
(I also support Clark notation so I could have written the previous xpath as:
filter.iter_end("{http://xmlns.com/foaf/0.1/}*")
)

I can set up multiple filters

>>> f.seek(0)
>>> filter = iterparse_filter.IterParseFilter()
>>> filter.iter_end("/document/title")
>>> filter.iter_end("/document/person")
>>> for (event, ele) in filter.iterparse(f):
...   if ele.tag == "title": 
...     print "the title is", repr(ele.text)
...   else:
...     print "The person's name is", repr(ele.find("name").text)
... 
the title is 'People working for me'
The person's name is 'Jim'
The person's name is 'Joe'
>>> 
and have filters allowing the start element events:
>>> f.seek(0)
>>> filter = iterparse_filter.IterParseFilter()
>>> filter.iter_start("/document/person")
>>> filter.iter_end("person/title")  
>>> for (event, ele) in filter.iterparse(f):
...   if event == "start":  # can only be a person
...     print "Person:"
...   else:
...     print "  ", repr(ele.text)  # must be the title
... 
Person:
   'Porch Light Switcher'
Person:
   'Bottle Washer, 3rd class'
>>> 

The query syntax

The supported query syntax is:

Pass in a 'namespaces' dictionary in the constructor to define allowed namespace prefixes. If the value is None then it does not define a namespace. Pass in a 'default_namespace' string to define a default namespace for elements which have no Clark notation namespace nor namespace prefix. If default_namespace == None (the default) then unqualified element names do not belong to a namespace. The query term "*" always matches any tag, even if a default_namespace is specified.

Callback-oriented XPath-filtering parsing

What I don't like about the this new code is the need to figure things out again. The filtering code did all the work to figure out that an element in a given location should be placed in the event stream only to have the body of the code figure things out again. It should be less complicated because of extra guarantees made by the filter for the event types and elements returned, but it's still a bit cumbersome.

I've instead been experimenting with a hybrid iterator/callback-handler combination. I can also register a callback handler based on event type and xpath location match. In the following I'll call the 'save_titles' function for every end event on an 'person/title' element:

>>> f.seek(0)
>>> filter = iterparse_filter.IterParseFilter()
>>> def save_titles(event, ele, state):
...     state.append(ele.text)
... 
>>> filter.on_end("person/title", save_titles)
>>> filter.handler_parse(f, titles)
>>> titles
['Porch Light Switcher', 'Bottle Washer, 3rd class']
>>> 
This uses a variant method named 'handler_parse' which takes an optional 'state' parameter. This is not used by the iterparse_filter code except to pass it to the callbacks. It should be used as a place to store information. In this case to store the list of titles. (There are other ways to solve this problem: pass in a local function which gets the list from its closure, or use a class method storing the data in the instance.)

Listening to events

Here is the list of listenable events:

The "start-default" and "end-default" events are meant for some validator work I'm doing. I can use it to find unexpected children nodes or to find elements in a reserved namespace.
>>> g = StringIO("""<list>
...   <item x='1' />
...   <item y='2' />
...   <strange/>
...   <item z='3' />
... </list>""")
>>> filter = iterparse_filter.IterParseFilter()
>>> def unknown_node(event, ele, state):
...     raise AssertionError("Unknown element %r" % (ele.tag,))
... 
>>> def ignore(event, ele, state):
...     pass
... 
>>> filter = iterparse_filter.IterParseFilter()
>>> filter.on_end("/list/item", ignore)
>>> filter.on_end_default("/list/*", unknown_node)
>>> filter.handler_parse(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "iterparse_filter.py", line 251, in handler_parse
    return self.create_fa().handler_parse(file, state)
  File "iterparse_filter.py", line 344, in handler_parse
    for x in self.parse(file, state):
  File "iterparse_filter.py", line 391, in parse
    end_handler(event, ele, state)
  File "<stdin>", line 2, in unknown_node
AssertionError: Unknown element 'strange'
>>>
though to report it nicely I need the position information, which I don't think I can get from ElementTree.

I can register multiple handlers for a given node. Here are the total number of 'title' elements and the total number of 'person/title' elements:

>>> f.seek(0)
>>> filter = iterparse_filter.IterParseFilter()
>>> class Counter(object):
...   def __init__(self): self.count = 0
...   def __call__(self, event, tag, state): self.count += 1
... 
>>> title_counter = Counter()
>>> job_title_counter = Counter()
>>> filter.on_start("title", title_counter) 
>>> filter.on_start("person/title", job_title_counter)
>>> filter.handler_parse(f)
>>> title_counter.count 
3
>>> job_title_counter.count
2
>>> 
Start handlers are called in order they are registered. If a handler H1 was registered before H2 then H1 will be called before H2. End handlers are called in reverse order, so H2 will be called before H1.

The 'iterparse' and 'handler_parse' methods use the same underlying 'parse' method which supports both styles of parsing. You can request iterator-style token stream and register callbacks. Though I'm not sure that's a useful feature.

Implementation

I started my making an NFA to evalute the XPath but didn't like the overhead in pure Python nor the amount of backtracking needed for something like //A//B. One of the first times I used XPath I tried something like that and the XPath engine took a long time to find what I thought was an easy answer. I posted to the list and was told my query was naive and didn't reflect knowledge of how XPath worked. I thought it was a perfectly sensible. I knew the going down this route would cause problems.

I then started to make a DFA. Fast, but potentially exponential space instead of exponential time. Pretty standard knowledge because it's the same problem with regular expressions. Indeed, in another project I had hacked a solution based on converting my XPath into a regexp and tag stack into a string and using Python's regular expression engine to do the test for me. But it was slow and not something I wanted to do for every step. I worked on the DFA solution for a bit and realized there was a bug in my code.

Looking around the web I came across a developerWorks article by Benoit Marchal about his "Handler Compiler". He ran into the same problem I did. His article gave some pointers for other tools.

The most helpful was the PPT presentation "From Searching Text to Querying XML Streams" by Dan Suciu. It was very clear and explained exactly the NFA vs. DFA tradeoff. Best of all it used the phrase "Compute the DFA Lazily" and explained why that works.

In short, data-oriented XML has a relatively simple structure. There may only be a few hundred unique stacks in a document. Each only needs to be tested once. (This would be more complicated if I supported more of XPath.) The overhead of my regular expression-based tester is amortized away, and it all becomes simple. Well, except for the actual code to convert from XPath and tag stacks into regular expression and strings, but it's not as hard as manipulting finite state machines.

I'm not sure I have the convertsion code correct. I don't have much experience with XPath outside of ElementTree and I kinda faked it. Let me know of any problems.

For future work, read "Optimizing The Lazy DFA Approach for XML Stream Processing" by Danny Chen and Raymond K. Wong and see xmltk: An XML Toolkit for Lightweight XML Stream Processing.

The Code!

I consider this code still rather experimental so I'm leaving it here in this writings page instead of making a dedicated page for it.

Download iterparse_filter.py

TODO:

Bonus - tag structure of an XML document

The parsing methods from IterParseFilter create an intermediate "FilterAutomata" class which does the actual processing. I split it into two classes because the former allows mutable data structures while the latter does not. It caches handler information in a local state machine (a dfa) and changing the xpath definitions may corrupt that information.

If you parse the same file format several times you might want to create and use the FilterAutomata yourself. This preserves the state machine making for much less initialization overhead after the first call. You can also use it to view the internal tag structure of the XML document. Here's one way:

import iterparse_filter

def _show_structure(dfa, depth):
    items = dfa.items()
    items.sort()
    for (k, v) in items:
        print "  "*depth + k+ ":"
        _show_structure(v[0], depth+1)

def show_structure(fa):
    dfa = fa.dfa
    _show_structure(dfa, 0)

filter = iterparse_filter.IterParseFilter()
fa = filter.create_fa()

filename = "/Users/dalke/nbn_courses/nbn_web/ecoli.xml"
fa.handler_parse(open(filename))

show_structure(fa)

The "ecoli.xml" file is a BLAST-XML document. Its internal structure is:

BlastOutput:
  BlastOutput_db:
  BlastOutput_iterations:
    Iteration:
      Iteration_hits:
        Hit:
          Hit_accession:
          Hit_def:
          Hit_hsps:
            Hsp:
              Hsp_align-len:
              Hsp_bit-score:
              Hsp_evalue:
              Hsp_gaps:
              Hsp_hit-frame:
              Hsp_hit-from:
              Hsp_hit-to:
              Hsp_hseq:
              Hsp_identity:
              Hsp_midline:
              Hsp_num:
              Hsp_positive:
              Hsp_qseq:
              Hsp_query-frame:
              Hsp_query-from:
              Hsp_query-to:
              Hsp_score:
          Hit_id:
          Hit_len:
          Hit_num:
      Iteration_iter-num:
      Iteration_query-ID:
      Iteration_query-def:
      Iteration_query-len:
      Iteration_stat:
        Statistics:
          Statistics_db-len:
          Statistics_db-num:
          Statistics_eff-space:
          Statistics_entropy:
          Statistics_hsp-len:
          Statistics_kappa:
          Statistics_lambda:
  BlastOutput_param:
    Parameters:
      Parameters_expect:
      Parameters_filter:
      Parameters_gap-extend:
      Parameters_gap-open:
      Parameters_matrix:
  BlastOutput_program:
  BlastOutput_query-ID:
  BlastOutput_query-def:
  BlastOutput_query-len:
  BlastOutput_reference:
  BlastOutput_version:


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB