/home/writings/diary/archive/2006/08/22/pyprotocols_for_output_generation

PyProtocols for output generation

I've been doing some experimentation with PyProtocols.

A few years ago I wrote PyRSS2Gen. Most of my other modules are rather domain-specific, so by comparison it has proved to be a popular download, especially when scaled by development time.

PyRSS2Gen uses a 2-step process for publishing a feed. Make a Python data structure (directly based on the RSS 2.0 data model) then serialize it to the right format.

I believe a data container should know nothing about I/O. The logic for loading and saving should be done elsewhere. My primary reason is to simplify support for alternate formats. That's not a strong argument for PyRSS2Gen because it's designed to emit a single format. Even worse, when done as an external function the output generation logic ends up looking like the data structure, including switch statements to handle polymorphic children correctly.
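
To make that drawback concrete, here is a minimal sketch of the external-function style. The Feed, TextItem, and ImageItem classes and the write_feed function are hypothetical names made up for this illustration; none of them come from PyRSS2Gen, and the sketch assumes only the Python 3 standard library.

```python
from xml.sax.saxutils import escape

# Hypothetical data containers; they know nothing about I/O.
class TextItem(object):
    def __init__(self, title):
        self.title = title

class ImageItem(object):
    def __init__(self, url):
        self.url = url

class Feed(object):
    def __init__(self, title, items):
        self.title = title
        self.items = items

def write_feed(feed):
    # The writer mirrors the shape of the data structure ...
    parts = ["<feed><title>%s</title>" % escape(feed.title)]
    for item in feed.items:
        # ... and needs a type switch for every kind of child.
        if isinstance(item, TextItem):
            parts.append("<item><title>%s</title></item>" % escape(item.title))
        elif isinstance(item, ImageItem):
            parts.append("<image><url>%s</url></image>" % escape(item.url))
        else:
            raise TypeError("don't know how to write %r" % (item,))
    parts.append("</feed>")
    return "".join(parts)

xml_text = write_feed(
    Feed("demo", [TextItem("a & b"), ImageItem("http://example.com/x.png")]))
```

Supporting a third kind of item means editing write_feed itself, which is the coupling described above.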

Hence PyRSS2Gen does not use an external function for output generation. Each class implements a "publish" method which serializes self through a SAX 2 handler. Adding a new data type for an existing slot is simple. Make sure it implements the publish method. If it's a new slot (an extension to the spec) then you'll also have to modify the parent so it knows to call the new child. I faked it in PyRSS2Gen by using a "publish_extensions" method hook.
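
A stripped-down sketch of that method-based style might look like the following. It is not the actual PyRSS2Gen source: the Channel and Item classes are simplified stand-ins invented for this example, and the code assumes the Python 3 standard library.

```python
import io
from xml.sax.saxutils import XMLGenerator

class Item(object):
    def __init__(self, title):
        self.title = title

    def publish(self, handler):
        # Each object serializes itself through the SAX handler.
        handler.startElement("item", {})
        handler.startElement("title", {})
        handler.characters(self.title)
        handler.endElement("title")
        handler.endElement("item")

class Channel(object):
    def __init__(self, title, items):
        self.title = title
        self.items = items

    def publish(self, handler):
        handler.startElement("channel", {})
        handler.startElement("title", {})
        handler.characters(self.title)
        handler.endElement("title")
        for item in self.items:
            # Any object with a publish() method can fill this slot.
            item.publish(handler)
        # A new slot needs help from the parent, hence the hook.
        self.publish_extensions(handler)
        handler.endElement("channel")

    def publish_extensions(self, handler):
        # Does nothing by default; subclasses override it to emit
        # elements that are not part of the basic structure.
        pass

out = io.StringIO()
handler = XMLGenerator(out, "utf-8")
handler.startDocument()
Channel("demo", [Item("first")]).publish(handler)
handler.endDocument()
result = out.getvalue()
```

The type switch is gone, but the output format is now baked into the data classes.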

Using PyProtocols I can go back to my preferred style. Because RSS2 is complicated I'll use a subset for my example.

class RSS2(object):
    rss_attrs = {"version": "2.0"}
    element_attrs = {}
    def __init__(self, title, link, items):
        self.title = title
        self.link = link
        self.items = items

class RSSItem(object):
    def __init__(self, title, link, description):
        self.title = title
        self.link = link
        self.description = description

rss = RSS2(
    title = "Andrew's PyRSS2Gen feed",
    link = "http://www.dalkescientific.com/Python/PyRSS2Gen.html",
    items = [
      RSSItem(
         title = "PyRSS2Gen-0.0 released",
         link = "http://www.dalkescientific.com/news/030906-PyRSS2Gen.html",
         description = "Dalke Scientific today announced PyRSS2Gen-0.0, "
                       "a library for generating RSS feeds for Python."),
      RSSItem(
         title = "Thoughts on RSS feeds for bioinformatics",
         link = "http://www.dalkescientific.com/writings/diary/"
                "archive/2003/09/06/RSS.html",
         description = "One of the reasons I wrote PyRSS2Gen was to "
                       "experiment with RSS for data collection in "
                       "bioinformatics.  Last year I came across..."),
      ])

I'm going to convert the data structure into an ElementTree rather than use xml.sax.saxutils.XMLGenerator. What I'll do is adapt nodes to a new "IElement" interface and make a simple way to construct ElementTree nodes which implement that interface. The leading underscore in "_children" suggests that setting it directly like this is a no-no, and I should probably use append instead, but this is quick hack code.

import sys
from elementtree import ElementTree as etree

import protocols
from protocols import Interface, advise, adapt

class IElement(Interface):
    pass

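class SimpleElement(etree._ElementInterface):
    advise(instancesProvide=[IElement])

    # Default to None rather than {} and [] so that instances never
    # share a mutable default argument (FAQ 1.4.22).
    def __init__(self, tag, attrib=None, text=None, tail=None, children=None):
        if attrib is None:
            attrib = {}
        if children is None:
            children = []
        self.tag = tag
        self.attrib = attrib
        self.text = text
        self.tail = tail  # previously accepted but never stored
        self._children = children
    def __repr__(self):
        return "SimpleElement()"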

Here are converter functions from RSS2 and RSSItem to IElement.

def rss2_to_element(rss):
    channel = SimpleElement(
        "channel", rss.element_attrs,
        children = [
             SimpleElement("title", text=rss.title),
             SimpleElement("link", text=rss.link)] + 
             [adapt(item, IElement) for item in rss.items]
        )
    return SimpleElement("rss", rss.rss_attrs, children=[channel])

def rssitem_to_element(item):
    return SimpleElement("item", children = [
        SimpleElement("title", text=item.title),
        SimpleElement("link", text=item.link),
        SimpleElement("description", text=item.description)])

I'll register these factory functions with PyProtocols

protocols.declareAdapter(
    rss2_to_element, provides=[IElement], forTypes=[RSS2])
protocols.declareAdapter(
    rssitem_to_element, provides=[IElement], forTypes=[RSSItem])

and create a function to write an RSS data structure to a file.

def write_rss2(rss, out, encoding="us-ascii"):
    root = adapt(rss, IElement)
    tree = etree.ElementTree(root)
    tree.write(out, encoding)

When I use it

write_rss2(rss, sys.stdout, "utf-8")

I get (reformatted for readability; the actual output has no line breaks or indentation)

<rss version="2.0"><channel>
  <title>Andrew's PyRSS2Gen feed</title>
  <link>http://www.dalkescientific.com/Python/PyRSS2Gen.html</link>
  <item>
   <title>PyRSS2Gen-0.0 released</title>
   <link>http://www.dalkescientific.com/news/030906-PyRSS2Gen.html</link>
   <description>Dalke Scientific today announced PyRSS2Gen-0.0, a library 
for generating RSS feeds for Python.</description>
  </item><item>
    <title>Thoughts on RSS feeds for bioinformatics</title>
     <link>http://www.dalkescientific.com/writings/diary/archive/2003/09/06/RSS.html</link>
    <description>One of the reasons I wrote PyRSS2Gen was to experiment with 
RSS for data collection in bioinformatics.  Last year I came across...</description>
  </item>
 </channel>
</rss>
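
For readers who don't have PyProtocols at hand, the registration and lookup steps above can be approximated with a plain dictionary. This is only a rough sketch of the idea and not how PyProtocols is implemented: declare_adapter and simple_adapt are hypothetical stand-ins, the standard library's xml.etree Element class plays the role of the IElement interface, and the real adapt does considerably more than this. The code assumes Python 3.

```python
import xml.etree.ElementTree as ET

# Maps (type, protocol) -> factory function. Hypothetical registry,
# standing in for what protocols.declareAdapter records.
_adapters = {}

def declare_adapter(factory, provides, for_type):
    _adapters[(for_type, provides)] = factory

def simple_adapt(obj, protocol):
    # Objects that already provide the protocol come back unchanged.
    if isinstance(obj, protocol):
        return obj
    # Otherwise look for a factory registered for the object's type
    # or one of its base classes.
    for klass in type(obj).__mro__:
        factory = _adapters.get((klass, protocol))
        if factory is not None:
            return factory(obj)
    raise TypeError("cannot adapt %r to %s" % (obj, protocol.__name__))

class Item(object):
    def __init__(self, title):
        self.title = title

def item_to_element(item):
    elem = ET.Element("item")
    ET.SubElement(elem, "title").text = item.title
    return elem

declare_adapter(item_to_element, provides=ET.Element, for_type=Item)

elem = simple_adapt(Item("hello"), ET.Element)
xml_text = ET.tostring(elem, encoding="unicode")
```

The Item class never mentions XML; the only link between it and its output format is the registration call.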

I don't need to convert the whole structure in one go. With a little help from the ElementTree output conversion I can adapt elements only when needed:

class ElementTreeUsingAdapt(etree.ElementTree):
    def __init__(self, element=None, file=None):
        element = adapt(element, IElement)
        etree.ElementTree.__init__(self, element, file)

    def _write(self, file, node, encoding, namespaces):
        node = adapt(node, IElement)
        etree.ElementTree._write(self, file, node, encoding, namespaces)
        
def write_rss2(rss, out, encoding="us-ascii"):
    tree = ElementTreeUsingAdapt(rss)
    tree.write(out, encoding)

and change my rss2_to_element to support lazy conversion of the children

def rss2_to_element(rss):
    channel = SimpleElement(
        "channel", rss.element_attrs,
        children = [
             SimpleElement("title", text=rss.title),
             SimpleElement("link", text=rss.link)] + 
             rss.items  # defer conversion until later
        )
    return SimpleElement("rss", rss.rss_attrs, children=[channel])
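
The same deferred conversion can be shown without ElementTree internals. The following is a sketch under stated assumptions, not the code above: Node is a hypothetical minimal element class, to_node is a hand-written stand-in for adapt(obj, IElement), and the converted list exists only to show when each conversion happens. It assumes Python 3.

```python
import io
from xml.sax.saxutils import escape

class Node(object):
    """Tiny element stand-in: a tag, optional text, and children."""
    def __init__(self, tag, text=None, children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)

class Item(object):
    def __init__(self, title):
        self.title = title

converted = []  # records which items have been converted so far

def to_node(obj):
    # Hand-written stand-in for adapt(obj, IElement).
    if isinstance(obj, Node):
        return obj
    if isinstance(obj, Item):
        converted.append(obj.title)
        return Node("item", children=[Node("title", text=obj.title)])
    raise TypeError("cannot convert %r" % (obj,))

def write(out, obj):
    # Conversion happens here, one node at a time, as the writer
    # reaches each child.
    node = to_node(obj)
    out.write("<%s>" % node.tag)
    if node.text:
        out.write(escape(node.text))
    for child in node.children:
        write(out, child)
    out.write("</%s>" % node.tag)

# The children list mixes ready-made nodes with unconverted items.
channel = Node("channel", children=[
    Node("title", text="demo"), Item("first"), Item("second")])
converted_before_write = list(converted)

out = io.StringIO()
write(out, channel)
lazy_xml = out.getvalue()
```

Nothing is converted when the tree is built; each Item is turned into a Node only when the writer visits it.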

I don't need to use ElementTree to convert. I could use the same XMLGenerator-based approach as before. I'll adapt the RSS2 structure to objects which implement the "IRSS2Publisher" interface.

import xml.sax.saxutils

class IRSS2Publisher(Interface):
    def publish(self, handler):
        pass

def _simple_element(handler, tag, value):
    handler.startElement(tag, {})
    handler.characters(value)
    handler.endElement(tag)

class PublishRSS2(object):
    advise(instancesProvide=[IRSS2Publisher],
           asAdapterForTypes=[RSS2])
    def __init__(self, rss2):
        self.rss2 = rss2
    def publish(self, handler):
        handler.startElement("rss", self.rss2.rss_attrs)
        handler.startElement("channel", self.rss2.element_attrs)
        _simple_element(handler, "title", self.rss2.title)
        _simple_element(handler, "link", self.rss2.link)
        for item in self.rss2.items:
            item = adapt(item, IRSS2Publisher)
            item.publish(handler)
        handler.endElement("channel")
        handler.endElement("rss")

class PublishRSSItem(object):
    advise(instancesProvide=[IRSS2Publisher],
           asAdapterForTypes=[RSSItem])
    def __init__(self, item):
        self.item = item
    def publish(self, handler):
        handler.startElement("item", {})
        _simple_element(handler, "title", self.item.title)
        _simple_element(handler, "link", self.item.link)
        _simple_element(handler, "description", self.item.description)
        handler.endElement("item")
        

def write_rss2(rss, out, encoding="us-ascii"):
    handler = xml.sax.saxutils.XMLGenerator(out, encoding)
    handler.startDocument()
    adapt(rss, IRSS2Publisher).publish(handler)
    handler.endDocument()

write_rss2(rss, sys.stdout, "utf-8")

If you compare this code to the existing PyRSS2Gen code you'll see it's almost identical. The differences are an extra indirection to get from the adapter object to the actual data structure, adapting the children as I process the node, and of course the scaffolding to make it work with PyProtocols.

There are a few advantages. RSS2 output is no longer special. I can implement Atom generation on the same data structure and anyone can trivially strip out RSS2 generation support without needing to modify the data classes. (Warning: I don't know the Atom spec well enough; the following is a sketch of the solution and is incomplete and untested.)

class IAtomPublisher(Interface):
    def publish(self, handler):
        pass

def _link(handler, href):
    handler.startElement("link", {"href": href})
    handler.endElement("link")

class AtomRSS2(object):
    advise(instancesProvide=[IAtomPublisher],
           asAdapterForTypes=[RSS2])
    def __init__(self, rss2):
        self.rss2 = rss2
    def publish(self, handler):
        # I should use the NS versions, I think
        handler.startElement("feed", {"xmlns": "http://www.w3.org/2005/Atom"})
        _simple_element(handler, "title", self.rss2.title)
        _link(handler, self.rss2.link)
        for item in self.rss2.items:
            entry = adapt(item, IAtomPublisher)
            entry.publish(handler)
        handler.endElement("feed")

class AtomRSSItem(object):
    advise(instancesProvide=[IAtomPublisher],
           asAdapterForTypes=[RSSItem])
    def __init__(self, item):
        self.item = item
    def publish(self, handler):
        handler.startElement("entry", {})
        _simple_element(handler, "title", self.item.title)
        _link(handler, self.item.link)
        _simple_element(handler, "summary", self.item.description)
        handler.endElement("entry")

def write_atom(rss, out, encoding="us-ascii"):
    handler = xml.sax.saxutils.XMLGenerator(out, encoding)
    handler.startDocument()
    adapt(rss, IAtomPublisher).publish(handler)
    handler.endDocument()

write_atom(rss, sys.stdout, "utf-8")

I think I can implement extensions in new slots by also doing an adapt for IRSSExtensions (to get the list of extension objects) then doing an adapt for IRSS2Publisher or IAtomPublisher to emit the correct XML. In that way I don't have this cumbersome "publish_extensions" hook.
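
Here is one way that idea might be sketched. It is untested against PyProtocols and swaps in functools.singledispatch from the Python 3 standard library as a stand-in for adapt: extensions_for plays the role of adapting to IRSSExtensions and publish_rss2 the role of adapting to IRSS2Publisher. The Channel, GeoChannel, and GeoPoint classes and the "point" element are hypothetical, and a real extension element would live in its own XML namespace.

```python
import io
from functools import singledispatch
from xml.sax.saxutils import XMLGenerator

class Channel(object):
    def __init__(self, title):
        self.title = title

class GeoPoint(object):
    """Hypothetical extension data that Channel knows nothing about."""
    def __init__(self, lat, lon):
        self.lat = lat
        self.lon = lon

class GeoChannel(Channel):
    def __init__(self, title, location):
        Channel.__init__(self, title)
        self.location = location

# Stand-in for adapt(obj, IRSSExtensions): list the extension objects.
@singledispatch
def extensions_for(obj):
    return []

@extensions_for.register(GeoChannel)
def _(channel):
    return [channel.location]

# Stand-in for adapt(obj, IRSS2Publisher).publish(handler).
@singledispatch
def publish_rss2(obj, handler):
    raise TypeError("no RSS2 publisher for %r" % (obj,))

def _simple_element(handler, tag, value):
    handler.startElement(tag, {})
    handler.characters(value)
    handler.endElement(tag)

@publish_rss2.register(Channel)
def _(channel, handler):
    handler.startElement("channel", {})
    _simple_element(handler, "title", channel.title)
    # The parent asks for extensions instead of hard-coding them.
    for ext in extensions_for(channel):
        publish_rss2(ext, handler)
    handler.endElement("channel")

@publish_rss2.register(GeoPoint)
def _(point, handler):
    _simple_element(handler, "point", "%s %s" % (point.lat, point.lon))

def write_channel(channel):
    out = io.StringIO()
    handler = XMLGenerator(out, "utf-8")
    handler.startDocument()
    publish_rss2(channel, handler)
    handler.endDocument()
    return out.getvalue()

plain_xml = write_channel(Channel("demo"))
ext_xml = write_channel(GeoChannel("demo", GeoPoint("57.70", "11.97")))
```

The Channel publisher is written once and never needs to change when a new extension type is registered, which is the point of dropping the hook.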

I really like this approach.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB