Dalke Scientific Software: More science. Less time. Products
/home/writings/diary/archive/2014/09/23/etc_passwd_api_summary

Summary of the /etc/passwd reader API

Next month I will be in Finland for Pycon Finland 2014. It will be my first time in Finland. I'm especially looking forward to the pre-conference sauna on Sunday evening.

My presentation, "Iterators, Generators, and Coroutines", will cover much of the same ground as my earlier essay. In that essay, I walked through the steps to make an /etc/passwd parser API which can be used like this:

from passwd_reader import read_passwd_entries

with read_passwd_entries("/etc/passwd", errors="strict") as reader:
    location = reader.location
    for entry in reader:
        print("%d:%s" % (location.lineno, entry.name))

I think the previous essay was too detailed for the overall points to come through, so in this essay I'll summarize what I did and why I did it. Hopefully it will also help me prepare for Finland.

The /etc/passwd parser is built around a generator, which is a pretty standard practice. Another standard approach is to build a parser object, as a class instance which implements the iterator protocol. The main difference is that the generator uses local variables in the generator's execution frame where the class approach uses instance variables.
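To make that difference concrete, here is a minimal side-by-side sketch. It is not the code from the earlier essay: the names iter_names_generator and NamesIterator are made up for illustration, and the "parser" only pulls out the first colon-separated field.

```python
import io

def iter_names_generator(infile):
    # Generator version: `lineno` is a local variable that lives in the
    # generator's execution frame between calls to next().
    lineno = 0
    for line in infile:
        lineno += 1
        yield lineno, line.split(":")[0]

class NamesIterator:
    # Class version: the same state lives in instance variables.
    def __init__(self, infile):
        self._lines = iter(infile)
        self.lineno = 0

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self._lines)  # StopIteration propagates at end of input
        self.lineno += 1
        return self.lineno, line.split(":")[0]

# Made-up sample data in /etc/passwd format.
SAMPLE = (
    "root:*:0:0:System Administrator:/var/root:/bin/sh\n"
    "nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\n"
)

gen_result = list(iter_names_generator(io.StringIO(SAMPLE)))
cls_result = list(NamesIterator(io.StringIO(SAMPLE)))
```

Both produce the same (line number, name) pairs; the only thing that changes is where lineno is kept.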

Since the parser API can open a file, when the first parameter is a filename string instead of a file object, I want it to implement the context manager protocol and implement deterministic resource handling. If it always created a file then I could use contextlib.closing() or contextlib.contextmanager() to convert an iterator into a self-closing context manager, but my read_passwd_entries reader is polymorphic in that first parameter, so I can't use a standard solution.
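As a rough illustration of the "always creates a file" case, here is a sketch where contextlib.closing() is enough. The read_names function and the opened list are hypothetical, added only so the example can check that the file really was closed.

```python
import contextlib
import os
import tempfile

opened = []  # demo only: remembers the file objects the generator opened

def read_names(filename):
    # This reader always opens the file itself, so the generator owns it.
    infile = open(filename)
    opened.append(infile)
    try:
        for line in infile:
            yield line.split(":")[0]
    finally:
        # Runs when the generator is exhausted or when close() is called.
        infile.close()

# Write a small made-up passwd-style file to read back.
with tempfile.NamedTemporaryFile("w", suffix=".passwd", delete=False) as tmp:
    tmp.write("root:*:0:0:root:/root:/bin/sh\n")
    tmp.write("nobody:*:99:99:nobody:/:/bin/false\n")

with contextlib.closing(read_names(tmp.name)) as reader:
    first_name = next(reader)  # stop early; the generator is left suspended

file_was_closed = opened[0].closed
os.unlink(tmp.name)
```

closing() calls the generator's close() on exit, which triggers the finally clause. That is fine when the reader owns the file, but if the caller passed in an already-open file object, unconditionally closing it is presumably not what the caller wants, which is where the polymorphic first parameter gets in the way.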

I instead wrapped the generator inside of a PasswdReader which implements the appropriate __enter__ and __exit__ methods.
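A stripped-down sketch of that kind of wrapper might look like the following. This is not the PasswdReader from the earlier essay: the owns_file flag, the infile attribute, and the helper names _parse_names and read_passwd_names are assumptions made for this example.

```python
import io
import os
import tempfile

def _parse_names(infile):
    # Stand-in for the real parser generator: yields only the login name.
    for line in infile:
        yield line.split(":")[0]

class PasswdReader:
    def __init__(self, infile, owns_file):
        self.infile = infile
        self._owns_file = owns_file  # True only if this reader opened the file
        self._entries = _parse_names(infile)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._entries)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()  # returns None, so exceptions are not suppressed

    def close(self):
        self._entries.close()
        if self._owns_file:
            self.infile.close()

def read_passwd_names(source):
    # Polymorphic first parameter: a filename string or an open file object.
    if isinstance(source, str):
        return PasswdReader(open(source), owns_file=True)
    return PasswdReader(source, owns_file=False)

# Case 1: the caller supplies the file object, so it is left open afterwards.
caller_file = io.StringIO("root:*:0:0:root:/root:/bin/sh\n")
with read_passwd_names(caller_file) as reader:
    names_from_file = list(reader)

# Case 2: the reader opens the file by name, so it also closes it.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("root:*:0:0:root:/root:/bin/sh\n")
with read_passwd_names(tmp.name) as reader2:
    names_from_path = list(reader2)
os.unlink(tmp.name)
```

The wrapper only closes what it opened, which is the piece contextlib.closing() cannot decide on its own.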

I also want the parser to track location information about the current record; in this case the line number of the current record, but in general it could include the byte position or other information about the record's provenance. I store this in a Location instance accessed via the PasswdReader's "location" attribute.

The line number is stored as a local variable in the iterator's execution frame. While this could be accessed through the generator's ".gi_frame.f_locals", the documentation says that frames are internal types whose definitions may change with future versions of the interpreter. That doesn't sound like something I want to depend upon.
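For the curious, this is roughly what the frame-based access looks like under CPython. The count_lines generator is a made-up example, and the behavior shown is a CPython implementation detail, which is exactly why I don't want to rely on it.

```python
def count_lines(lines):
    lineno = 0
    for line in lines:
        lineno += 1
        yield line

gen = count_lines(["a\n", "b\n", "c\n"])
next(gen)
next(gen)

# Peek at the suspended generator's local variable through its frame.
peeked = gen.gi_frame.f_locals["lineno"]

# Once the generator finishes, CPython drops the frame entirely,
# so the last line number can no longer be read this way.
remaining = list(gen)
frame_after = gen.gi_frame
```

Besides depending on an internal type, this approach loses the final line number as soon as the generator is exhausted.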

Instead, I used an uncommon technique where the generator registers a callback function that the Location can use to get the line number. This function is defined inside of the generator's scope, so it can access the local variables. This isn't quite as simple as I would like, because exception handling in a generator, including the GeneratorExit raised inside the generator when its close() is called, is a bit tricky, but it does work.
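Here is a minimal sketch of the callback idea, leaving out the exception handling. The Location class, its register() method, and parse_names are simplified stand-ins written for this summary, not the code from the earlier essay.

```python
class Location:
    def __init__(self):
        self._get_lineno = None  # set by the generator once it starts running

    def register(self, get_lineno):
        self._get_lineno = get_lineno

    @property
    def lineno(self):
        # Two function calls per access: this property, then the callback.
        if self._get_lineno is None:
            return None
        return self._get_lineno()

def parse_names(lines, location):
    lineno = 0

    def get_lineno():
        # Closure over the generator's local variable.
        return lineno

    location.register(get_lineno)
    for line in lines:
        lineno += 1
        yield line.split(":")[0]

# Made-up sample records.
SAMPLE_LINES = [
    "root:*:0:0:root:/root:/bin/sh\n",
    "nobody:*:99:99:nobody:/:/bin/false\n",
]

location = Location()
seen = []
for name in parse_names(SAMPLE_LINES, location):
    seen.append((location.lineno, name))
```

The parsing loop itself never touches the Location; it just increments its own local variable. Note that in this sketch the callback is only registered once the generator starts running, so location.lineno is None before the first entry is read.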

The more common technique is to rewrite the generator as a class which implements the iterator protocol, where each instance stores its state information as instance variables. It's easy to access instance variables, but it's a different sort of tricky to read and write the state information at the start and end, respectively, of each iteration step.
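To show what I mean by reading and writing the state, here is a hedged sketch of the class-based style. PasswdNameIterator is a hypothetical name, and the rule of skipping comment and blank lines is an assumption added only so that there is more than one exit point to keep synchronized.

```python
class PasswdNameIterator:
    def __init__(self, lines):
        self._lines = iter(lines)
        self.lineno = 0

    def __iter__(self):
        return self

    def __next__(self):
        lineno = self.lineno  # restore the state at the start of the step
        for line in self._lines:
            lineno += 1
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            self.lineno = lineno  # save the state before returning a record
            return line.split(":")[0]
        self.lineno = lineno  # save the state at the other exit point too
        raise StopIteration

# Made-up input with a comment line and a blank line mixed in.
LINES = [
    "# user database\n",
    "root:*:0:0:root:/root:/bin/sh\n",
    "\n",
    "nobody:*:99:99:nobody:/:/bin/false\n",
]

reader = PasswdNameIterator(LINES)
results = [(reader.lineno, name) for name in reader]
```

Every path out of __next__ has to remember to write lineno back, whether or not anyone ever looks at it.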

A good software design balances many factors, including performance and maintainability. The weights for each factor depend on the expected use cases. An unusual alternate design can be justified when it's a better match to the use cases, which I think is the case with my uncommon technique.

In most cases, API users don't want the line number of each record. For the /etc/passwd parser I think it's only useful for error reporting. More generally, it could be used to build a record index, or a syntax highlighter, but those are relatively rare needs.

The traditional class-based solution is, I think, easier to understand and implement, though it's a bit tedious to save and restore the parser state for each entry and exit point. This synchronization adds overhead to every step of the parser, which isn't needed for the common case where that information is ignored.

By comparison, my alternative generator solution has a larger overhead - two function calls instead of an attribute lookup - to access location information, but it doesn't need the explicit save/restore for each step because those are maintained by the generator's own execution frame. I think it's a better match for my use cases.

(Hmm. I think I've gone the other way. I think few will understand this without the examples or context. So, definitely lots of code examples for Pycon Finland.)


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2013 Andrew Dalke Scientific AB