Parsing

/home/writings/diary/archive/2005/04/22/parsing

Much of what I do in chemistry and biology is parsing. I need to parse data files, the output of wrapped programs, web pages, fields from a database, user input – the list goes on and on.

Parsing is a huge field. There is a detailed theory of formal languages, with concepts like the Chomsky hierarchy, but turning that theory into usable, practical tools is a different matter. Numerous libraries and packages are available. Whole books have been written on the subject. It's almost overwhelming. Here are some pointers.

First, see if there's already a package that does what you want it to do or is close enough that you can modify it. In bioinformatics there are open-source packages like Biopython and Bioperl which parse a large number of formats and outputs. For chemical informatics your best bet is OpenEye's OEChem for parsing the different types and flavors of chemistry files.

If not, then you'll need to write your own parser. There are two rough classes of formats: line-oriented and tree-oriented. The latter is often preferred by people who like recursion.

Examples of line-oriented formats include SD files, BLAST output, and GenBank files. These are parsed, as the name suggests, line by line. Most people assume the input file format is roughly correct, which simplifies the parsing code. The actual parsing is done with string operations or through regular expressions. Often the structure of the code directly reflects the structure of the file format.

A common line-oriented format is "CSV", for "comma-separated values". It refers to files where each line contains fields separated by a character, often a comma. The name is a misnomer because the separator doesn't need to be a comma; it could be a space, a tab, or some other character. The most basic reader looks like this:

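with open(filename) as infile:
    for line in infile:
        # tab-separated fields; strip the newline first so it
        # doesn't end up attached to the last field
        fields = line.rstrip("\n").split("\t")
        # ... use the fields ...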

There's no real CSV standard. Different programs handle things like quoted strings and embedded newlines differently. The most popular CSV format is that used by Excel. Python supports these variations with the csv module.
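As a rough sketch of what the csv module buys you over a plain split, here is a reader run against a small in-memory sample where one field is quoted and contains a comma. The sample data and the read_rows helper name are made up for illustration; only csv.reader and io.StringIO come from the standard library.

```python
import csv
import io

def read_rows(text, delimiter=","):
    """Return a list of rows, each a list of string fields.

    Hypothetical helper for illustration; it just wraps csv.reader.
    """
    # newline="" leaves line endings untouched so the csv module
    # can deal with them (and with embedded newlines) itself
    source = io.StringIO(text, newline="")
    return list(csv.reader(source, delimiter=delimiter))

# Made-up sample: the quoted field contains the separator character
sample = 'name,formula\n"ethanol, absolute",C2H6O\n'
rows = read_rows(sample)

# The same helper handles tab-separated input
tab_rows = read_rows("a\tb\n", delimiter="\t")
```

A plain line.split(",") would break the quoted field in two; the csv reader keeps it as one field.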

Hand-written parsers often assume the input data is at least roughly in the correct format because catching all the possible but unlikely errors is tedious and thankless. Machine-generated parsers are rarely used for these formats because the tokenizer is position-dependent and most parser generators don't handle that well. I wrote Martel as an alternative. That effort is on hold until I get more funding or find someone who's interested in developing it further.

Examples of tree-oriented formats include XML and Python code. These are usually designed to be easy for software to read. If a parser isn't available then one can usually be constructed easily using a parser generator. I've used SPARK but I think PLY is the recommended tool these days.

If you need to parse XML I can recommend Fredrik Lundh's ElementTree and its high-performance version cElementTree. I've mentioned BeautifulSoup for parsing HTML. There's also an adapter to use ElementTree with the output of HTML Tidy. BeautifulSoup is nice because you only need one file, but if you're going to do a lot of HTML processing I would use the ElementTree+tidy approach.
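To give a feel for the ElementTree style, here is a small sketch. It assumes a recent Python, where ElementTree ships in the standard library as xml.etree.ElementTree. The XML document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up document; the element and attribute names are illustrative only
doc = """<molecules>
  <molecule id="m1"><name>water</name></molecule>
  <molecule id="m2"><name>ethanol</name></molecule>
</molecules>"""

root = ET.fromstring(doc)

# Walk every <molecule> element and map its id attribute to its <name> text
names = {mol.get("id"): mol.findtext("name") for mol in root.iter("molecule")}
```

The tree is an ordinary Python object, so pulling fields out of it is mostly a matter of loops and dictionary or attribute lookups.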

The best source for information about these libraries is the Charming Python column by David Mertz. You should also read his XML Matters series and buy his book Text Processing in Python. If you're doing XML processing then also read Uche Ogbuji's writings.

Many formats are actually a sequence of records. The files can be large even if the records are of small or moderate size. It's best to read these one record at a time. A few years ago Python acquired the yield keyword which has made it easy to support this style of parsing.
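Here is a minimal sketch of record-at-a-time reading with yield. It assumes each record ends with a terminator line, the way SD file records end with "$$$$". The read_records name and the sample text are hypothetical, and a real reader would go on to parse the lines of each record.

```python
import io

def read_records(infile, terminator="$$$$"):
    """Yield one record at a time as a list of lines (terminator excluded)."""
    record = []
    for line in infile:
        line = line.rstrip("\n")
        if line == terminator:
            yield record
            record = []
        else:
            record.append(line)
    if record:
        # Trailing lines with no terminator: pass them on rather than
        # silently dropping them
        yield record

# Made-up input standing in for a large file
sample = io.StringIO("first\nrecord\n$$$$\nsecond\n$$$$\n")
records = list(read_records(sample))
```

Only one record is held in memory at a time, so the caller can loop over a very large file with "for record in read_records(infile):" and stop early if it wants.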

The most generally usable parsing libraries these days are feed-oriented. These parse a chunk of text at a time and can be used, for example, to read from a network connection and display partial results as the data comes in. However, they are also hard to write because the program needs to maintain its stack explicitly. I think emulating continuations through yield may help, but I haven't had a chance to experiment with it because I've not needed that generality.
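To show what "feed-oriented" means in the simplest possible case, here is a toy sketch that accepts text in arbitrary chunks and hands back only complete lines. The LineFeeder class is hypothetical and is not taken from any library. A real feed parser carries much more state between calls (that explicit stack), but the shape of the interface is the same.

```python
class LineFeeder:
    """Accept text in arbitrary chunks; return only the complete lines."""

    def __init__(self):
        # State carried between feed() calls: the unfinished last line
        self._pending = ""

    def feed(self, chunk):
        """Add a chunk of text and return the lines it completed."""
        self._pending += chunk
        # Everything before the final "\n" is complete; the remainder
        # (possibly empty) waits for the next chunk
        *lines, self._pending = self._pending.split("\n")
        return lines

    def close(self):
        """Signal end of input and return any unterminated final line."""
        leftover, self._pending = self._pending, ""
        return [leftover] if leftover else []

# Chunk boundaries deliberately fall in the middle of lines
feeder = LineFeeder()
out = []
for chunk in ["alpha\nbe", "ta\ngam", "ma"]:
    out.extend(feeder.feed(chunk))
out.extend(feeder.close())
```

The caller decides when data arrives, so the same object works whether the chunks come from a file, a socket, or a test.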

The Python standard library includes parsers for MS Windows-style INI files, email (including attachments), gzip files, zip archives, tar archive files and more. Peruse the module index to see what the standard library can do.
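As one example, here is a short sketch of reading INI-style settings. It assumes Python 3, where the module is named configparser. The section and option names are made up for the example.

```python
import configparser

# Made-up settings; a real program would usually call config.read(filename)
ini_text = """
[database]
host = localhost
port = 5432
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

host = config["database"]["host"]          # values are strings by default
port = config.getint("database", "port")   # convert explicitly when needed
```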

If you need to handle image files use PIL.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me