Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2006/03/22/file_parsing_locations

File parsing locations

One thing I want from file parsers is some ability to determine the position something was in the file. It doesn't have to be the exact byte, though sometimes that's nice. If I'm parsing a line-oriented file then the line number and byte position of the start of the line is good enough. I want these numbers to be able to report errors and progress.

Python generators make it easy to parse line-oriented file formats. Here's one for FASTA files. It yields records at a time. You would use it for small records because with a single genome sized record you wouldn't get position feedback until the end. It also assumes the last character in the file is a newline, which isn't always the case.

class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence

def read_fasta(infile):
    title = None
    seq = None
    for lineno, line in enumerate(infile):
        if line[0] == ">":
            if title is not None:
                # Was working on a previous record; yield it
                yield FastaRecord(title, "".join(seq))
            title = line[1:-1]
            seq = []
        else:
            seq.append(line[:-1])

    # Reached end of file while working on a record
    if title is not None:
        yield FastaRecord(title, "".join(seq))


filename = "/Users/dalke/cvses/biopython/Doc/examples/ls_orchid.fasta"
print len(list(read_fasta(open(filename)))), "records"
Run this and it prints 94 records.

Suppose the FASTA title line needs to start with a GenBank "gi" number in NCBI's syntax, as gb|gi-number. There are various ways to do it. I want one which lets me reuse the low-level FASTA reader for other purposes, and which doesn't slow down the common case of simple FASTA reading. Here's one approach:

class NotGIRecord(Exception):
    def __init__(self, why, field=None):
        self.why = why
        self.field = field
    def __repr__(self):
        if self.field is None:
            return "NotGIRecord(%r)" % (self.why,)
        else:
            return "NotGIRecord(%r,%r)" % (self.why, self.field)
    __str__ = __repr__

def verify_gi_records(fasta_reader):
    for record in fasta_reader:
        if not record.title.startswith("gi|"):
            raise NotGIRecord("does not start with a GenBank identifier")
        fields = record.title.split("|")
        if not fields[1].isdigit():
            raise NotGIRecord("identifier must be a number", field[1])
        yield record

# Modified 'ls_orchid' so one of the fields is wrong
filename = "modified_ls_orchid.fasta"
print len(list(verify_gi_records(read_fasta(open(filename))))), "records"
% python fasta.py
Traceback (most recent call last):
  File "fasta.py", line 45, in ?
    print len(list(verify_gi_records(read_fasta(open(filename))))), "records"
  File "fasta.py", line 37, in verify_gi_records
    raise NotGIRecord("does not start with a GenBank identifier")
__main__.NotGIRecord: NotGIRecord('does not start with a GenBank identifier')
%
That is, use an iterator to read the records then a wrapper iterator which verifies each record as it comes through.

The problem is that I want to report the line number with the failure. I can't do that with the above code as there's no access to line numbers, only the record number. What I can do is change the parsing code to yield a bit more information. The following is an example of returning two objects -- position and record -- from the iterator. Note: I've only lightly tested it.

class Position(object):
    def __init__(self, start_lineno, end_lineno):
        self.start_lineno = start_lineno
        self.end_lineno = end_lineno

class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence

def read_fasta(infile):
    title = None
    seq = None
    pos = Position(1, 1)

    for lineno, line in enumerate(infile):
        if line[0] == ">":
            if title is not None:
                # Was working on a previous record; yield it
                pos.end_lineno = lineno
                yield pos, FastaRecord(title, "".join(seq))
                pos.start_lineno = lineno+1
            title = line[1:-1]
            seq = []
        else:
            seq.append(line[:-1])

    # Reached end of file while working on a record
    if title is not None:
        pos.end_lineno = lineno
        yield pos, FastaRecord(title, "".join(seq))

class NotGIRecord(Exception):
    def __init__(self, why, field=None):
        self.why = why
        self.field = field
    def __repr__(self):
        if self.field is None:
            return "NotGIRecord(%r)" % (self.why,)
        else:
            return "NotGIRecord(%r,%r)" % (self.why, self.field)
    __str__ = __repr__

def verify_gi_records(fasta_reader):
    for pos, record in fasta_reader:
        if not record.title.startswith("gi|"):
            raise NotGIRecord(
                "does not start with a GenBank identifier at line %d" %
                (pos.start_lineno,))
        fields = record.title.split("|")
        if not fields[1].isdigit():
            raise NotGIRecord(
                "identifier must be a number at line %d" %
                (pos.start_lineno, field[1]))
        yield record

filename = "modified_ls_orchid.fasta"

print len(list(verify_gi_records(read_fasta(open(filename))))), "records"
__main__.NotGIRecord: NotGIRecord('does not start with a GenBank identifier at line 143')

This approach works. It's ugly though because it returns a 2-ple (an anonymous data structure with two elements). One variation is to merge the Position and FastaRecord data into a single class. With that variation either records must have position, locatable classes are subclasses of both the data class and the location class, or the parser sticks on a location.

SAX2 parsers are interesting to me because they have three different types of handlers. One is for the content, and there's a distinct one to get the location information. That information is correct only when doing a content handler callback. What that does is let the SAX user decide if the line information is interesting. In theory the implementation could have optimized code when there is no location callback and slower code which keep track of position information when there is a callback.

Using that approach another variation is to pass a Position instance into the 'read_fasta' function, so it's modified in-place and only the sequence record is returned. To get the location, look at the passed-in position instance, like this:

def read_fasta(infile, pos):
    title = None
    seq = None
    pos.start_lineno = 1
    for lineno, line in enumerate(infile):
        if line[0] == ">":
            if title is not None:
                # Was working on a previous record; yield it
                pos.end_lineno = lineno
                yield FastaRecord(title, "".join(seq))
                pos.start_lineno = lineno+1
            title = line[1:-1]
            seq = []
        else:
            seq.append(line[:-1])

    # Reached end of file while working on a record
    if title is not None:
        pos.end_lineno = lineno
        yield FastaRecord(title, "".join(seq))

In this case I didn't do the optimization when no locations are needed. Then again doing that makes testing more complicated.

I'm not saying this is a better way to get location information. It's one of many approaches and the right solution depends on what you need.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB