Dalke Scientific Software: More science. Less time. Products

A lot of the programming I do is parsing - extracting data fields from a file. It's an important though often boring job. There are many different file formats and most require a new parser, because the parser for a GenBank file can not handle BLAST or GO data. One of the reasons in favor of XML as a standard data representation format is to reduce the number of parsers needed, but the chances of everyone moving to XML is zero. You will need to learn how to write a parser.

The simplest standard file format in bioinformatics is the FASTA file. You've worked with it a bit in Scott's course. It looks like this

>YAL069W-1.334 Putative promoter sequence
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAA

>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA
GTATTATATTTAGGAAAAAAATAATCGATAAAGGCTCATCCGAAGATCAGTTAGATTCTT
TTTGCAAGTCCTGAAGAAATTTTCACACTACTACTATAAAAAAAAAATATCATAAAAAGG
TACATTACGTGCAACCAAAAGTGTAAAATGATTGGTTGCAATGTTTCACCTAAATTACTT

The first step to writing a parser is to figure out what's in the file and what you want from the file. I usually start by looking at files in the given format and get a feel for what's in them. In this case it looks like a FASTA file contains a set of sequence records and each sequence record has a title line (also called a description line or a comment line) and a sequence, which may be folded over multiple lines.

Looking at examples isn't enough because of a sampling problem. Does a FASTA file allow other fields? For example, might the following be allowed?

>YAL068C-7235.2170 Putative promoter sequence
;Author: Andrew Dalke
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA

There are a few ways to figure this out. Sometimes there's a written specification for a given file format (though not for FASTA). This is supposed to describe the format exactly but you'll frequently find it doesn't. I've run into many cases where, say, GenBank files from NCBI didn't meet their specification. In other cases "wild-type" file formats are only approximations to the specification. PDB (3D structure) files might only contain the atom records and not the other fields required by the specification. You can often find other information by looking at other programs that parse the given format, talking to friends, emailing the author of the format and searching for web pages and back USENET postings about it.

One guiding principle in software development is called YAGNI for "You Ain't Going To Need It." A tendency in many software developers is to write code that handles all the possible cases. In parsers this may mean that it can read all of the fields in a given format, be reusable, modular, flexible, etc. If you're like this you need to be careful because getting software to be "right", as compared to "right enough" is pretty hard. You might work hard doing something and find you never actually use it for anything, and that's just a waste of your time. Don't do things you aren't going to need.

If you know you only want a few fields from a file, only read those fields. If you have example data files you can check that they don't contain any surprises or special cases.

One thing you will need is a sanity check that the file is at least roughly what you expect it will be. If you're writing something to parse BLAST it should stop if the file doesn't appear to be in BLAST format.

Once you have an idea of what's in the file you need to figure out what you're going to do with the data. For this lecture I'll have you write various filters to select or reject records based on the description and the sequence so you'll need to parse both of those fields.

The FASTA file format is ambiguous. Some things are not well specified. Consider the following two records

>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA

>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA
CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA
AGTTTATATATAAATTTCCTTTTTATTGGA

The sequences are identical except for how the sequence is split across multiple lines. The first record has 60 characters per line while the second has 45. The FASTA format doesn't specify exactly how many characters go in a line and different programs and people use different values.

I point this out because sometimes people like parsers to be "round-tripable". This means that the parser extracts enough information to reproduce the original file exactly. To make a round-tripable FASTA parser you would need to keep track of where the sequence line breaks occured. Few people want that extra information so it falls into the YAGNI category.

I'll use an object model for each record. Each record has a title and a sequence so the class definition looks like this

class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence
and here's an example of making an instance of the FastaRecord class followed by getting the two attributes.
>>> rec = FastaRecord("YAL068C-7235.2170 Putative promoter sequence",
 ...       "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA")
>>> rec.title
'YAL068C-7235.2170 Putative promoter sequence'
>>> rec.sequence
'TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA'
>>> 

I want the parser to work something like this

input_file = open(filename)
for rec in read_fasta_records(input_file):
  print rec.title

Suppose I have a function that reads a single FASTA record at a time from a file. Then I could implement read_fasta_records by reading each of the records into a list and return the list, like this

def read_fasta_records(input_file):
    records = []
    while 1:
        record = read_fasta_record(input_file)
        if record is None:
            break
        records.append(record)
    return records
Now I need a read_fasta_record function which returns a record if one is available and returns None if the file is empty or if all the records have been read.

This style of design process is known as top-down design. Most programs are organized in layers. The top of the program is the main function, which calls functions which in turn call other functions. Python functions eventually depend on C functions which in turn depend on operating system functions which depend on hardware. A good programmer needs to be be aware of all of the layers, which takes a long time to learn. Thankfully you can get work done without knowing everything!

It's called top-down because I started with how the main function should look like and worked my way down. Another style is called bottom-up design which specifies each part and puts together parts to make larger and larger components.

Real programming uses a mix of styles. Top-down design gives you an idea of where you want to go while bottom-up lets you know what you can do. In most cases you don't quite know where you are going but you have some idea. This style is called stepwise refinement or iterative development because you go one step at a time and use the results to understand more about the problem, which lets you refine the solution to get a better solution. Iterate until you've solved the problem or decided it isn't worthwhile to finish.

Testing is very important. Testing helps you evaluate if your code actually works. You can test everything manually but automated tests are better because it's tedious to do the same thing over and over. You need to build your software so it can be tested. It's best to use a bottom-up approach - build a part and test it, then assemble some of the parts into a component and test that.

To test a file parser I need an input file. I'll start with one I made by hand called "simple.fasta"

>first sequence record
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA

>second sequence record
CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA
AGTTTATATATAAATTTCCTTTTTATTGGA

>third sequence record
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA

As my first step I'll write a program that reads the first record in the file (named parse1.py)
#### parse1.py
infile = open("simple.fasta")

# The first line in a FASTA record is the title line.
# Examples:
# >third sequence record
# >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
line = infile.readline()

# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
    raise TypeError("Not a FASTA file: %r" % line)

title = line

# Read the sequence lines up to the blank line.
sequence_lines = []
while 1:
    line = infile.readline().rstrip()
    if line == "":
        # Reached the end of the record or end of the file
        break
    sequence_lines.append(line)

# Merge the lines together to make a single string containing the sequence
# (This is faster than doing "sequence = sequence + line" if there are
# more than a few lines)
sequence = "".join(sequence_lines)

# Check that the results are as expected
if title != "first sequence record":
    raise AssertionError(repr(title))
if sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
    raise AssertionError(repr(sequence))
When I run this I get
Traceback (most recent call last):
  File "<stdin>", line 37, in ?
    raise AssertionError(repr(title))
AssertionError: '>first sequence record\n'

This comes from what's called an uncaught exception. I talked about this a bit in class on Thursday. Exceptions follow the guiding principle that "Errors should never pass silently." If something unexpected happens in the code and there's no good way to handle it, or it's a known problem, then the proper response is to raise an exception.

I'll talk about exceptions in more details in the future. For now just know that exceptions are created with the code raise Exception() where the Exception() is an instance of some exception class. In this case I raise an AssertionError and I used the repr() to make it easier to see the contents of the line (I wanted to see the newline character escaped as the characters "\n").

Because the exception wasn't caught anywhere in the code, the Python session gets it instead. Its default action is to print information about the exception and where it was raised. In this case it tells me the filename and line number and it prints the actual line.

What it means is that my parsing code has a bug. Looking at it I realize I forgot to remove the ">" from the front of the title and the "\n" from its end. Here's the corrected code

### parse2.py
infile = open("simple.fasta")

# The first line in a FASTA record is the title line.
# Examples:
# >third sequence record
# >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
line = infile.readline()

# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
    raise TypeError("Not a FASTA file: %r" % line)

# The title starts after the '>' and doesn't include any
# trailing whitespace.
title = line[1:].rstrip()

# Read the sequence lines up to the blank line.
sequence_lines = []
while 1:
    # I know I'm at the end of the sequence lines when I reach the
    # blank line or the end of the file.  I can simplify the test
    # by calling rstrip and checking for the empty string.  If it's
    # a sequence line I'll still call rstrip to remove the final
    # newline and any trailing whitespace so I'll go ahead and
    # always rstring the line.
    line = infile.readline().rstrip()
    if line == "":
        # Reached the end of the record or end of the file
        break
    sequence_lines.append(line)

# Merge the lines together to make a single string containing the sequence
# (This is faster than doing "sequence = sequence + line" if there are
# more than a few lines)
sequence = "".join(sequence_lines)

# Check that the results are as expected
if title != "first sequence record":
    raise AssertionError(repr(title))
if sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
    raise AssertionError(repr(sequence))

print "Done."

When I run parse2.py I get no error messages. I feel more comfortable if I see a success message so I'll also added a line at the end of the program to print "Done." when everything works.

That is the minimimum amount of code needed to read the first record in the file. I want to make the code reusable so I can read other records. I'll move the code that reads a record into a function. The function will return the title and sequence. I could return them as a tuple but because I know I want a FastaRecord at some point I'll go ahead and add it now.

#### parse3.py
class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence

def read_fasta_record(infile):
    # The first line in a FASTA record is the title line.
    # Examples:
    # >third sequence record
    # >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
    line = infile.readline()

    # Double-check that it's a FASTA file by looking for the '>'.
    if not line.startswith(">"):
        raise TypeError("Not a FASTA file: %r" % line)

    # The title starts after the '>' and doesn't include any
    # trailing whitespace.
    title = line[1:].rstrip()

    # Read the sequence lines up to the blank line.
    sequence_lines = []
    while 1:
        # I know I'm at the end of the sequence lines when I reach the
        # blank line or the end of the file.  I can simplify the test
        # by calling rstrip and checking for the empty string.  If it's
        # a sequence line I'll still call rstrip to remove the final
        # newline and any trailing whitespace so I'll go ahead and
        # always rstring the line.
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        sequence_lines.append(line)

    # Merge the lines together to make a single string containing the sequence
    # (This is faster than doing "sequence = sequence + line" if there are
    # more than a few lines)
    sequence = "".join(sequence_lines)
    
    return FastaRecord(title, sequence)


infile = open("simple.fasta")

record = read_fasta_record(infile)

# Check that the results are as expected
if record.title != "first sequence record":
    raise AssertionError(repr(title))
if record.sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
    raise AssertionError(repr(sequence))

print "Done."
The process of cleaning up a program to make it more usable or easier to understand or less complicated is called refactoring. It's best to refactor in small parts and test each change as you do it. In this case I did have to make a small change to the test code to handle the FastaRecord object. Now that I have a function I'll use it to read the next two records in the file and make sure they are correct. I'll show just the test code here
infile = open("simple.fasta")

record = read_fasta_record(infile)
if record.title != "first sequence record":
    raise AssertionError(repr(title))
if record.sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
    raise AssertionError(repr(sequence))

record = read_fasta_record(infile)
if record.title != "second sequence record":
    raise AssertionError(repr(title))
if record.sequence != (
    "CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
    "AGTTTATATATAAATTTCCTTTTTATTGGA"):
    raise AssertionError(repr(sequence))

record = read_fasta_record(infile)
if record.title != "third sequence record":
    raise AssertionError(repr(title))
if record.sequence != (
    "GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
    "TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"):
    raise AssertionError(repr(sequence))

print "Done."
I ran this and it worked without a problem. The reason it worked is because when the parser reads the blank line it knows it's at the end of a record. The next line must be the start of a new FASTA record (a line starting with ">") or the end of the file. What does my code do at the end of the file? I'll have the test code read one more record.
## parse4.py
class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence

def read_fasta_record(infile):
    # The first line in a FASTA record is the title line.
    # Examples:
    # >third sequence record
    # >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
    line = infile.readline()

    # Double-check that it's a FASTA file by looking for the '>'.
    if not line.startswith(">"):
        raise TypeError("Not a FASTA file: %r" % line)

    # The title starts after the '>' and doesn't include any
    # trailing whitespace.
    title = line[1:].rstrip()

    # Read the sequence lines up to the blank line.
    sequence_lines = []
    while 1:
        # I know I'm at the end of the sequence lines when I reach the
        # blank line or the end of the file.  I can simplify the test
        # by calling rstrip and checking for the empty string.  If it's
        # a sequence line I'll still call rstrip to remove the final
        # newline and any trailing whitespace so I'll go ahead and
        # always rstring the line.
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        sequence_lines.append(line)

    # Merge the lines together to make a single string containing the sequence
    # (This is faster than doing "sequence = sequence + line" if there are
    # more than a few lines)
    sequence = "".join(sequence_lines)
    
    return FastaRecord(title, sequence)

infile = open("simple.fasta")

record = read_fasta_record(infile)
if record.title != "first sequence record":
    raise AssertionError(repr(title))
if record.sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
    raise AssertionError(repr(sequence))

record = read_fasta_record(infile)
if record.title != "second sequence record":
    raise AssertionError(repr(title))
if record.sequence != (
    "CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
    "AGTTTATATATAAATTTCCTTTTTATTGGA"):
    raise AssertionError(repr(sequence))

record = read_fasta_record(infile)
if record.title != "third sequence record":
    raise AssertionError(repr(title))
if record.sequence != (
    "GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
    "TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"):
    raise AssertionError(repr(sequence))

# Test what happens at the end of the file
record = read_fasta_record(infile)

print "Done."

When I run this I get the following error.

Traceback (most recent call last):
  File "parse4.py", line 69, in ?
    record = read_fasta_record(infile)
  File "parse4.py", line 15, in read_fasta_record
    raise TypeError("Not a FASTA file: %r" % line)
TypeError: Not a FASTA file: ''

This exception came from

    # Double-check that it's a FASTA file by looking for the '>'.
    if not line.startswith(">"):
        raise TypeError("Not a FASTA file: %r" % line)

The function expects that it can read a line and that the first line read for a record is the title line. It failed here because readline() returns an empty string when it's at the end of the file, and the empty string does not start with ">". This is an example of why it's helpful to do some sanity checking - what would have happened if I didn't have that bit of code?

If it's the end of file I want it to return the None object instead. The None acts as a special indicator (the technical term is "sentinel") that there is no more input. The test code is simple; if the readline() for the title returns the empty string then it's the end of file so return None.

...
    line = infile.readline()

    if not line:
        # End of file
        return None
...
I'll also test that I get a None at the end of the file.
...
# Read past the end of file
record = read_fasta_record(infile)
if record is not None:
    raise AssertionError(repr(record))
...

Here's the full function and its test code

## parse5.py
class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence

def read_fasta_record(infile):
    # The first line in a FASTA record is the title line.
    # Examples:
    # >third sequence record
    # >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
    line = infile.readline()

    if not line:
        # End of file
        return None

    # Double-check that it's a FASTA file by looking for the '>'.
    if not line.startswith(">"):
        raise TypeError("Not a FASTA file: %r" % line)

    # The title starts after the '>' and doesn't include any
    # trailing whitespace.
    title = line[1:].rstrip()

    # Read the sequence lines up to the blank line.
    sequence_lines = []
    while 1:
        # I know I'm at the end of the sequence lines when I reach the
        # blank line or the end of the file.  I can simplify the test
        # by calling rstrip and checking for the empty string.  If it's
        # a sequence line I'll still call rstrip to remove the final
        # newline and any trailing whitespace so I'll go ahead and
        # always rstring the line.
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        sequence_lines.append(line)

    # Merge the lines together to make a single string containing the sequence
    # (This is faster than doing "sequence = sequence + line" if there are
    # more than a few lines)
    sequence = "".join(sequence_lines)
    
    return FastaRecord(title, sequence)


infile = open("simple.fasta")

record = read_fasta_record(infile)
if record.title != "first sequence record":
    raise AssertionError(repr(title))
if record.sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
    raise AssertionError(repr(sequence))

record = read_fasta_record(infile)
if record.title != "second sequence record":
    raise AssertionError(repr(title))
if record.sequence != (
    "CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
    "AGTTTATATATAAATTTCCTTTTTATTGGA"):
    raise AssertionError(repr(sequence))

record = read_fasta_record(infile)
if record.title != "third sequence record":
    raise AssertionError(repr(title))
if record.sequence != (
    "GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
    "TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"):
    raise AssertionError(repr(sequence))

# Read past the end of file
record = read_fasta_record(infile)
if record is not None:
    raise AssertionError(repr(record))

print "Done."

The test code is getting to be a bit complicated. I want to clean it up a bit and move it into its own function. There is a lot of duplication in my tests so I'll move the data into a list and use a for loop to test each field in the list.

test_data = [
    ("first sequence record",
     "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA"),
    ("second sequence record",
     "CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
     "AGTTTATATATAAATTTCCTTTTTATTGGA"),
    ("third sequence record",
     "GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
     "TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"),
    ]

def test():
    infile = open("simple.fasta")
    for (title, sequence) in test_data:
        record = read_fasta_record(infile)
        if title != record.title:
            raise AssertionError("title %r != %r" % (title, record.title))
        if sequence != record.sequence:
            raise AssertionError("sequence %r != %r" % (sequence,
                                                        record.sequence))
    record = read_fasta_record(infile)
    if record is not None:
        raise AssertionError(repr(record))

I want the code to be used as a module (to parse FASTA records) and on the command-line (to run the self tests) so I'll use a special Python technique for that. Python modules have an internal variable called __name__. If the file is imported then it's the name of the module but if it's run from the command-line then it's the string "__main__".

Here's an example of how it works. I'll make a file called "show_main_name.py" and run it from the command-line then import it from Python's interactive shell.

[~/NBN]% cat > show_main_name.py
print "This is always printed."
print "__name__ ==", __name__
if __name__ == "__main__":
    print "Run from the command-line"
else:
    print "Imported as a module"
^D
[~/NBN]%
[~/NBN]% python show_main_name.py 
This is always printed.
__name__ == __main__
Run from the command-line
[~/NBN]%
[~/NBN]% python 
Python 2.3 (#1, Sep 13 2003, 00:49:11) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import show_main_name 
This is always printed.
__name__ == show_main_name
Imported as a module
>>> ^D
[~/NBN]%
I can use this technique for my new test() function with the following
if __name__ == "__main__":
    test()
    print "All tests passed."
I changed the output from "Done." to "All tests passsed." because I wanted to emphasize just what was being done.

Putting these together, here's the current version of the code

## parse6.py
class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence

def read_fasta_record(infile):
    # The first line in a FASTA record is the title line.
    # Examples:
    # >third sequence record
    # >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
    line = infile.readline()

    if not line:
        # End of file
        return None

    # Double-check that it's a FASTA file by looking for the '>'.
    if not line.startswith(">"):
        raise TypeError("Not a FASTA file: %r" % line)

    # The title starts after the '>' and doesn't include any
    # trailing whitespace.
    title = line[1:].rstrip()

    # Read the sequence lines up to the blank line.
    sequence_lines = []
    while 1:
        # I know I'm at the end of the sequence lines when I reach the
        # blank line or the end of the file.  I can simplify the test
        # by calling rstrip and checking for the empty string.  If it's
        # a sequence line I'll still call rstrip to remove the final
        # newline and any trailing whitespace so I'll go ahead and
        # always rstring the line.
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        sequence_lines.append(line)

    # Merge the lines together to make a single string containing the sequence
    # (This is faster than doing "sequence = sequence + line" if there are
    # more than a few lines)
    sequence = "".join(sequence_lines)
    
    return FastaRecord(title, sequence)

test_data = [
    ("first sequence record",
     "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA"),
    ("second sequence record",
     "CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
     "AGTTTATATATAAATTTCCTTTTTATTGGA"),
    ("third sequence record",
     "GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
     "TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"),
    ]

def test():
    infile = open("simple.fasta")
    for (title, sequence) in test_data:
        record = read_fasta_record(infile)
        if title != record.title:
            raise AssertionError("title %r != %r" % (title, record.title))
        if sequence != record.sequence:
            raise AssertionError("sequence %r != %r" % (sequence,
                                                        record.sequence))
    record = read_fasta_record(infile)
    if record is not None:
        raise AssertionError(repr(record))

if __name__ == "__main__":
    test()
    print "All tests passed."

And here it is in action, first as the command-line version to do the self-tests and the second as a module import.

[~/NBN]% python parse6.py
All tests passed.
[~/NBN]% python
Python 2.3 (#1, Sep 13 2003, 00:49:11) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse6
>>> rec = parse6.read_fasta_record(open("simple.fasta"))
>>> rec.title
'first sequence record'
>>> 

Now that I have a function to read a record, I can make a function to read a list of all the records. I'll use the one I defined earlier in this essay

def read_fasta_records(input_file):
    records = []
    while 1:
        record = read_fasta_record(input_file)
        if record is None:
            break
        records.append(record)
    return records

I need to write some test code for it. The code is somewhat different than the existing test code so I decided to make a new function. I moved things around a bit so the old test code is now named test_single_record, the new test function is named test_records and I have a new test function that calls each function. (This new test function is sometimes called the master or driver because it controls how to do the tests.)

Here's the latest version

## parse7.py
class FastaRecord(object):
    def __init__(self, title, sequence):
        self.title = title
        self.sequence = sequence

def read_fasta_record(infile):
    # The first line in a FASTA record is the title line.
    # Examples:
    # >third sequence record
    # >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
    line = infile.readline()

    if not line:
        # End of file
        return None

    # Double-check that it's a FASTA file by looking for the '>'.
    if not line.startswith(">"):
        raise TypeError("Not a FASTA file: %r" % line)

    # The title starts after the '>' and doesn't include any
    # trailing whitespace.
    title = line[1:].rstrip()

    # Read the sequence lines up to the blank line.
    sequence_lines = []
    while 1:
        # I know I'm at the end of the sequence lines when I reach the
        # blank line or the end of the file.  I can simplify the test
        # by calling rstrip and checking for the empty string.  If it's
        # a sequence line I'll still call rstrip to remove the final
        # newline and any trailing whitespace so I'll go ahead and
        # always rstring the line.
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        sequence_lines.append(line)

    # Merge the lines together to make a single string containing the sequence
    # (This is faster than doing "sequence = sequence + line" if there are
    # more than a few lines)
    sequence = "".join(sequence_lines)
    
    return FastaRecord(title, sequence)

def read_fasta_records(input_file):
    records = []
    while 1:
        record = read_fasta_record(input_file)
        if record is None:
            break
        records.append(record)
    return records


##### Self-test code

test_data = [
    ("first sequence record",
     "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA"),
    ("second sequence record",
     "CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
     "AGTTTATATATAAATTTCCTTTTTATTGGA"),
    ("third sequence record",
     "GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
     "TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"),
    ]

# Test the ability to read a single record at a time
def test_single_record():
    infile = open("simple.fasta")
    for (title, sequence) in test_data:
        record = read_fasta_record(infile)
        if title != record.title:
            raise AssertionError("title %r != %r" % (title, record.title))
        if sequence != record.sequence:
            raise AssertionError("sequence %r != %r" % (sequence,
                                                        record.sequence))
    record = read_fasta_record(infile)
    if record is not None:
        raise AssertionError(repr(record))

# Test the ability to read all the records into a list
def test_records():
    infile = open("simple.fasta")

    count = 0
    for record in read_fasta_records(infile):
        title, sequence = test_data[count]
        if title != record.title:
            raise AssertionError("title %r != %r" % (title, record.title))
        if sequence != record.sequence:
            raise AssertionError("sequence %r != %r" % (sequence,
                                                        record.sequence))
        count = count + 1

    if count != len(test_data):
        raise AssertionError("Read %d records, expected %d" %
                             (count, len(test_data)))

# The master test function
def test():
    test_single_record()
    test_records()
    
if __name__ == "__main__":
    test()
    print "Done."

CAUTION: This FASTA file reader is not correct. There are a few cases it does not handle correctly. You should use the Biopython FASTA parser for real work, or at least make sure that your FASTA files are cleanly formatted.



Copyright © 2001-2013 Andrew Dalke Scientific AB