A lot of the programming I do is parsing - extracting data fields from a file. It's an important though often boring job. There are many different file formats and most require a new parser, because the parser for a GenBank file can not handle BLAST or GO data. One of the reasons in favor of XML as a standard data representation format is to reduce the number of parsers needed, but the chances of everyone moving to XML is zero. You will need to learn how to write a parser.
The simplest standard file format in bioinformatics is the FASTA file. You've worked with it a bit in Scott's course. It looks like this
>YAL069W-1.334 Putative promoter sequence CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC TGTTCTTCTACCCACCATATTGAAACGCTAACAA >YAL068C-7235.2170 Putative promoter sequence TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA GTATTATATTTAGGAAAAAAATAATCGATAAAGGCTCATCCGAAGATCAGTTAGATTCTT TTTGCAAGTCCTGAAGAAATTTTCACACTACTACTATAAAAAAAAAATATCATAAAAAGG TACATTACGTGCAACCAAAAGTGTAAAATGATTGGTTGCAATGTTTCACCTAAATTACTT
The first step to writing a parser is to figure out what's in the file and what you want from the file. I usually start by looking at files in the given format and get a feel for what's in them. In this case it looks like a FASTA file contains a set of sequence records and each sequence record has a title line (also called a description line or a comment line) and a sequence, which may be folded over multiple lines.
Looking at examples isn't enough because of a sampling problem. Does a FASTA file allow other fields? For example, might the following be allowed?
>YAL068C-7235.2170 Putative promoter sequence ;Author: Andrew Dalke TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA
There are a few ways to figure this out. Sometimes there's a written specification for a given file format (though not for FASTA). This is supposed to describe the format exactly but you'll frequently find it doesn't. I've run into many cases where, say, GenBank files from NCBI didn't meet their specification. In other cases "wild-type" file formats are only approximations to the specification. PDB (3D structure) files might only contain the atom records and not the other fields required by the specification. You can often find other information by looking at other programs that parse the given format, talking to friends, emailing the author of the format and searching for web pages and back USENET postings about it.
One guiding principle in software development is called YAGNI for "You Ain't Going To Need It." A tendency in many software developers is to write code that handles all the possible cases. In parsers this may mean that it can read all of the fields in a given format, be reusable, modular, flexible, etc. If you're like this you need to be careful because getting software to be "right", as compared to "right enough" is pretty hard. You might work hard doing something and find you never actually use it for anything, and that's just a waste of your time. Don't do things you aren't going to need.
If you know you only want a few fields from a file, only read those fields. If you have example data files you can check that they don't contain any surprises or special cases.
One thing you will need is a sanity check that the file is at least roughly what you expect it will be. If you're writing something to parse BLAST it should stop if the file doesn't appear to be in BLAST format.
Once you have an idea of what's in the file you need to figure out what you're going to do with the data. For this lecture I'll have you write various filters to select or reject records based on the description and the sequence so you'll need to parse both of those fields.
The FASTA file format is ambiguous. Some things are not well specified. Consider the following two records
>YAL068C-7235.2170 Putative promoter sequence TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA >YAL068C-7235.2170 Putative promoter sequence TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA AGTTTATATATAAATTTCCTTTTTATTGGA
The sequences are identical except for how the sequence is split across multiple lines. The first record has 60 characters per line while the second has 45. The FASTA format doesn't specify exactly how many characters go in a line and different programs and people use different values.
I point this out because sometimes people like parsers to be "round-tripable". This means that the parser extracts enough information to reproduce the original file exactly. To make a round-tripable FASTA parser you would need to keep track of where the sequence line breaks occured. Few people want that extra information so it falls into the YAGNI category.
I'll use an object model for each record. Each record has a title and a sequence so the class definition looks like this
class FastaRecord(object):
def __init__(self, title, sequence):
self.title = title
self.sequence = sequence
and here's an example of making an instance of the FastaRecord class
followed by getting the two attributes.
>>> rec = FastaRecord("YAL068C-7235.2170 Putative promoter sequence",
... "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA")
>>> rec.title
'YAL068C-7235.2170 Putative promoter sequence'
>>> rec.sequence
'TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA'
>>>
I want the parser to work something like this
input_file = open(filename) for rec in read_fasta_records(input_file): print rec.title
Suppose I have a function that reads a single FASTA record at a time from a file. Then I could implement read_fasta_records by reading each of the records into a list and return the list, like this
def read_fasta_records(input_file):
records = []
while 1:
record = read_fasta_record(input_file)
if record is None:
break
records.append(record)
return records
Now I need a read_fasta_record function which returns a record if
one is available and returns None if the file is empty or if all the
records have been read.
This style of design process is known as top-down design. Most programs are organized in layers. The top of the program is the main function, which calls functions which in turn call other functions. Python functions eventually depend on C functions which in turn depend on operating system functions which depend on hardware. A good programmer needs to be be aware of all of the layers, which takes a long time to learn. Thankfully you can get work done without knowing everything!
It's called top-down because I started with how the main function should look like and worked my way down. Another style is called bottom-up design which specifies each part and puts together parts to make larger and larger components.
Real programming uses a mix of styles. Top-down design gives you an idea of where you want to go while bottom-up lets you know what you can do. In most cases you don't quite know where you are going but you have some idea. This style is called stepwise refinement or iterative development because you go one step at a time and use the results to understand more about the problem, which lets you refine the solution to get a better solution. Iterate until you've solved the problem or decided it isn't worthwhile to finish.
Testing is very important. Testing helps you evaluate if your code actually works. You can test everything manually but automated tests are better because it's tedious to do the same thing over and over. You need to build your software so it can be tested. It's best to use a bottom-up approach - build a part and test it, then assemble some of the parts into a component and test that.
To test a file parser I need an input file. I'll start with one I made by hand called "simple.fasta"
>first sequence record TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA >second sequence record CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA AGTTTATATATAAATTTCCTTTTTATTGGA >third sequence record GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAAAs my first step I'll write a program that reads the first record in the file (named parse1.py)
#### parse1.py
infile = open("simple.fasta")
# The first line in a FASTA record is the title line.
# Examples:
# >third sequence record
# >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
line = infile.readline()
# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
raise TypeError("Not a FASTA file: %r" % line)
title = line
# Read the sequence lines up to the blank line.
sequence_lines = []
while 1:
line = infile.readline().rstrip()
if line == "":
# Reached the end of the record or end of the file
break
sequence_lines.append(line)
# Merge the lines together to make a single string containing the sequence
# (This is faster than doing "sequence = sequence + line" if there are
# more than a few lines)
sequence = "".join(sequence_lines)
# Check that the results are as expected
if title != "first sequence record":
raise AssertionError(repr(title))
if sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
raise AssertionError(repr(sequence))
When I run this I get
Traceback (most recent call last):
File "<stdin>", line 37, in ?
raise AssertionError(repr(title))
AssertionError: '>first sequence record\n'
This comes from what's called an uncaught exception. I talked about this a bit in class on Thursday. Exceptions follow the guiding principle that "Errors should never pass silently." If something unexpected happens in the code and there's no good way to handle it, or it's a known problem, then the proper response is to raise an exception.
I'll talk about exceptions in more details in the future. For now just know that exceptions are created with the code raise Exception() where the Exception() is an instance of some exception class. In this case I raise an AssertionError and I used the repr() to make it easier to see the contents of the line (I wanted to see the newline character escaped as the characters "\n").
Because the exception wasn't caught anywhere in the code, the Python session gets it instead. Its default action is to print information about the exception and where it was raised. In this case it tells me the filename and line number and it prints the actual line.
What it means is that my parsing code has a bug. Looking at it I realize I forgot to remove the ">" from the front of the title and the "\n" from its end. Here's the corrected code
### parse2.py
infile = open("simple.fasta")
# The first line in a FASTA record is the title line.
# Examples:
# >third sequence record
# >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
line = infile.readline()
# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
raise TypeError("Not a FASTA file: %r" % line)
# The title starts after the '>' and doesn't include any
# trailing whitespace.
title = line[1:].rstrip()
# Read the sequence lines up to the blank line.
sequence_lines = []
while 1:
# I know I'm at the end of the sequence lines when I reach the
# blank line or the end of the file. I can simplify the test
# by calling rstrip and checking for the empty string. If it's
# a sequence line I'll still call rstrip to remove the final
# newline and any trailing whitespace so I'll go ahead and
# always rstring the line.
line = infile.readline().rstrip()
if line == "":
# Reached the end of the record or end of the file
break
sequence_lines.append(line)
# Merge the lines together to make a single string containing the sequence
# (This is faster than doing "sequence = sequence + line" if there are
# more than a few lines)
sequence = "".join(sequence_lines)
# Check that the results are as expected
if title != "first sequence record":
raise AssertionError(repr(title))
if sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
raise AssertionError(repr(sequence))
print "Done."
When I run parse2.py I get no error messages. I feel more comfortable if I see a success message so I'll also added a line at the end of the program to print "Done." when everything works.
That is the minimimum amount of code needed to read the first record in the file. I want to make the code reusable so I can read other records. I'll move the code that reads a record into a function. The function will return the title and sequence. I could return them as a tuple but because I know I want a FastaRecord at some point I'll go ahead and add it now.
#### parse3.py
class FastaRecord(object):
def __init__(self, title, sequence):
self.title = title
self.sequence = sequence
def read_fasta_record(infile):
# The first line in a FASTA record is the title line.
# Examples:
# >third sequence record
# >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
line = infile.readline()
# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
raise TypeError("Not a FASTA file: %r" % line)
# The title starts after the '>' and doesn't include any
# trailing whitespace.
title = line[1:].rstrip()
# Read the sequence lines up to the blank line.
sequence_lines = []
while 1:
# I know I'm at the end of the sequence lines when I reach the
# blank line or the end of the file. I can simplify the test
# by calling rstrip and checking for the empty string. If it's
# a sequence line I'll still call rstrip to remove the final
# newline and any trailing whitespace so I'll go ahead and
# always rstring the line.
line = infile.readline().rstrip()
if line == "":
# Reached the end of the record or end of the file
break
sequence_lines.append(line)
# Merge the lines together to make a single string containing the sequence
# (This is faster than doing "sequence = sequence + line" if there are
# more than a few lines)
sequence = "".join(sequence_lines)
return FastaRecord(title, sequence)
infile = open("simple.fasta")
record = read_fasta_record(infile)
# Check that the results are as expected
if record.title != "first sequence record":
raise AssertionError(repr(title))
if record.sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
raise AssertionError(repr(sequence))
print "Done."
The process of cleaning up a program to make it more usable or easier
to understand or less complicated is called refactoring. It's
best to refactor in small parts and test each change as you do it. In
this case I did have to make a small change to the test code to handle
the FastaRecord object.
Now that I have a function I'll use it to read the next two records in
the file and make sure they are correct. I'll show just the test code
here
infile = open("simple.fasta")
record = read_fasta_record(infile)
if record.title != "first sequence record":
raise AssertionError(repr(title))
if record.sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
raise AssertionError(repr(sequence))
record = read_fasta_record(infile)
if record.title != "second sequence record":
raise AssertionError(repr(title))
if record.sequence != (
"CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
"AGTTTATATATAAATTTCCTTTTTATTGGA"):
raise AssertionError(repr(sequence))
record = read_fasta_record(infile)
if record.title != "third sequence record":
raise AssertionError(repr(title))
if record.sequence != (
"GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
"TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"):
raise AssertionError(repr(sequence))
print "Done."
I ran this and it worked without a problem.
The reason it worked is because when the parser reads the blank line
it knows it's at the end of a record. The next line must be the start
of a new FASTA record (a line starting with ">") or the end of the
file.
What does my code do at the end of the file? I'll have the test
code read one more record.
## parse4.py
class FastaRecord(object):
def __init__(self, title, sequence):
self.title = title
self.sequence = sequence
def read_fasta_record(infile):
# The first line in a FASTA record is the title line.
# Examples:
# >third sequence record
# >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
line = infile.readline()
# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
raise TypeError("Not a FASTA file: %r" % line)
# The title starts after the '>' and doesn't include any
# trailing whitespace.
title = line[1:].rstrip()
# Read the sequence lines up to the blank line.
sequence_lines = []
while 1:
# I know I'm at the end of the sequence lines when I reach the
# blank line or the end of the file. I can simplify the test
# by calling rstrip and checking for the empty string. If it's
# a sequence line I'll still call rstrip to remove the final
# newline and any trailing whitespace so I'll go ahead and
# always rstring the line.
line = infile.readline().rstrip()
if line == "":
# Reached the end of the record or end of the file
break
sequence_lines.append(line)
# Merge the lines together to make a single string containing the sequence
# (This is faster than doing "sequence = sequence + line" if there are
# more than a few lines)
sequence = "".join(sequence_lines)
return FastaRecord(title, sequence)
infile = open("simple.fasta")
record = read_fasta_record(infile)
if record.title != "first sequence record":
raise AssertionError(repr(title))
if record.sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
raise AssertionError(repr(sequence))
record = read_fasta_record(infile)
if record.title != "second sequence record":
raise AssertionError(repr(title))
if record.sequence != (
"CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
"AGTTTATATATAAATTTCCTTTTTATTGGA"):
raise AssertionError(repr(sequence))
record = read_fasta_record(infile)
if record.title != "third sequence record":
raise AssertionError(repr(title))
if record.sequence != (
"GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
"TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"):
raise AssertionError(repr(sequence))
# Test what happens at the end of the file
record = read_fasta_record(infile)
print "Done."
When I run this I get the following error.
Traceback (most recent call last):
File "parse4.py", line 69, in ?
record = read_fasta_record(infile)
File "parse4.py", line 15, in read_fasta_record
raise TypeError("Not a FASTA file: %r" % line)
TypeError: Not a FASTA file: ''
This exception came from
# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
raise TypeError("Not a FASTA file: %r" % line)
The function expects that it can read a line and that the first line read for a record is the title line. It failed here because readline() returns an empty string when it's at the end of the file, and the empty string does not start with ">". This is an example of why it's helpful to do some sanity checking - what would have happened if I didn't have that bit of code?
If it's the end of file I want it to return the None object instead. The None acts as a special indicator (the technical term is "sentinel") that there is no more input. The test code is simple; if the readline() for the title returns the empty string then it's the end of file so return None.
...
line = infile.readline()
if not line:
# End of file
return None
...
I'll also test that I get a None at the end of the file.
...
# Read past the end of file
record = read_fasta_record(infile)
if record is not None:
raise AssertionError(repr(record))
...
Here's the full function and its test code
## parse5.py
class FastaRecord(object):
def __init__(self, title, sequence):
self.title = title
self.sequence = sequence
def read_fasta_record(infile):
# The first line in a FASTA record is the title line.
# Examples:
# >third sequence record
# >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
line = infile.readline()
if not line:
# End of file
return None
# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
raise TypeError("Not a FASTA file: %r" % line)
# The title starts after the '>' and doesn't include any
# trailing whitespace.
title = line[1:].rstrip()
# Read the sequence lines up to the blank line.
sequence_lines = []
while 1:
# I know I'm at the end of the sequence lines when I reach the
# blank line or the end of the file. I can simplify the test
# by calling rstrip and checking for the empty string. If it's
# a sequence line I'll still call rstrip to remove the final
# newline and any trailing whitespace so I'll go ahead and
# always rstring the line.
line = infile.readline().rstrip()
if line == "":
# Reached the end of the record or end of the file
break
sequence_lines.append(line)
# Merge the lines together to make a single string containing the sequence
# (This is faster than doing "sequence = sequence + line" if there are
# more than a few lines)
sequence = "".join(sequence_lines)
return FastaRecord(title, sequence)
infile = open("simple.fasta")
record = read_fasta_record(infile)
if record.title != "first sequence record":
raise AssertionError(repr(title))
if record.sequence != "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA":
raise AssertionError(repr(sequence))
record = read_fasta_record(infile)
if record.title != "second sequence record":
raise AssertionError(repr(title))
if record.sequence != (
"CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
"AGTTTATATATAAATTTCCTTTTTATTGGA"):
raise AssertionError(repr(sequence))
record = read_fasta_record(infile)
if record.title != "third sequence record":
raise AssertionError(repr(title))
if record.sequence != (
"GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
"TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"):
raise AssertionError(repr(sequence))
# Read past the end of file
record = read_fasta_record(infile)
if record is not None:
raise AssertionError(repr(record))
print "Done."
The test code is getting to be a bit complicated. I want to clean it up a bit and move it into its own function. There is a lot of duplication in my tests so I'll move the data into a list and use a for loop to test each field in the list.
test_data = [
("first sequence record",
"TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA"),
("second sequence record",
"CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
"AGTTTATATATAAATTTCCTTTTTATTGGA"),
("third sequence record",
"GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
"TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"),
]
def test():
infile = open("simple.fasta")
for (title, sequence) in test_data:
record = read_fasta_record(infile)
if title != record.title:
raise AssertionError("title %r != %r" % (title, record.title))
if sequence != record.sequence:
raise AssertionError("sequence %r != %r" % (sequence,
record.sequence))
record = read_fasta_record(infile)
if record is not None:
raise AssertionError(repr(record))
I want the code to be used as a module (to parse FASTA records) and on the command-line (to run the self tests) so I'll use a special Python technique for that. Python modules have an internal variable called __name__. If the file is imported then it's the name of the module but if it's run from the command-line then it's the string "__main__".
Here's an example of how it works. I'll make a file called "show_main_name.py" and run it from the command-line then import it from Python's interactive shell.
[~/NBN]% cat > show_main_name.py
print "This is always printed."
print "__name__ ==", __name__
if __name__ == "__main__":
print "Run from the command-line"
else:
print "Imported as a module"
^D
[~/NBN]%
[~/NBN]% python show_main_name.py
This is always printed.
__name__ == __main__
Run from the command-line
[~/NBN]%
[~/NBN]% python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import show_main_name
This is always printed.
__name__ == show_main_name
Imported as a module
>>> ^D
[~/NBN]%
I can use this technique for my new test() function with the following
if __name__ == "__main__":
test()
print "All tests passed."
I changed the output from "Done." to "All tests passsed." because I
wanted to emphasize just what was being done.
Putting these together, here's the current version of the code
## parse6.py
class FastaRecord(object):
def __init__(self, title, sequence):
self.title = title
self.sequence = sequence
def read_fasta_record(infile):
# The first line in a FASTA record is the title line.
# Examples:
# >third sequence record
# >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene
line = infile.readline()
if not line:
# End of file
return None
# Double-check that it's a FASTA file by looking for the '>'.
if not line.startswith(">"):
raise TypeError("Not a FASTA file: %r" % line)
# The title starts after the '>' and doesn't include any
# trailing whitespace.
title = line[1:].rstrip()
# Read the sequence lines up to the blank line.
sequence_lines = []
while 1:
# I know I'm at the end of the sequence lines when I reach the
# blank line or the end of the file. I can simplify the test
# by calling rstrip and checking for the empty string. If it's
# a sequence line I'll still call rstrip to remove the final
# newline and any trailing whitespace so I'll go ahead and
# always rstring the line.
line = infile.readline().rstrip()
if line == "":
# Reached the end of the record or end of the file
break
sequence_lines.append(line)
# Merge the lines together to make a single string containing the sequence
# (This is faster than doing "sequence = sequence + line" if there are
# more than a few lines)
sequence = "".join(sequence_lines)
return FastaRecord(title, sequence)
test_data = [
("first sequence record",
"TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA"),
("second sequence record",
"CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA" +
"AGTTTATATATAAATTTCCTTTTTATTGGA"),
("third sequence record",
"GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA" +
"TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA"),
]
def test():
infile = open("simple.fasta")
for (title, sequence) in test_data:
record = read_fasta_record(infile)
if title != record.title:
raise AssertionError("title %r != %r" % (title, record.title))
if sequence != record.sequence:
raise AssertionError("sequence %r != %r" % (sequence,
record.sequence))
record = read_fasta_record(infile)
if record is not None:
raise AssertionError(repr(record))
if __name__ == "__main__":
test()
print "All tests passed."
And here it is in action, first as the command-line version to do the self-tests and the second as a module import.
[~/NBN]% python parse6.py
All tests passed.
[~/NBN]% python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import parse6
>>> rec = parse6.read_fasta_record(open("simple.fasta"))
>>> rec.title
'first sequence record'
>>>
Now that I have a function to read a record, I can make a function to read a list of all the records. I'll use the one I defined earlier in this essay
def read_fasta_records(input_file):
records = []
while 1:
record = read_fasta_record(input_file)
if record is None:
break
records.append(record)
return records
I need to write some test code for it. The code is somewhat different than the existing test code so I decided to make a new function. I moved things around a bit so the old test code is now named test_single_record, the new test function is named test_records and I have a new test function that calls each function. (This new test function is sometimes called the master or driver because it controls how to do the tests.)
CAUTION: This FASTA file reader is not correct. There are a few cases it does not handle correctly. You should use the Biopython FASTA parser for real work, or at least make sure that your FASTA files are cleanly formatted.
Copyright © 2001-2008 Dalke Scientific Software, LLC.


