Dalke Scientific Software: More science. Less time. Products

Parsing FASTA files

What, again?

FASTA files are a bit more complex than what I talked about a couple weeks ago. That FASTA parser won't handle some FASTA files. It assumes there is a blank line between each record. I made that assumption because it's easier to write a parser when there is a well-defined "end of record" identifier. Real FASTA files don't always have that blank line, as with the following.

>YAL059W-36310.36514 Putative promoter sequence
AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTT
TATAATATTTGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTT
TTTTCTTCTTTTTTTTTCCTTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAA
GGACCCTTTAGAAGTATTGAATGTG
>YAL058W-37154.37469 Putative promoter sequence
TTTCATATGAAAGGTCCTAGGAATACACGATTCTTGTACGCATTCTTCTTTTTTCTATCT
TCTTTCATTCTTTGTACATTAGATAACATGGTTTTAGCTTAGTTTTATTTTATTTTTTAT
ATATCTGGATGTATACTATTATTGAAAAACTTCATTAATAGTTACAACTTTTTCAATATC
AAGTTGATTAAGAAAAAGAAAATTATTATGGGTTAGCTGAAAACCGTGTGATGCATGTCG
TTTAAGGATTGTGTAAAAAAGTGAACGGCAACGCATTTCTAATATAGATAACGGCCACAC
AAAGTAGTACTATGAA
>YAL056W-38979.39264 Putative promoter sequence
AAGTGTTAGTTTATAACATGGTCTCAATAATTGCACCACAACGGCTTCTCTTTTATAGAT
GGTTAACATTATAGTATCAATATTATCATCATGATTAAATGATGATGTATAATACTTACC
CGATGTTAAATCTTATTTTTTCATGCAGTAAGTAATCATGCAACAAGAAAAACCCGTAAT
TAAGCGAACATAGAACAACTAGCATCCCCGATAAGACGGAATAGAATAGTAAAGATTGTG
ATTCATTGGCAGGTCCATTGTCGCATTACTAAATCATAGGCATGGA
Less often you'll come across FASTA files with extra blank lines between records or at the beginning or end of the file, but I've never seen one with an extra blank line inside of a record. I expect that many parsers won't handle that case correctly.

What happens if I use the old FASTA parser to read these FASTA records? I'll save them to the file "test.fasta" and use fasta_reader.py as a library to read the first record in the file.

>>> import fasta_reader
>>> rec = fasta_reader.read_fasta_record(open("test.fasta"))
>>> rec.title
'YAL059W-36310.36514 Putative promoter sequence'
>>> rec.sequence
'AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTTTATAATATT
TGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTTTTTTCTTCTTTTTTTTTCC
TTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAAGGACCCTTTAGAAGTATTGAATGTG>
YAL058W-37154.37469 Putative promoter sequenceTTTCATATGAAAGGTCCTAGGAAT
ACACGATTCTTGTACGCATTCTTCTTTTTTCTATCTTCTTTCATTCTTTGTACATTAGATAACATGGTTT
TAGCTTAGTTTTATTTTATTTTTTATATATCTGGATGTATACTATTATTGAAAAACTTCATTAATAGTTA
CAACTTTTTCAATATCAAGTTGATTAAGAAAAAGAAAATTATTATGGGTTAGCTGAAAACCGTGTGATGC
ATGTCGTTTAAGGATTGTGTAAAAAAGTGAACGGCAACGCATTTCTAATATAGATAACGGCCACACAAAG
TAGTACTATGAA>YAL056W-38979.39264 Putative promoter sequenceAAGTGTTA
GTTTATAACATGGTCTCAATAATTGCACCACAACGGCTTCTCTTTTATAGATGGTTAACATTATAGTATC
AATATTATCATCATGATTAAATGATGATGTATAATACTTACCCGATGTTAAATCTTATTTTTTCATGCAG
TAAGTAATCATGCAACAAGAAAAACCCGTAATTAAGCGAACATAGAACAACTAGCATCCCCGATAAGACG
GAATAGAATAGTAAAGATTGTGATTCATTGGCAGGTCCATTGTCGCATTACTAAATCATAGGCATGGA'
>>> 
There is no blank line so the code that reads the sequence lines kept on reading lines until it reached the end of the file. All of the rest of the file was interpreted as sequence, so the other two title files were included in the sequence output.

You know how I keep talking about sanity checking? I should have done a bit more checking to make sure that I didn't use a line starting with a ">" as a sequence line. The change would occur in the following of the existing code

    sequence_lines = []
    while 1:
        line = infile.readline().rstrip()
        if line == "":
            break
        sequence_lines.append(line)
and the code with the change would look like
    sequence_lines = []
    while 1:
        line = infile.readline().rstrip()
        if line == "":
            break
        elif line.startswith(">"):
            raise TypeError("Read title line when expecting sequence line")
        sequence_lines.append(line)
I'm not going to show the full program with this change because I'm instead going to handle that case correctly.

Look-ahead

The problem with this more complex FASTA format is that there's no way to identify the end of a record except by reading the start of the next record. If I read a line after the title line it could be sequence data or it could be a blank line (marking the end of the sequence data) or it could be the next record's title line (implicitly marking the end of the sequence data and the start of the next record). In computer science terms, it needs to look-ahead to see if it's at the end of a record.

My fasta_record_reader() function reads a function and leaves the input file positioned ready to read the next record. After it reads a record it's ready to read the next record. If there's no blank line then it will read the first line of the next record, notice it's finished with the original record and return it to the caller. But in this case it's read one line too many. The file pointer will be on the sequence data line after the title line of the next record.

About to read the first line
title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       # Throw away the contents of the line?
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))
♦>YAL059W-36310.36514 Putative promoter sequence
 GGACCCTTTAGAAGTATTGAATGTG
 >YAL058W-37154.37469 Putative promoter sequence
 CGCATTCTTCTTTTTTCTATCT
title == ">YAL059W-36310.36514 Putative promoter sequence\n"
title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       # Throw away the contents of the line?
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))
 >YAL059W-36310.36514 Putative promoter sequence
♦GGACCCTTTAGAAGTATTGAATGTG
 >YAL058W-37154.37469 Putative promoter sequence
 CGCATTCTTCTTTTTTCTATCT
About to read the second line (first line of sequence data)
title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       # Throw away the contents of the line?
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))
 >YAL059W-36310.36514 Putative promoter sequence
♦GGACCCTTTAGAAGTATTGAATGTG
 >YAL058W-37154.37469 Putative promoter sequence
 CGCATTCTTCTTTTTTCTATCT
line == "GGACCCTTTAGAAGTATTGAATGT"
title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       # Throw away the contents of the line?
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))
 >YAL059W-36310.36514 Putative promoter sequence
 GGACCCTTTAGAAGTATTGAATGTG
♦>YAL058W-37154.37469 Putative promoter sequence
 CGCATTCTTCTTTTTTCTATCT
Append to the 'data' list
title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       # Throw away the contents of the line?
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))
 >YAL059W-36310.36514 Putative promoter sequence
 GGACCCTTTAGAAGTATTGAATGTG
♦>YAL058W-37154.37469 Putative promoter sequence
 CGCATTCTTCTTTTTTCTATCT
About to read the third line (second line of sequence data)
title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       # Throw away the contents of the line?
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))
 >YAL059W-36310.36514 Putative promoter sequence
 GGACCCTTTAGAAGTATTGAATGTG
♦>YAL058W-37154.37469 Putative promoter sequence
 CGCATTCTTCTTTTTTCTATCT
line == ">YAL058W-37154.37469 Putative promoter sequence\n"
title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       # Throw away the contents of the line?
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))
 >YAL059W-36310.36514 Putative promoter sequence
 GGACCCTTTAGAAGTATTGAATGTG
 >YAL058W-37154.37469 Putative promoter sequence
♦CGCATTCTTCTTTTTTCTATCT
What do I do with the contents of 'line'?
title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       # Throw away the contents of the line?
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))
 >YAL059W-36310.36514 Putative promoter sequence
 GGACCCTTTAGAAGTATTGAATGTG
 >YAL058W-37154.37469 Putative promoter sequence
♦CGCATTCTTCTTTTTTCTATCT

Possible solution #1 - seek

Files support a seek function, which moves the location of the file pointer. Here's how to move backwards in the file:

title = infile.readline()
if not title: return None  # End-of-file?
data = []
while 1:
    line = infile.readline()
    if line.isspace():
       return FastaRecord(title, "".join(data))
    if line.startswith(">"):
       infile.seek(-len(line), 1)  # the 1 means "relative to current position"
       return FastaRecord(title, "".join(data))
    data.append(line.rstrip("\n"))

Python supports many file-like objects but not all are seekable. One example is a network connection

>>> import urllib2
>>> f = urllib2.urlopen("http://www.dalkescientific.com/writings/NBN/data/ls_orchid.fasta")
>>> f.read(40)
'>gi|2765658|emb|Z78533.1|CIZ78533 C.irap'
>>> f.read(40)
'eanum 5.8S rRNA gene and ITS1 and ITS2 D'
>>> f.read(40)
'NA\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGAT'
>>> f.read(40)
'CATTGATGAGACCGTGGAATAAACGATCGAGTG\nAATCCG'
>>> f.seek(-5, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: addinfourl instance has no attribute 'seek'
>>> 
while another is program input from a terminal or unix pipe connected to sys.stdin.

Because of the limitations in different file-like objects it's best to have a solution which works without the ability to seek.

Possible solution #2 - a new file-like object

It's easy to make a file-like object. Actually, it's easy to make a sufficiently file-like object. The FASTA record reader only uses the readline method of the file object. I can implement as a wrapper to an existing file. Here's one which is useful for debugging.

class DebugFile(object):
    def __init__(self, infile):
        self.infile = infile
    def readline(self):
        line = self.infile.readline()
        print "Read:", repr(line)
        return line

>>> f = DebugFile(open("test.fasta"))
>>> s = f.readline()
Read: '>YAL059W-36310.36514 Putative promoter sequence\n'
>>> s
'>YAL059W-36310.36514 Putative promoter sequence\n'
>>> 

I'll modify this to have an UndoFile with the new method saveline which takes a string. The string is saved so the next readline returns that string instead of reading the next line of the file. Once returned it's reset to None to mean that no line is saved.

class UndoFile:
    def __init__(self, infile):
        self.infile = infile
        self._saved = None
    
    def readline(self):
        if self._saved is not None:
            s = self._saved
            self._saved = None
            return s
        return self.infile.readline()
    
    def saveline(self, s):
        if self._saved is not None:
            raise TypeError("Only one line may be saved")
        self._saved = s
Here it is in action on a file-like object which doesn't support seeks.
>>> import urllib2
>>> infile = urllib2.urlopen("http://www.dalkescientific.com/writings/NBN/data/test.fasta")
>>> f = UndoFile(infile)
>>> f.readline()
'>YAL059W-36310.36514 Putative promoter sequence\n'
>>> f.readline()
'AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTT\n'
>>> f.readline()
'TATAATATTTGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTT\n'
>>> f.readline()
'TTTTCTTCTTTTTTTTTCCTTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAA\n'
>>> f.readline()
'GGACCCTTTAGAAGTATTGAATGTG\n'
>>> f.readline()
'>YAL058W-37154.37469 Putative promoter sequence\n'
>>> s = _
>>> s
'>YAL058W-37154.37469 Putative promoter sequence\n'
>>> f.saveline(s)
>>> f.readline()
'>YAL058W-37154.37469 Putative promoter sequence\n'
>>> 
and here's how to use it to read a FASTA record
def read_fasta_record(infile):
    line = infile.readline()
    if not line:
        return  # End-of-file
    
    # Double-check that it's a title line
    if not line.startswith(">"):
        raise TypeError(
            "The title line must start with a '>': %r" % line)
    
    title = line[1:].rstrip("\n")
    
    sequences = []
    while 1:
        line = infile.readline()
        if line.isspace():
            break
        if line.startswith(">"):
            # Read the first line of the next record.
            # Save for next time and go with what we've had
            infile.saveline(line)
            break
        sequences.append(line)
    
    return FastaRecord(title, "".join(sequences))

This is a good general-purpose solution. You can use the UndoFile with anything that parses files using readline and you can use it with anything that understands that new "saveline" protocol. Protocol is the Python word which says that there's an agreement on what a given objects must do to be labled "*-like". The more an object implements the file protocol the more it can be used any place that a file can be used.

Some of the Biopython parsers use this approach. The biggest problem is the performance. Every readline() first goes to the UndoHandle (that's the Biopython version of UndoFile) and nearly all then go to actual file object. Function calls in Python are somewhat slow.

There's also the complication in the caller of wrapping the actual file into an undo-able object.

The iterator protocol

The for-loop works on lists, dictionaries, strings, files, generator objects and other data types. The technical term for the things you iterate over is container; you iterator over elements in the container.

You can make your own objects work in a for-loop by implementing the iterator protocol. The for-loop starts by asking the object for its iterator. The object can be a container (like a string or dictionary) or an iterator. If it's an iterator it should return itself. The for-loop gets the iterator by calling the special method __iter__().

The iterator must implement two methods: __iter__() and next(). The first almost certainly returns itself. The second is more interesting. When the for-loop needs the next element from the iterator it gets it by calling next() and using the return value. If there are no more values then next() must raise an exception named StopIteration.

To see that in action, here's an iterator which counts from a given number down to 0.

>>> class Countdown(object):
...     def __init__(self, value):
...         self.value = value
...     def __iter__(self):
...         return self
...     def next(self):
...         if self.value < 0:
...             raise StopIteration
...         value = self.value
...         self.value = value-1
...         return value
... 
>>> for i in Countdown(5):
...     print i
... 
5
4
3
2
1
0
>>> 
In detail here's what the for-loop uses from the container and the iterator.
>>> container = Countdown(5)
>>> iter(container)
<__main__.Countdown object at 0x554f0>
>>> iterobj = iter(container)
>>> iterobj.next()
5
>>> iterobj.next()
4
>>> iterobj.next()
3
>>> iterobj.next()
2
>>> iterobj.next()
1
>>> iterobj.next()
0
>>> iterobj.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 8, in next
StopIteration
>>> 

Possible solution #3 - a FASTA record iterator

I could make an iterator over the records in a FASTA file pretty easily if I have a read_fasta_record function which reads a record and has the file handle set up for the next record.

import fasta_reader
class FastaRecordReader(object):
    def __init__(self, infile):
        self.infile = infile
    def __iter__(self):
        return self
    def next(self):
        x = fasta_reader.read_fasta_record(self.infile)
        if x is None:
            raise StopIteration
        return x

>>> for rec in FastaRecordReader(open("simple.fasta")):
...     print rec.title
... 
first sequence record
second sequence record
third sequence record
>>> 
This is effectively identical to the generator solution of
import fasta_reader
def read_fasta_records(infile):
    while 1:
        rec = fasta_reader.read_fasta_record(infile)
        if rec is None:
            break
        yield rec

What's useful about the FastaRecordReader class solution is that the object provides a place to store information needed to read the next record, so long as there isn't lookahead information which needs to be shared with other parsers which expect a standard file object.

Instead of saving the look-ahead information in its own class, I'll merge it with the record reading code to get a new FastaRecordReader. The interface is unchanged, except that this will correctly handle cases where there is no blank line to end a record.

class FastaRecordReader(object):
    def __init__(self, infile):
        self.infile = infile
        self._saved = None
    def __iter__(self):
        return self
    def next(self):
        # The first line could come from the lookahead
        if self._saved is not None:
            line, self._saved = self._saved, None
        else:
            line = self.infile.readline()
            if not line:
                # end-of-file
                raise StopIteration

        # Double-check that it's a title line
        if not line.startswith(">"):
            raise TypeError(
                "The title line must start with a '>': %r" % line)

        title = line[1:].rstrip("\n")
        
        sequences = []
        while 1:
            line = self.infile.readline()
            if line.isspace():
                break
            if line.startswith(">"):
                self._saved = line
                break
            sequences.append(line)

        return FastaRecord(title, "".join(sequences))

Possible solution #4 - a generator

I showed how the first FastaRecordReader class was "effectively identical to the generator implementation of read_fasta_records." The same is true with the improved FastaRecordReader – I can make an improved generator version. The code only changes in a few places; I use the local variable "saved" instead of the instance attribute "self._saved" and I "yield" items in a while loop instead of "return"ing items each time the function is called.

def read_fasta_records(infile):
    saved = None
    while 1:
        # Look for the start of a record
        if saved is not None:
            line = saved
            saved = None
        else:
            line = infile.readline()
            if not line:
                return
        
        # skip blank lines
        if line.isspace():
            continue
        
        # Double-check that it's a title line
        if not line.startswith(">"):
            raise TypeError(
                "The title line must start with a '>': %r" % line)
        
        title = line.rstrip()
        
        # Read the sequence lines until the next record, a blank
        # line, or the end of file
        sequences = []
        
        while 1:
            line = infile.readline()
            if not line or line.isspace():
                break
            if line.startswith(">"):
                # The start of the next record
                saved = line
                break
            sequences.append(line.rstrip("\n"))
        
        yield FastaRecord(title, "".join(sequences))

The only major difference is in how they save state. The class approach uses instance variables while the generator uses local variables. The technical term for saving state in local variables is closure. Some people prefer closures over objects and classes.

I've found that the generator code for examples like this are easier than a class-based solution, but mostly because everything is in one function instead of in two methods of a class. In theory anything one one form has a solution in the other form and the translation from one to the other is mechanical. My experience says that the differences are also very minor.

(The more complex translations - though still mechanical - occur when the closure solution yields when inside of several loops. Bioinformatics data files are made of records and they can be processed a record at a time so this doesn't happen frequently. I've only needed the more complex class solution once.)

Which is best?

That depends on your style and what you need it for. I usually prefer the generator function first (easiest for me to use), the look-ahead solution second (very general and easiest to undestand), and the class-based one last.



Copyright © 2001-2013 Andrew Dalke Scientific AB