Parsing FASTA files
What, again?
FASTA files are a bit more complex than what I talked about a couple weeks ago. That FASTA parser won't handle some FASTA files. It assumes there is a blank line between each record. I made that assumption because it's easier to write a parser when there is a well-defined "end of record" identifier. Real FASTA files don't always have that blank line, as with the following.
>YAL059W-36310.36514 Putative promoter sequence AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTT TATAATATTTGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTT TTTTCTTCTTTTTTTTTCCTTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAA GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence TTTCATATGAAAGGTCCTAGGAATACACGATTCTTGTACGCATTCTTCTTTTTTCTATCT TCTTTCATTCTTTGTACATTAGATAACATGGTTTTAGCTTAGTTTTATTTTATTTTTTAT ATATCTGGATGTATACTATTATTGAAAAACTTCATTAATAGTTACAACTTTTTCAATATC AAGTTGATTAAGAAAAAGAAAATTATTATGGGTTAGCTGAAAACCGTGTGATGCATGTCG TTTAAGGATTGTGTAAAAAAGTGAACGGCAACGCATTTCTAATATAGATAACGGCCACAC AAAGTAGTACTATGAA >YAL056W-38979.39264 Putative promoter sequence AAGTGTTAGTTTATAACATGGTCTCAATAATTGCACCACAACGGCTTCTCTTTTATAGAT GGTTAACATTATAGTATCAATATTATCATCATGATTAAATGATGATGTATAATACTTACC CGATGTTAAATCTTATTTTTTCATGCAGTAAGTAATCATGCAACAAGAAAAACCCGTAAT TAAGCGAACATAGAACAACTAGCATCCCCGATAAGACGGAATAGAATAGTAAAGATTGTG ATTCATTGGCAGGTCCATTGTCGCATTACTAAATCATAGGCATGGALess often you'll come across FASTA files with extra blank lines between records or at the beginning or end of the file, but I've never seen one with an extra blank line inside of a record. I expect that many parsers won't handle that case correctly.
What happens if I use the old FASTA parser to read these FASTA records? I'll save them to the file "test.fasta" and use fasta_reader.py as a library to read the first record in the file.
>>> import fasta_reader >>> rec = fasta_reader.read_fasta_record(open("test.fasta")) >>> rec.title 'YAL059W-36310.36514 Putative promoter sequence' >>> rec.sequence 'AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTTTATAATATT TGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTTTTTTCTTCTTTTTTTTTCC TTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAAGGACCCTTTAGAAGTATTGAATGTG> YAL058W-37154.37469 Putative promoter sequenceTTTCATATGAAAGGTCCTAGGAAT ACACGATTCTTGTACGCATTCTTCTTTTTTCTATCTTCTTTCATTCTTTGTACATTAGATAACATGGTTT TAGCTTAGTTTTATTTTATTTTTTATATATCTGGATGTATACTATTATTGAAAAACTTCATTAATAGTTA CAACTTTTTCAATATCAAGTTGATTAAGAAAAAGAAAATTATTATGGGTTAGCTGAAAACCGTGTGATGC ATGTCGTTTAAGGATTGTGTAAAAAAGTGAACGGCAACGCATTTCTAATATAGATAACGGCCACACAAAG TAGTACTATGAA>YAL056W-38979.39264 Putative promoter sequenceAAGTGTTA GTTTATAACATGGTCTCAATAATTGCACCACAACGGCTTCTCTTTTATAGATGGTTAACATTATAGTATC AATATTATCATCATGATTAAATGATGATGTATAATACTTACCCGATGTTAAATCTTATTTTTTCATGCAG TAAGTAATCATGCAACAAGAAAAACCCGTAATTAAGCGAACATAGAACAACTAGCATCCCCGATAAGACG GAATAGAATAGTAAAGATTGTGATTCATTGGCAGGTCCATTGTCGCATTACTAAATCATAGGCATGGA' >>>There is no blank line so the code that reads the sequence lines kept on reading lines until it reached the end of the file. All of the rest of the file was interpreted as sequence, so the other two title files were included in the sequence output.
You know how I keep talking about sanity checking? I should have done a bit more checking to make sure that I didn't use a line starting with a ">" as a sequence line. The change would occur in the following of the existing code
sequence_lines = [] while 1: line = infile.readline().rstrip() if line == "": break sequence_lines.append(line)and the code with the change would look like
sequence_lines = [] while 1: line = infile.readline().rstrip() if line == "": break elif line.startswith(">"): raise TypeError("Read title line when expecting sequence line") sequence_lines.append(line)I'm not going to show the full program with this change because I'm instead going to handle that case correctly.
Look-ahead
The problem with this more complex FASTA format is that there's no way to identify the end of a record except by reading the start of the next record. If I read a line after the title line it could be sequence data or it could be a blank line (marking the end of the sequence data) or it could be the next record's title line (implicitly marking the end of the sequence data and the start of the next record). In computer science terms, it needs to look-ahead to see if it's at the end of a record.
My fasta_record_reader() function reads a function and leaves the input file positioned ready to read the next record. After it reads a record it's ready to read the next record. If there's no blank line then it will read the first line of the next record, notice it's finished with the original record and return it to the caller. But in this case it's read one line too many. The file pointer will be on the sequence data line after the title line of the next record.
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): # Throw away the contents of the line? return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n")) |
♦>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
title == ">YAL059W-36310.36514 Putative promoter sequence\n" | |||
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): # Throw away the contents of the line? return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n")) |
>YAL059W-36310.36514 Putative promoter sequence ♦GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
| |||
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): # Throw away the contents of the line? return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n")) |
>YAL059W-36310.36514 Putative promoter sequence ♦GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
line == "GGACCCTTTAGAAGTATTGAATGT" | |||
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): # Throw away the contents of the line? return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n")) |
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG ♦>YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): # Throw away the contents of the line? return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n")) |
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG ♦>YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
| |||
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): # Throw away the contents of the line? return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n")) |
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG ♦>YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
line == ">YAL058W-37154.37469 Putative promoter sequence\n" | |||
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): # Throw away the contents of the line? return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n")) |
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence ♦CGCATTCTTCTTTTTTCTATCT | ||
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): # Throw away the contents of the line? return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n")) |
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence ♦CGCATTCTTCTTTTTTCTATCT |
Possible solution #1 - seek
Files support a seek function, which moves the location of the file pointer. Here's how to move backwards in the file:
title = infile.readline() if not title: return None # End-of-file? data = [] while 1: line = infile.readline() if line.isspace(): return FastaRecord(title, "".join(data)) if line.startswith(">"): infile.seek(-len(line), 1) # the 1 means "relative to current position" return FastaRecord(title, "".join(data)) data.append(line.rstrip("\n"))
Python supports many file-like objects but not all are seekable. One example is a network connection
>>> import urllib2 >>> f = urllib2.urlopen("http://www.dalkescientific.com/writings/NBN/data/ls_orchid.fasta") >>> f.read(40) '>gi|2765658|emb|Z78533.1|CIZ78533 C.irap' >>> f.read(40) 'eanum 5.8S rRNA gene and ITS1 and ITS2 D' >>> f.read(40) 'NA\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGAT' >>> f.read(40) 'CATTGATGAGACCGTGGAATAAACGATCGAGTG\nAATCCG' >>> f.seek(-5, 1) Traceback (most recent call last): File "<stdin>", line 1, in ? AttributeError: addinfourl instance has no attribute 'seek' >>>while another is program input from a terminal or unix pipe connected to sys.stdin.
Because of the limitations in different file-like objects it's best to have a solution which works without the ability to seek.
Possible solution #2 - a new file-like object
It's easy to make a file-like object. Actually, it's easy to make a sufficiently file-like object. The FASTA record reader only uses the readline method of the file object. I can implement as a wrapper to an existing file. Here's one which is useful for debugging.
class DebugFile(object): def __init__(self, infile): self.infile = infile def readline(self): line = self.infile.readline() print "Read:", repr(line) return line >>> f = DebugFile(open("test.fasta")) >>> s = f.readline() Read: '>YAL059W-36310.36514 Putative promoter sequence\n' >>> s '>YAL059W-36310.36514 Putative promoter sequence\n' >>>
I'll modify this to have an UndoFile with the new method saveline which takes a string. The string is saved so the next readline returns that string instead of reading the next line of the file. Once returned it's reset to None to mean that no line is saved.
class UndoFile: def __init__(self, infile): self.infile = infile self._saved = None def readline(self): if self._saved is not None: s = self._saved self._saved = None return s return self.infile.readline() def saveline(self, s): if self._saved is not None: raise TypeError("Only one line may be saved") self._saved = sHere it is in action on a file-like object which doesn't support seeks.
>>> import urllib2 >>> infile = urllib2.urlopen("http://www.dalkescientific.com/writings/NBN/data/test.fasta") >>> f = UndoFile(infile) >>> f.readline() '>YAL059W-36310.36514 Putative promoter sequence\n' >>> f.readline() 'AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTT\n' >>> f.readline() 'TATAATATTTGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTT\n' >>> f.readline() 'TTTTCTTCTTTTTTTTTCCTTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAA\n' >>> f.readline() 'GGACCCTTTAGAAGTATTGAATGTG\n' >>> f.readline() '>YAL058W-37154.37469 Putative promoter sequence\n' >>> s = _ >>> s '>YAL058W-37154.37469 Putative promoter sequence\n' >>> f.saveline(s) >>> f.readline() '>YAL058W-37154.37469 Putative promoter sequence\n' >>>and here's how to use it to read a FASTA record
def read_fasta_record(infile): line = infile.readline() if not line: return # End-of-file # Double-check that it's a title line if not line.startswith(">"): raise TypeError( "The title line must start with a '>': %r" % line) title = line[1:].rstrip("\n") sequences = [] while 1: line = infile.readline() if line.isspace(): break if line.startswith(">"): # Read the first line of the next record. # Save for next time and go with what we've had infile.saveline(line) break sequences.append(line) return FastaRecord(title, "".join(sequences))
This is a good general-purpose solution. You can use the UndoFile with anything that parses files using readline and you can use it with anything that understands that new "saveline" protocol. Protocol is the Python word which says that there's an agreement on what a given objects must do to be labled "*-like". The more an object implements the file protocol the more it can be used any place that a file can be used.
Some of the Biopython parsers use this approach. The biggest problem is the performance. Every readline() first goes to the UndoHandle (that's the Biopython version of UndoFile) and nearly all then go to actual file object. Function calls in Python are somewhat slow.
There's also the complication in the caller of wrapping the actual file into an undo-able object.
The iterator protocol
The for-loop works on lists, dictionaries, strings, files, generator objects and other data types. The technical term for the things you iterate over is container; you iterator over elements in the container.
You can make your own objects work in a for-loop by implementing the iterator protocol. The for-loop starts by asking the object for its iterator. The object can be a container (like a string or dictionary) or an iterator. If it's an iterator it should return itself. The for-loop gets the iterator by calling the special method __iter__().
The iterator must implement two methods: __iter__() and next(). The first almost certainly returns itself. The second is more interesting. When the for-loop needs the next element from the iterator it gets it by calling next() and using the return value. If there are no more values then next() must raise an exception named StopIteration.
To see that in action, here's an iterator which counts from a given number down to 0.
>>> class Countdown(object): ... def __init__(self, value): ... self.value = value ... def __iter__(self): ... return self ... def next(self): ... if self.value < 0: ... raise StopIteration ... value = self.value ... self.value = value-1 ... return value ... >>> for i in Countdown(5): ... print i ... 5 4 3 2 1 0 >>>In detail here's what the for-loop uses from the container and the iterator.
>>> container = Countdown(5) >>> iter(container) <__main__.Countdown object at 0x554f0> >>> iterobj = iter(container) >>> iterobj.next() 5 >>> iterobj.next() 4 >>> iterobj.next() 3 >>> iterobj.next() 2 >>> iterobj.next() 1 >>> iterobj.next() 0 >>> iterobj.next() Traceback (most recent call last): File "<stdin>", line 1, in ? File "<stdin>", line 8, in next StopIteration >>>
Possible solution #3 - a FASTA record iterator
I could make an iterator over the records in a FASTA file pretty easily if I have a read_fasta_record function which reads a record and has the file handle set up for the next record.
import fasta_reader class FastaRecordReader(object): def __init__(self, infile): self.infile = infile def __iter__(self): return self def next(self): x = fasta_reader.read_fasta_record(self.infile) if x is None: raise StopIteration return x >>> for rec in FastaRecordReader(open("simple.fasta")): ... print rec.title ... first sequence record second sequence record third sequence record >>>This is effectively identical to the generator solution of
import fasta_reader def read_fasta_records(infile): while 1: rec = fasta_reader.read_fasta_record(infile) if rec is None: break yield rec
What's useful about the FastaRecordReader class solution is that the object provides a place to store information needed to read the next record, so long as there isn't lookahead information which needs to be shared with other parsers which expect a standard file object.
Instead of saving the look-ahead information in its own class, I'll merge it with the record reading code to get a new FastaRecordReader. The interface is unchanged, except that this will correctly handle cases where there is no blank line to end a record.
class FastaRecordReader(object): def __init__(self, infile): self.infile = infile self._saved = None def __iter__(self): return self def next(self): # The first line could come from the lookahead if self._saved is not None: line, self._saved = self._saved, None else: line = self.infile.readline() if not line: # end-of-file raise StopIteration # Double-check that it's a title line if not line.startswith(">"): raise TypeError( "The title line must start with a '>': %r" % line) title = line[1:].rstrip("\n") sequences = [] while 1: line = self.infile.readline() if line.isspace(): break if line.startswith(">"): self._saved = line break sequences.append(line) return FastaRecord(title, "".join(sequences))
Possible solution #4 - a generator
I showed how the first FastaRecordReader class was "effectively identical to the generator implementation of read_fasta_records." The same is true with the improved FastaRecordReader – I can make an improved generator version. The code only changes in a few places; I use the local variable "saved" instead of the instance attribute "self._saved" and I "yield" items in a while loop instead of "return"ing items each time the function is called.
def read_fasta_records(infile): saved = None while 1: # Look for the start of a record if saved is not None: line = saved saved = None else: line = infile.readline() if not line: return # skip blank lines if line.isspace(): continue # Double-check that it's a title line if not line.startswith(">"): raise TypeError( "The title line must start with a '>': %r" % line) title = line.rstrip() # Read the sequence lines until the next record, a blank # line, or the end of file sequences = [] while 1: line = infile.readline() if not line or line.isspace(): break if line.startswith(">"): # The start of the next record saved = line break sequences.append(line.rstrip("\n")) yield FastaRecord(title, "".join(sequences))
The only major difference is in how they save state. The class approach uses instance variables while the generator uses local variables. The technical term for saving state in local variables is closure. Some people prefer closures over objects and classes.
I've found that the generator code for examples like this are easier than a class-based solution, but mostly because everything is in one function instead of in two methods of a class. In theory anything one one form has a solution in the other form and the translation from one to the other is mechanical. My experience says that the differences are also very minor.
(The more complex translations - though still mechanical - occur when the closure solution yields when inside of several loops. Bioinformatics data files are made of records and they can be processed a record at a time so this doesn't happen frequently. I've only needed the more complex class solution once.)
Which is best?
That depends on your style and what you need it for. I usually prefer the generator function first (easiest for me to use), the look-ahead solution second (very general and easiest to undestand), and the class-based one last.
Copyright © 2001-2020 Andrew Dalke Scientific AB