Dalke Scientific Software: More science. Less time. Products
Products

Python generators

In the parsing lecture I wrote two functions. One was read_fasta_record() which read a single record from a FASTA file and left the input file ready for processing the next record. I also write a read_fasta_records() function for reading all of the FASTA records into a list. Here's how it was implemented:

def read_fasta_records(input_file):
    records = []
    while 1:
        record = read_fasta_record(input_file)
        if record is None:
            break
        records.append(record)
    return records

In the exercises you used the read_fasta_records() function because that was easier to use. Compare these two programs which compute the average sequence length for FASTA records.

num_records = 0
total_size = 0

infile = open("br_sequences.fasta")
for record in fasta_reader.read_fasta_records(infile):
    num_records += 1
    total_size += len(record.sequence)

print "Average sequence length:", float(total_size) / num_records
num_records = 0
total_size = 0

infile = open("br_sequences.fasta")
while 1:
    record = fasta_reader.read_fasta_record(infile)
    if record is None:
        break
    num_records += 1
    total_size += len(record.sequence)

print "Average sequence length:", float(total_size) / num_records
The first is clearly easier to understand.

The problem with the first approach is that it entire contents of the file into memory. If there are a million records then the list will have a million FastaRecord entries. This may take a lot of memory.

Even if memory isn't a problem, suppose you want the first 10 records where the sequences start with "A", and no more data from the file. It might be that the first 10 records of the input all start with "A" but because read_fasta_records() reads all the records first, there's no way to stop when there have been enough matches.

The (modern) Python solution is to use generators. The for statement in Python works on anything which can be iterated. Iterate over a string and you get the characters in the string.

>>> for x in "Hi!":
...     print x
... 
H
i
!
>>> 
Iterate over a list and you get the elements of the list.
>>> for x in ["A", "list", "of", "words"]:
...     print x
... 
A
list
of
words
>>> 
Iterate over a dictionary and you get the keys.
>>> for x in {"zero": 0, "one": 1, "two": 2}:
...     print x
... 
zero
two
one
>>> 
Iterate over a open file and you get the lines in the file
>>> for x in open("simple.fasta"):
...     print repr(x)
... 
'>first sequence record\n'
'TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA\n'
'\n'
'>second sequence record\n'
'CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA\n'
'AGTTTATATATAAATTTCCTTTTTATTGGA\n'
'\n'
'>third sequence record\n'
'GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA\n'
'TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA\n'
'\n'
>>> 
The file example is a nice one because it shows that iterators only work once. They go "forward" and there's no way to backup or restart an iterator.

A generator is a special type of function that can be used as an iterator. A generator function is any function which uses the yield statement. Here's an example generator function first created then used in a for loop.

>>> def spam():
...   yield "one"
...   yield "two"
...   yield "three"
... 
>>> spam()
<generator object at 0x642d8>
>>> for x in spam():
...     print x
... 
one
two
three
>>> 
When you call a generator function it returns a generator object. This object stores information about the original function and the state of function call. (By "state" I mean the values of the variables in the function and the location of the code to run next.) The inital location is the start of the function.

When the for loop asks for the next element the generator starts (or continues) function execution at the place it left off. It runs until it reaches a yield statement. The value of the yield statement is used as the next element in the generator.

There are three ways to tell the for loop that there are no more elements in the generator: let the program execution reach the end of the function ("fall off the end" in programmer speak), use a return statement (it may not return any value), or raise a StopIteration exception.

The function call is suspended at the yield statement. The generator stores the function call state and returns the given element. To return the next item it resumes the function call with the next statement after the yield

Here's an example which shows that lines in the generator function are only executed when needed. Recall that zip() stops with the shortest iterable.

>>> def show():
...   print "I am here"
...   yield "Hello!"
...   print "you are there"
...   yield "Bye!"
... 
>>> for (x, y) in zip([0], show()):
...   print "-->", repr(x), repr(y)
... 
I am here
--> 0 'Hello!'
>>> for (x, y) in zip([0, 1], show()):
...   print "-->", repr(x), repr(y)
... 
I am here
you are there
--> 0 'Hello!'
--> 1 'Bye!'
>>> for (x, y) in zip([0, 1, 2], show()):
...   print "-->", repr(x), repr(y)
... 
I am here
you are there
--> 0 'Hello!'
--> 1 'Bye!'
>>> 

Generators can be more complex. Here's a yield statement inside of a for loop.

>>> def generate_ints(N):
...   for i in range(N):
...     yield i
... 
>>> for x in generate_ints(5):
...   print x
... 
0
1
2
3
4
>>> 

Turning a generator into a list is easy. The list() call converts any iterable (string, dictionary, file handle, etc.) into a list.

>>> list(generate_ints(5)) 
[0, 1, 2, 3, 4]
>>>

The examples so far were finite generators. Generators can also be infinite. Here's one which produces an infinite number of the given input.

>>> def forever(x):
...   while 1:
...     yield x
... 
>>> count = 0
>>> for item in forever(10):
...   count = count + 1 
...   if count > 1000:
...     print "That was enough"
...     break
... 
That was enough
>>> count
1001
>>> item
10
>>>
If you're curious, try doing
list(forever(0))
You should control-C or kill Python soon otherwise you'll run out of memory.

For those with some mathematical background, here's an example of a useful infinite generator for the Fibonacci series. You can see that the generator stores local variables for the next use

>>> def fib():
...     a, b = 0, 1
...     while 1:
...         yield b
...         a, b = b, a+b
... 
>>> for (i, f) in zip(range(10), fib()):
...   print i, "=", f
... 
0 = 1
1 = 1
2 = 2
3 = 3
4 = 5
5 = 8
6 = 13
7 = 21
8 = 34
9 = 55
>>>  

You know enough now to understand how to implement read_fasta_records() using yield and read_fasta_record().

def read_fasta_records(infile):
    while 1:
        rec = read_fasta_record(infile)
        if rec is None:
            break  # I could also use 'return' here
        yield rec
Here it is in use
>>> read_fasta_records(open("simple.fasta")) 
<generator object at 0x7f508>
>>> for rec in read_fasta_records(open("simple.fasta")):
...     print repr(rec.title)
... 
'first sequence record'
'second sequence record'
'third sequence record'
>>> 

Iterators allow what is sometimes called lazy-evaluation because elements aren't created until needed.

Iterators are becoming a more core component of Python. The itertools module, part of the standard library, contains iterators designed to work with other iterators. Here's an example of how to get the first 10 Fibonacci terms using islice()

>>> import itertools 
>>> itertools.islice(fib(), 10)
<itertools.islice object at 0x71f90>
>>> list(itertools.islice(fib(), 10))
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
>>> 
This works like the [start:end:step] notation for getting portions of a list. In Python that's called slicing a list. The islice() call is a slice of an iterator.

Another handy itertools function is ifilter(). This is like the standard filter command except that it works on iterators instead of lists. It's makes a new iterator which returns those elements of the old iterator which pass the given function.

Here's a function which returns True if the record's sequence starts with an "MK", otherwise it returns False.

def startswith_MK(rec):
  return rec.sequence.startswith("MK")
 
I'll use it to count the number of records in the br_sequences.fasta file which start with an "MK".
>>> def startswith_MK(rec):
...     return rec.sequence.startswith("MK")
... 
>>> data = itertools.ifilter(startswith_MK,
...               read_fasta_records(open("br_sequences.fasta")))
>>> data
<itertools.ifilter object at 0x935f0>
>>> len(list(data))
15
>>> 
Here are the first 10 titles from the records where the sequence starts with "MK"

>>> for rec in itertools.islice(
...     itertools.ifilter(startswith_MK, read_fasta_records(open("br_sequences.fasta"))),
...                           10):
...     print rec.title
... 
gi|71083330|ref|YP_266049.1| bacteriorhodopsin [Candidatus Pelagibacter ubique HTCC1062]
gi|71062443|gb|AAZ21446.1| bacteriorhodopsin [Candidatus Pelagibacter ubique HTCC1062]
gi|67906782|gb|AAY82845.1| proteorhodopsin [uncultured bacterium MedeBAC46A06]
gi|67906661|gb|AAY82751.1| predicted proteorhodopsin [uncultured bacterium eBACmed18B02]
gi|67906506|gb|AAY82613.1| proteorhodopsin [uncultured bacterium MedeBAC35C06]
gi|67483470|gb|AAY68055.1| proteorhodopsin [uncultured bacterium]
gi|67483454|gb|AAY68047.1| proteorhodopsin [uncultured bacterium]
gi|32699616|sp|Q9F7P4|PRRG_PRB01 Green-light absorbing proteorhodopsin precursor (GPR)
gi|47779380|gb|AAT38609.1| predicted proteorhodopsin [uncultured gamma proteobacterium eBACHOT4E07]
gi|45644626|gb|AAS73014.1| proteorhodopsin [uncultured marine gamma proteobacterium EBAC20E09]

>>> 
while here are the titles from the first 10 records which also have a sequence starting with "MK".

>>> for rec in itertools.ifilter(startswith_MK,
...         itertools.islice(read_fasta_records(open("br_sequences.fasta")), 10)):
...     print rec.title
... 
gi|71083330|ref|YP_266049.1| bacteriorhodopsin [Candidatus Pelagibacter ubique HTCC1062]
gi|71062443|gb|AAZ21446.1| bacteriorhodopsin [Candidatus Pelagibacter ubique HTCC1062]

>>> 
Function composition of iterators can be quite powerful, though as you can see has a tendency to become difficult to read.



Contact Us | Home
Copyright © 2001-2013 Andrew Dalke Scientific AB. All rights reserved.
Company
Contact Us
News