Dalke Scientific Software: More science. Less time. Products

By this time you know the routine. Put the answers in a directory named "assignment6". Include a README file and any other files (Python, data files, etc.) needed to do the assignment. Email me a gzip'ed or bzip'ed tar file of the directory, or a zip file.

When will I grade the previous homework? I like grading even less than you like doing the homework! Maybe when Janet starts teaching ontologies I'll have the time and inclination...

Part 1

This will build on the generators lecture from last week. Put your function definitions in the Python file named "codon_functions.py". Make sure the file can be imported and used as a module. Once all the functions are implements you can test the module by putting test_codon_functions.py in your assignment directory and running it as python test_codon_functions .

Generate codons

Write a generator function named "get_codons" which takes a DNA sequence as a string and yields each 3-letter codon.

Note: The sequence might not have a multiple of 3 bases. If that happens you must exclude the final 1- or 2-base term.

To help, here is code to test your function.

def test_get_codons():
    for seq, expected_codons in (
        # These lengths are a multiple of 3
        ("", []),
        ("ATC", ["ATC"]),
        ("ATCGAT", ["ATC", "GAT"]),
        ("ATCGTGCATAGACTATGCAATATACCG",
           ["ATC","GTG","CAT","AGA","CTA","TGC","AAT","ATA","CCG"]),

        # These lengths are not a multiple of 3
        ("A", []),
        ("TC", []),
        ("TTTC", ["TTT"]),
        ("TTTCC", ["TTT"]),
        ("AAATTTCCCGGGATCG", ["AAA", "TTT", "CCC", "GGG", "ATC"]),
        ):

        codons = list(get_codons(seq))
        if codons != expected_codons:
            raise AssertionError("Codons for %r was %r, expected %r" %
                                 (seq, codons, expected_codons))

    # This checks that the 'get_codon' function returns a generator
    gen = get_codons("ATC")
    if gen.next() != "ATC":
        raise AssertionError("Could not get the codon")
    try:
        gen.next()
    except StopIteration:
        pass
    else:
        raise AssertionError("Only supposed to find one codon")

def test():
    test_get_codons()
        
if __name__ == "__main__":
    test()
    print "All tests passed."

Generate translated bases

Write a generator function named "translate_codons" which converts each codon from get_codons() into protein residues. The standard translation table is

table = {
     'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L', 'TCT': 'S',
     'TCC': 'S', 'TCA': 'S', 'TCG': 'S', 'TAT': 'Y', 'TAC': 'Y',
     'TGT': 'C', 'TGC': 'C', 'TGG': 'W', 'CTT': 'L', 'CTC': 'L',
     'CTA': 'L', 'CTG': 'L', 'CCT': 'P', 'CCC': 'P', 'CCA': 'P',
     'CCG': 'P', 'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
     'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'ATT': 'I',
     'ATC': 'I', 'ATA': 'I', 'ATG': 'M', 'ACT': 'T', 'ACC': 'T',
     'ACA': 'T', 'ACG': 'T', 'AAT': 'N', 'AAC': 'N', 'AAA': 'K',
     'AAG': 'K', 'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
     'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V', 'GCT': 'A',
     'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAT': 'D', 'GAC': 'D',
     'GAA': 'E', 'GAG': 'E', 'GGT': 'G', 'GGC': 'G', 'GGA': 'G',
     'GGG': 'G', }
stop_codons = [ 'TAA', 'TAG', 'TGA', ],
start_codons = [ 'TTG', 'CTG', 'ATG', ]

Various codon tables are available in Biopython so you should use that instead. The library also handles the ambiguous encodings so "ATH" (where "H" means "A", "C", or "T") encodes for isoleucine because "ATA", "ATC" and "ATT" all code for isoleucine. Here are examples of use:
>>> from Bio.Data import CodonTable
>>> table = CodonTable.ambiguous_dna_by_name["Standard"]
>>> table.forward_table["ATG"]
'M'
>>> table.forward_table["ATA"]
'I'
>>> table.forward_table["ATH"]
'I'
>>>  table.stop_codons 
['TAA', 'TAG', 'TGA']
>>> table.forward_table["TAA"]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?

  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/
python2.3/site-packages/Bio/Data/CodonTable.py", line 557, in __getitem__
    raise KeyError, codon  # it's a stop codon
KeyError: 'TAA'
>>> 

The KeyError is used for anything which can't be translated, including stop codons. Your function must test if the codon is a stop codon and return a "*" for that case.

Here is the new test code for this function

def test_translate_codons():
    # A quick test to make sure the function allows lists of codons
    residues = list(translate_codons( ["GAG", "AAG", "TTG", "GCT", "GAT"]))
    if residues != ["E", "K", "L", "A", "D"]:
        raise AssertionError("Residues for list data was %r, expected %r" %
                             (residues, ["E", "K", "L", "A", "D"]))

    for (seq, expected_residues) in (
        # Multiple of 3, no stop codons, not ambiguous
        ("", []),
        ("GCT", ["A"]),
        ("GCTAAT", ["A", "N"]),
        ("TGGGAGCGTGATAATGCT", ["W", "E", "R", "D", "N", "A"]),
        
        # Not multiples of 3, no stop codons, not ambiguous
        ("A", []),
        ("GAGAAGTTGGCTGAT", ["E", "K", "L", "A", "D"]),
        
        # Test the stop codon
        ("TAATAGTGA", ["*", "*", "*"]),
        ("AAATAAAAATGAAAA", ["K", "*", "K", "*", "K"]),
        
        # Test the ambiguous codes
        ("ATH", ["I"]),
        ("AGRAGY", ["R", "S"]),
        ("AAAAGRTAGATHC", ["K", "R", "*", "I"]),
        ):
        
        residues = list(translate_codons(get_codons(seq)))
        if residues != expected_residues:
            raise AssertionError("Residues for %r was %r, expected %r" %
                                 (seq, residues, expected_residues))
    
    # This checks that the 'translate_codons' function uses an iterator
    # and returns a generator
    def yield_codons():
        yield "GCT"
        raise AssertionError("I didn't ask for the second codon")

    gen = translate_codons(yield_codons())
    if gen.next() != "A":
        raise AssertionError("Could not get the codon")
    try:
        # Make sure it raises the expected
        gen.next()
    except AssertionError:
        pass
    else:
        raise AssertionError("There was a second item?")


def test():
    test_get_codons()
    test_translate_codons()
      
  

I'll get picky before anyone corrects me - biologically speaking translations occur from RNA to protein so this skips the transcription step.

Translate to stop codon

Translations stop at the stop codon. Add the function "translate_codons_to_stop" which takes a codon iterator and yields the translated protein residue until the end of the input or up to the stop codon. Here's some test, which isn't much changed from "test_translate_codons".

def test_translate_codons_to_stop():
    for (seq, expected_residues) in (
        # Multiple of 3, no stop codons, not ambiguous
        ("", []),
        ("GCT", ["A"]),
        ("GCTAAT", ["A", "N"]),
        ("TGGGAGCGTGATAATGCT", ["W", "E", "R", "D", "N", "A"]),
        
        # Not multiples of 3, no stop codons, not ambiguous
        ("A", []),
        ("GAGAAGTTGGCTGAT", ["E", "K", "L", "A", "D"]),
        
        # Test the stop codon
        ("TAATAGTGA", []),
        ("AAATAAAAATGAAAA", ["K"]),
        ("GAGAAATGA", ["E", "K"]),
        
        # Test the ambiguous codes
        ("ATH", ["I"]),
        ("AGRAGY", ["R", "S"]),
        ("AAAAGRTAGATHC", ["K", "R"]),
        ):
        
        residues = list(translate_codons_to_stop(get_codons(seq)))
        if residues != expected_residues:
            raise AssertionError("Residues for %r was %r, expected %r" %
                                 (seq, residues, expected_residues))
    
    # This checks that the 'translate_codons_to_stop' function uses an iterator
    # and returns a generator
    def yield_codons():
        yield "GCT"
        raise AssertionError("I didn't ask for the second codon")

    gen = translate_codons_to_stop(yield_codons())
    if gen.next() != "A":
        raise AssertionError("Could not get the codon")
    try:
        # Make sure it raises the expected
        gen.next()
    except AssertionError:
        pass
    else:
        raise AssertionError("There was a second item?")

Translate reading frames (optional)

The previous function assumes the first codon is part of the translation. Translation actually starts at a start codon and goes to the end codon. The translation may occur in one of 6 different reading frames. ( See here for background.) To get the start codons from the CodonTable

>>> table.start_codons 
['TTG', 'CTG', 'ATG']
>>> 

For this assignment assume that get_codons() returns codons for the proper reading frame. In your "codon_functions.py" module add a new function named "translate_reading_frame" which takes the output from get_codons and returns the translated region from the first start codon up to the end codon or the end of the sequence. The start codon is included in the protein sequence but the stop residue is not.

Hint: There are many ways to implement this. One of the simpler ways uses the iter() function. This converts a list or iterator into an iterator. (Generator functions are one type of itererator.) A nice thing about iterators is that when it's used in a for loop it starts from where it left off in the previous for loop. Lists, by comparison, start again from the beginning. Here's an example of the difference

>>> seq = "Hello Cape Town!"
>>> seq_iter = iter(seq)
>>> 
>>> def show_word(seq_iter):
...   for c in seq_iter:
...     if c == " ":
...       break
...     print repr(c),
...   print
... 
>>> show_word(seq_iter)
'H' 'e' 'l' 'l' 'o'
>>> show_word(seq_iter)
'C' 'a' 'p' 'e'
>>> show_word(seq_iter)
'T' 'o' 'w' 'n' '!'
>>> show_word(seq_iter)

>>> show_word(seq)
'H' 'e' 'l' 'l' 'o'
>>> show_word(seq)
'H' 'e' 'l' 'l' 'o'
>>> show_word(seq)
'H' 'e' 'l' 'l' 'o'
>>> 

Another way is to use a composition of two of the functions from the itertools module. This is probably the most elegant solution but it takes some practice to think about how to solve it this way. If you have the time, try this approach as well and let me know what you think about them.

Here is test code for your new function

def test_translate_reading_frame():
    for (seq, expected_orf) in (
        ("", []),
        
        # Has a start and an end
        ("CTGAAATGA", ["L", "K"]),
        # start codon is the 3rd residue
        ("ATHAGYCTGAAATGA", ["L", "K"]),
        # make sure it does not read multiple reading frames
        ("ATHAGYCTGAAATGAATHAGYCTGAAATGA", ["L", "K"]),

        # Test a different start codon; go to the end (no stop codon)
        ("AAAATGGGGGCT", ["M", "G", "A"]),

        # Make sure only one start codon is accepted
        ("AAAATGGGGATGGCT", ["M", "G", "M", "A"]),
        ):

        orf = list(translate_reading_frame(get_codons(seq)))
        if orf != expected_orf:
            raise AssertionError("ORF for %r was %r, expected %r" %
                                 (seq, orf, expected_orf))

    # Check that the translate_reading_frame function works with
    # both lists and iterables
    orf = translate_reading_frame(["ATH", "AGY", "CTG", "AAA", "TGA"])
    x = orf.next()
    if x != "L":
        raise AssertionError("First residue is %r" % (x,))
    x = orf.next()
    if x != "K":
        raise AssertionError("Second residue is %r" % (x,))
    try:
        x = orf.next()
    except StopIteration:
        pass
    else:
        raise AssertionError("Unexpected third residue %r" % (x,))

Remember, once you've created your "codon_functions.py" module you should test it with the test_codon_functions.py test code.

Part 2

This section applies only to those who want to practice doing HTML. If you feel comfortable already with HTML then in your README write "I am comfortable using HTML."

Write an HTML page named "bio.html" containing a short description of yourself. It should have the following:

Part 3 - making HTML pages

A - make a table

You will write a command-line program named "lengths.py" which makes an HTML page describing the sequence length of records from a FASTA file.

The input will be from a FASTA-formatted file specified on the command-line. It will have 0 or more FASTA records. The first word of the title is the sequence id. For example, if the title is "gi|12345 Golem/Hobbit hybrid" then the sequence id is "gi|12345".

Your program will generate the HTML file named "lengths.html". That file will contain a title and an h1 header based on the input FASTA filename. For example, if the FASTA filename is "br_sequences.fasta" then you might use the text "br_sequences.fasta record properties".

It will contain an HTML table with information about the record number (starting at 1), record id, and total sequence length. (The point of this is to generate HTML tables, not compute sequence properties.) Here's an example of what it might look like:

#idlength
1gi|4376110|emb|CAA49762.1|259
2gi|6320236|ref|NP_010316.1|320
3gi|6319869|ref|NP_009950.1|332
4gi|6319528|ref|NP_009610.1|344
5gi|67527045|gb|AAY68314.1|256
6gi|71083330|ref|YP_266049.1|255
7gi|4502639|ref|NP_000570.1|352
8gi|71062443|gb|AAZ21446.1|255
9gi|70982522|ref|XP_746789.1|304
10gi|57228009|gb|AAW44466.1|328

Optional: Use the following FASTA file as input to your program.

>"silly&lt;!-- strange FASTA record #1
ACDE
>&alpha;&beta;--> strange FASTA record #2
PPPP
Did it cause strange output? If so you'll need to escape the special characters. Use cgi.escape().

Optional: use the generator-based FASTA record reader from today's lecture instead of the older fasta_reader module.

B - HTML with a generated image

Modify the hydrophobicity plot code to make a new command-line program named "hydroplot.py". This will take a FASTA-formatted filename on the command line and generate two files based on the first record in the FASTA file.

One output file is "hydroplot.png" which is the hydrophobicity plot saved to a PNG file. (Use the code I gave in my lecture for making a PNG instead of displaying to the screen.) You should use either the triangle or the Savitzky-Golay filter and may use a hard-coded window size of 19.

The other is "hydroplot.html" which contains some basic information about the input record. You may decide for yourself what it should be. This HTML file must embed a image of "hydroplot.png" and that image must be a hyperlink to the PNG itself.

Optional: did you use cgi.escape in the right places?

Optional: Include a table of the start and end positions (in biologist coordinates) of each predicted transmembrane helix. Biologist coordinates start at 1 and include the end coordinate so the Python range [4,8) coorresponds to the biologist coordinates [5,8].

For both A and B, include your test files with the assignment and in your README tell me what I need to do to run your programs with your test files.



Copyright © 2001-2013 Andrew Dalke Scientific AB