Writing HTML using Python

You've written some HTML by hand. Here I'll show you how to write HTML using Python. There are better ways using HTML template languages which I'll talk about next week. But to understand them I think it's best to know how to do things manually first.

I'm going to write a program which takes a GenBank file and makes an HTML page with a table. Each entry of the table will have the feature name, start position and end position. The name will be a hyperlink to a FASTA file containing the sequence data for that feature.

The first step is to get the feature information from the GenBank file. For my input data I'll use AB077698 (originally from the BioJava distribution). You've already written a program to get GenBank feature data, so I'll skip the Biopython-specific explanation and just show you the code.

from Bio import GenBank

parser = GenBank.RecordParser()
record = parser.parse(open("AB077698.gb"))

for feature in record.features:
    print "Feature", repr(feature.key), repr(feature.location)

Here's the output

Feature 'source' '1..2701'
Feature 'gene' '1..2701'
Feature "5'UTR" '<1..79'
Feature 'CDS' '80..1144'
Feature 'misc_feature' '137..196'
Feature 'misc_feature' '239..292'
Feature 'misc_feature' '617..676'
Feature 'misc_feature' '725..778'
Feature "3'UTR" '1145..2659'
Feature 'polyA_site' '1606'
Feature 'polyA_site' '2660'

You can see there are three kinds of feature location: a single position, a start and end position, and the strange one with the "<". That last one is called a fuzzy location, and it means the start is at or to the left of position 1. (Actually, that one isn't very fuzzy. Consider a location like (1.10)..(60.88), where both the start and the end are only known to lie somewhere within a range.) For details read the NCBI feature table definition.
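To make the three forms concrete, here is a minimal sketch that pulls the numbers out of location strings like the ones printed above. The helper name parse_simple_location is my own invention, not part of Biopython, and it handles only these three simple forms. It keeps GenBank's 1-based positions as they are.

```python
def parse_simple_location(text):
    """Return (start, end, start_is_fuzzy) in GenBank's 1-based positions.

    Hypothetical helper: handles only "N", "N..M" and "<N..M".
    """
    start_is_fuzzy = text.startswith("<")
    if start_is_fuzzy:
        text = text[1:]  # drop the "<" marker before reading the number
    if ".." in text:
        start_text, end_text = text.split("..")
        return int(start_text), int(end_text), start_is_fuzzy
    # A single position starts and ends on the same base.
    position = int(text)
    return position, position, start_is_fuzzy


for text in ["1..2701", "<1..79", "1606"]:
    print(parse_simple_location(text))
```

Anything more complicated, such as the (1.10)..(60.88) form, is better left to a real parser.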

Biopython can parse the details of the feature table. Often that information isn't needed, and parsing it takes extra effort, so the RecordParser leaves each location as a string. Another GenBank parser, FeatureParser, does the detailed parsing.

from Bio import GenBank

parser = GenBank.FeatureParser()
record = parser.parse(open("AB077698.gb"))

for feature in record.features:
    print "Feature", repr(feature.type), repr(feature.location)

Note that I changed feature.key into feature.type. It would have been better if the two parsers used the same attribute name.

Here's the output from running the FeatureParser-based code.

Feature 'source' <Bio.SeqFeature.FeatureLocation instance at 0x1284b98>
Feature 'gene' <Bio.SeqFeature.FeatureLocation instance at 0x128d260>
Feature "5'UTR" <Bio.SeqFeature.FeatureLocation instance at 0x10efa58>
Feature 'CDS' <Bio.SeqFeature.FeatureLocation instance at 0x12910a8>
Feature 'misc_feature' <Bio.SeqFeature.FeatureLocation instance at 0x1288800>
Feature 'misc_feature' <Bio.SeqFeature.FeatureLocation instance at 0x1280f08>
Feature 'misc_feature' <Bio.SeqFeature.FeatureLocation instance at 0x1112440>
Feature 'misc_feature' <Bio.SeqFeature.FeatureLocation instance at 0x1114cd8>
Feature "3'UTR" <Bio.SeqFeature.FeatureLocation instance at 0x1118698>
Feature 'polyA_site' <Bio.SeqFeature.FeatureLocation instance at 0x1116058>
Feature 'polyA_site' <Bio.SeqFeature.FeatureLocation instance at 0x1114030>

You can see that the location is now a FeatureLocation instance instead of a string. This object has ways to return the fuzzy and non-fuzzy location information.

from Bio import GenBank

parser = GenBank.FeatureParser()
record = parser.parse(open("AB077698.gb"))

for feature in record.features:
    print "Feature", repr(feature.type)
    loc = feature.location
    print "  fuzzy", loc.start, loc.end
    print "  nofuzzy", loc.nofuzzy_start, loc.nofuzzy_end

The output

Feature 'source'
  fuzzy 0 2701
  nofuzzy 0 2701
Feature 'gene'
  fuzzy 0 2701
  nofuzzy 0 2701
Feature "5'UTR"
  fuzzy <0 79
  nofuzzy 0 79
Feature 'CDS'
  fuzzy 79 1144
  nofuzzy 79 1144
Feature 'misc_feature'
  fuzzy 136 196
  nofuzzy 136 196
Feature 'misc_feature'
  fuzzy 238 292
  nofuzzy 238 292
Feature 'misc_feature'
  fuzzy 616 676
  nofuzzy 616 676
Feature 'misc_feature'
  fuzzy 724 778
  nofuzzy 724 778
Feature "3'UTR"
  fuzzy 1144 2659
  nofuzzy 1144 2659
Feature 'polyA_site'
  fuzzy 1605 1605
  nofuzzy 1605 1605
Feature 'polyA_site'
  fuzzy 2659 2659
  nofuzzy 2659 2659

I'll use the non-fuzzy information because that's easier to deal with.
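Comparing the two listings, '80..1144' came back as 79 and 1144, and '1606' came back as 1605 and 1605. Assuming that pattern holds (the start is 0-based, the end of a range works as an exclusive Python slice end, and a single position has start equal to end), the conversion can be sketched as below. Both function names are hypothetical, and the rule is inferred only from the output shown here, so other Biopython versions may behave differently.

```python
def to_genbank_range(start, end):
    """Convert the parsed positions back to a GenBank-style location string."""
    if start == end:
        # Single-position features show up with start == end in the output.
        return str(start + 1)
    return "%d..%d" % (start + 1, end)


def slice_feature(sequence, start, end):
    """Return the bases covered by a feature with the parsed positions."""
    if start == end:
        # A single position still covers exactly one base.
        return sequence[start:start + 1]
    return sequence[start:end]


print(to_genbank_range(79, 1144))
print(to_genbank_range(1605, 1605))
print(slice_feature("ACGTACGT", 2, 5))
```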

Now that I have the coordinate information I want to make HTML output for each feature. I'll use a table.

from Bio import GenBank

parser = GenBank.FeatureParser()
record = parser.parse(open("AB077698.gb"))

print """<html>
<head>
 <title>Feature information</title>
</head>
<body>
<table border="1">"""

print "<tr><th>Feature</th><th>Start</th><th>End</th></tr>"

for feature in record.features:
    loc = feature.location
    print "<tr><td>%s</td><td>%s</td><td>%s</td></tr>" % (
        feature.type, loc.nofuzzy_start, loc.nofuzzy_end)

print """</table>
</body></html>"""

Here's the output

<html>
<head>
 <title>Feature information</title>
</head>
<body>
<table border="1">
<tr><th>Feature</th><th>Start</th><th>End</th></tr>
<tr><td>source</td><td>0</td><td>2701</td></tr>
<tr><td>gene</td><td>0</td><td>2701</td></tr>
<tr><td>5'UTR</td><td>0</td><td>79</td></tr>
<tr><td>CDS</td><td>79</td><td>1144</td></tr>
<tr><td>misc_feature</td><td>136</td><td>196</td></tr>
<tr><td>misc_feature</td><td>238</td><td>292</td></tr>
<tr><td>misc_feature</td><td>616</td><td>676</td></tr>
<tr><td>misc_feature</td><td>724</td><td>778</td></tr>
<tr><td>3'UTR</td><td>1144</td><td>2659</td></tr>
<tr><td>polyA_site</td><td>1605</td><td>1605</td></tr>
<tr><td>polyA_site</td><td>2659</td><td>2659</td></tr>
</table>
</body></html>

and here's what the table looks like as HTML

Feature       Start  End
source        0      2701
gene          0      2701
5'UTR         0      79
CDS           79     1144
misc_feature  136    196
misc_feature  238    292
misc_feature  616    676
misc_feature  724    778
3'UTR         1144   2659
polyA_site    1605   1605
polyA_site    2659   2659
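The non-fuzzy positions are plain numbers, so they are safe to drop into the page. A fuzzy position such as the "<0" printed earlier is not, because "<" starts a tag in HTML. Here is a minimal sketch of escaping each cell before building a row. The names escape_html and table_row are made up for this example.

```python
def escape_html(value):
    """Replace the characters that are special in HTML text."""
    text = str(value)
    # "&" must be replaced first so the entities added below stay intact.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def table_row(name, start, end):
    """Build one table row with every cell escaped."""
    return "<tr><td>%s</td><td>%s</td><td>%s</td></tr>" % (
        escape_html(name), escape_html(start), escape_html(end))


print(table_row("5'UTR", "<0", 79))
```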

I want to save the output to the file "features.html". I'll use a new bit of Python syntax for that. It's an extension of the print statement that prints to a file instead of to the screen:

outfile = open("hello.txt", "w")
print >>outfile, "Hello!"

The changes to the Python code are minimal

from Bio import GenBank

parser = GenBank.FeatureParser()
record = parser.parse(open("AB077698.gb"))
outfile = open("features.html", "w")

print >>outfile, """<html>
<head>
 <title>Feature information</title>
</head>
<body>
<table border="1">"""

print >>outfile, "<tr><th>Feature</th><th>Start</th><th>End</th></tr>"

for feature in record.features:
    loc = feature.location
    print >>outfile, "<tr><td>%s</td><td>%s</td><td>%s</td></tr>" % (
        feature.type, loc.nofuzzy_start, loc.nofuzzy_end)

print >>outfile, """</table>
</body></html>"""

The output file.
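The print >>outfile form is one way to write to a file. Another is the file's write method, sketched below. Unlike print, write does not add the newline for you. The write_lines helper and the temporary directory are only there to make the example self-contained.

```python
import os
import tempfile


def write_lines(filename, lines):
    """Write each string in lines to filename, one per line."""
    # The with statement closes the file even if a write fails.
    with open(filename, "w") as outfile:
        for line in lines:
            outfile.write(line + "\n")  # write() needs an explicit newline


# Write into a throwaway directory so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), "hello.txt")
write_lines(path, ["Hello!"])

with open(path) as infile:
    print(infile.read())
```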

I also want to make FASTA files for the sequence from each feature. I'll name the first FASTA file "seq1.fasta", the second "seq2.fasta", and so on. First, the code to make the FASTA files.

def save_fasta(filename, title, sequence):
    fasta_file = open(filename, "w")
    fasta_file.write(">" + title + "\n")
    for i in range(0, len(sequence), 72):
        fasta_file.write(sequence[i:i+72])
        fasta_file.write("\n")
    fasta_file.close()

You've seen this code before, though in a different form. Here it is in the full program:

from Bio import GenBank

def save_fasta(filename, title, sequence):
    fasta_file = open(filename, "w")
    fasta_file.write(">" + title + "\n")
    for i in range(0, len(sequence), 72):
        fasta_file.write(sequence[i:i+72])
        fasta_file.write("\n")
    fasta_file.close()

parser = GenBank.FeatureParser()
record = parser.parse(open("AB077698.gb"))
outfile = open("features.html", "w")

print >>outfile, """<html>
<head>
 <title>Feature information</title>
</head>
<body>
<table border="1">"""

print >>outfile, "<tr><th>Feature</th><th>Start</th><th>End</th></tr>"

counter = 1
for feature in record.features:
    loc = feature.location
    start = loc.nofuzzy_start
    end = loc.nofuzzy_end

    # Make the FASTA file
    filename = "seq%s.fasta" % counter
    title = "Feature %s: %s" % (counter, feature.type)
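    # start is 0-based and end is exclusive, so [start:end] covers a range.
    # A single-position feature has start == end and still needs one base.
    save_fasta(filename, title, record.seq.data[start:max(end, start+1)])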

    print >>outfile, "<tr><td>%s</td><td>%s</td><td>%s</td></tr>" % (
        feature.type, start, end)

    counter += 1
    

print >>outfile, """</table>
</body></html>"""

Here's the 9th record.
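The line-wrapping loop in save_fasta can be checked without touching the disk. This sketch builds the same text as a string. The function name format_fasta, the width parameter and the example title and sequence are all assumptions for illustration.

```python
def format_fasta(title, sequence, width=72):
    """Return a FASTA record as one string, wrapping the sequence lines."""
    lines = [">" + title]
    # Step through the sequence in chunks of at most `width` characters.
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


# A made-up 160-base sequence: two full lines and one short line.
record_text = format_fasta("Feature 1: example", "ACGT" * 40)
print(record_text)
```

Splitting the formatting from the file writing also makes it easy to try a different line width.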

Finally, I need to make a hyperlink from the feature's name to the FASTA record. In this case I only need an extra <a href="...">...</a>.

from Bio import GenBank

def save_fasta(filename, title, sequence):
    fasta_file = open(filename, "w")
    fasta_file.write(">" + title + "\n")
    for i in range(0, len(sequence), 72):
        fasta_file.write(sequence[i:i+72])
        fasta_file.write("\n")
    fasta_file.close()

parser = GenBank.FeatureParser()
record = parser.parse(open("AB077698.gb"))
outfile = open("features.html", "w")

print >>outfile, """<html>
<head>
 <title>Feature information</title>
</head>
<body>
<table border="1">"""

print >>outfile, "<tr><th>Feature</th><th>Start</th><th>End</th></tr>"

counter = 1
for feature in record.features:
    loc = feature.location
    start = loc.nofuzzy_start
    end = loc.nofuzzy_end

    # Make the FASTA file
    filename = "seq%s.fasta" % counter
    title = "Feature %s: %s" % (counter, feature.type)
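    # start is 0-based and end is exclusive, so [start:end] covers a range.
    # A single-position feature has start == end and still needs one base.
    save_fasta(filename, title, record.seq.data[start:max(end, start+1)])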

    print >>outfile, '''<tr><td><a href="%s">%s</a></td><td>%s</td><td>%s</td></tr>''' % (
        filename, feature.type, start, end)

    counter += 1
    

print >>outfile, """</table>
</body></html>"""

To see it in action, here is features.html.
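The row-building step can also be pulled out of the main loop and tried on plain data. In this sketch the feature_rows function is hypothetical, the features are given as (type, start, end) tuples instead of Biopython objects, and enumerate stands in for the manual counter.

```python
def feature_rows(features):
    """Yield one linked HTML table row per (type, start, end) tuple."""
    # enumerate(..., 1) numbers the features from 1, like the counter did.
    for counter, (feature_type, start, end) in enumerate(features, 1):
        filename = "seq%s.fasta" % counter
        yield '<tr><td><a href="%s">%s</a></td><td>%s</td><td>%s</td></tr>' % (
            filename, feature_type, start, end)


# The first three features from the table above.
features = [("source", 0, 2701), ("gene", 0, 2701), ("5'UTR", 0, 79)]
for row in feature_rows(features):
    print(row)
```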