Dalke Scientific Software: More science. Less time.

Web Programming

Here's what happens when you tell your web browser to use HTTP to connect to a server. (The details are in RFC 2616.)

Here's an example of the request header when I connect to http://www.nbn.ac.za/.
GET / HTTP/1.1
Host: www.nbn.ac.za
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/312.1 (KHTML, like Gecko) Safari/312
Accept: */*
Accept-Encoding: gzip, deflate
Accept-Language: en-us

This says that the request is an HTTP "GET" request for the "/" resource of the server. The request is being made using the HTTP/1.1 specification. The client tells the server that it reached the server using the hostname "www.nbn.ac.za". (The server might not know the host name if it serves multiple domains or if the request is forwarded to some other machine.)

The "Connection" field tells the server to keep the connection open with the client even after the response is sent.

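The last few fields provide information about the client and its preferences: "User-Agent" identifies the browser, "Accept" lists the MIME types the client is willing to receive ("*/*" means anything), "Accept-Encoding" says the client can handle gzip- or deflate-compressed responses, and "Accept-Language" says it prefers US English.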

The MIME type lets the client have some idea of how to display the result. A MIME type is a short notation for the contents of a file. Simple text is "text/plain" while HTML is "text/html". A PNG file is "image/png" and a PDF file is "application/pdf". Some of these MIME types are formally specified but people often make up their own for new file formats.

The response started with this

HTTP/1.1 200 OK
Date: Tue, 30 Aug 2005 01:26:33 GMT
Server: Apache/2.0.52 (FreeBSD) DAV/2 PHP/4.3.10
Last-Modified: Thu, 25 Aug 2005 12:04:04 GMT
ETag: "2a6ccc-7d48-d34d5500"
Accept-Ranges: bytes
Content-Length: 32072
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=ISO-8859-1

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>National Bioinformatics Network</title>
  <script language="JavaScript" src="/incnew/nav.js"></script>
  <link rel="stylesheet" href="/incnew/styles.css" type="text/css">
  <script language="JavaScript">

This says the response is also in HTTP/1.1. The response code "200" (which means "OK") says that the requested resource "/" was found. "Date" is the time the server generated the response and "Server" contains configuration information about the server. The "/" resource was last modified on 25 August. The ETag is a value which is supposed to change if the given resource changes. A client can ask the server to only send the full document if the ETag changed since the previous request. That's a cool but tricky part about the web.

The server supports requests for a range of bytes, for example, to get only the part of the file between bytes 100 and 150. The response is 32072 bytes long. The "Keep-Alive" fields are in response to the request for a keep-alive connection. The last field of this header says the response is in "text/html" and more specifically that the characters are in the ISO-8859-1 (also called "Latin-1") character set.

The body contains the HTML for the NBN's home page.
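The ETag check described above can be sketched in a few lines. This is a simplified illustration in Python 3, not how Apache does it: the function name `respond` is made up, and a real "If-None-Match" header may list several tags or "*", which this sketch ignores.

```python
def respond(request_headers, current_etag, body):
    """Return (status, headers, body) for a GET, honoring If-None-Match.

    Simplifying assumption: the client sends back at most one ETag.
    """
    # The client echoes the ETag it saw last time in If-None-Match.
    if request_headers.get("If-None-Match") == current_etag:
        # Unchanged: send no body and let the client reuse its cached copy.
        return 304, {"ETag": current_etag}, b""
    # Changed (or first visit): send the full document.
    headers = {"ETag": current_etag, "Content-Length": str(len(body))}
    return 200, headers, body
```

A first request gets status 200 and the body; a repeat request carrying the same ETag gets status 304 ("Not Modified") and an empty body.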

I went through this because I think it's helpful to understand that fundamentally web programming isn't complex. A block of text goes one way, a block of text comes back. You could parse the headers in the request and response using the same skills you developed for parsing FASTA or BLAST output files.
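To make that concrete, here's a minimal sketch of parsing the start line and header fields out of a block of HTTP text. It's written for Python 3, the helper name `parse_http_head` is my own, and it skips details a real parser needs (repeated fields, folded lines).

```python
def parse_http_head(text):
    """Split a request or response head into (start_line, headers)."""
    lines = text.splitlines()
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        # Split at the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers

request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.nbn.ac.za\r\n"
    "Connection: keep-alive\r\n"
    "Accept: */*\r\n"
    "\r\n"
)
start_line, headers = parse_http_head(request)
method, path, version = start_line.split()
```

After this runs, `method` is "GET", `path` is "/" and `headers["host"]` is "www.nbn.ac.za".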

On the other hand, letters in a book are easy to understand but the contents of the book may be simple to understand or very hard. Web application programming is like that. There are many conventions for how to do things and keeping them all in mind can be a pain.

Programmers love developing frameworks. These are supposed to organize the complexity to help you focus on your project and not on all the details of HTTP. Web frameworks are easy to develop so there are a lot of them. Because of the PyWebOff I've been using CherryPy because it seems like the best solution (for now) for Python and for small web applications.

I'll walk through the CherryPy tutorial.

Here's the debug output including information from when I connected to http://localhost:8080/

2005/08/30 04:04:14 CONFIG INFO Server parameters:
2005/08/30 04:04:14 CONFIG INFO   logToScreen: 1
2005/08/30 04:04:14 CONFIG INFO   logFile: 
2005/08/30 04:04:14 CONFIG INFO   protocolVersion: HTTP/1.0
2005/08/30 04:04:14 CONFIG INFO   socketHost: 
2005/08/30 04:04:14 CONFIG INFO   socketPort: 8080
2005/08/30 04:04:14 CONFIG INFO   socketFile: 
2005/08/30 04:04:14 CONFIG INFO   reverseDNS: 0
2005/08/30 04:04:14 CONFIG INFO   socketQueueSize: 5
2005/08/30 04:04:14 CONFIG INFO   threadPool: 0
2005/08/30 04:04:14 CONFIG INFO   sslKeyFile: 
2005/08/30 04:04:14 CONFIG INFO   sessionStorageType: 
2005/08/30 04:04:14 CONFIG INFO   staticContent: []
2005/08/30 04:04:14 HTTP INFO Serving HTTP on socket: ('', 8080)
2005/08/30 04:04:19 HTTP INFO 127.0.0.1 - GET / HTTP/1.1
2005/08/30 04:04:19 HTTP INFO 127.0.0.1 - GET /favicon.ico HTTP/1.1

The CherryPy server is a web server. It converts the request URL into a set of actions for Python. More specifically it converts the path of the URL (which looks like /writings/NBN/) into something which looks like

root = server_configuration.get_root()
outfile.write(root.writings.NBN.index())
This lets you think of web programming as being similar to developing functions that are part of a Python object.
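The idea of turning a URL path into attribute lookups can be sketched without CherryPy at all. The code below is an illustration only, in Python 3. It is not CherryPy's implementation, and the names `resolve`, `Root`, `WritingsPage` and `NBNPage` are made up for the example.

```python
def resolve(root, path):
    """Walk the path segments as attribute lookups and return the handler."""
    node = root
    for segment in path.strip("/").split("/"):
        if segment:
            node = getattr(node, segment)
    if not callable(node):
        # A path that ends at an object falls back to its index method.
        node = node.index
    # Only methods explicitly marked as exposed may be called from the web.
    if not getattr(node, "exposed", False):
        raise LookupError("not exposed: " + path)
    return node

class NBNPage:
    def index(self):
        return "NBN lecture notes"
    index.exposed = True

class WritingsPage:
    NBN = NBNPage()

class Root:
    writings = WritingsPage()

    def index(self):
        return "home page"
    index.exposed = True

    def secret(self):
        return "not reachable from a URL"

root = Root()
page = resolve(root, "/writings/NBN/")()
```

Here `page` is the string returned by `root.writings.NBN.index()`, while `resolve(root, "/secret")` raises LookupError because that method was never marked as exposed.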

I'm going to go through the examples included with the CherryPy distribution. Those are in the subdirectory CherryPy-2.0.0/cherrypy/tutorial in your downloaded copy of CherryPy. I'll assume everyone here has written HTML before. Here's a tutorial on HTML forms for those who haven't done forms before. It's not the best of tutorials; for example, this one has pictures to show you what the controls look like.

"""
Tutorial 02 - Multiple methods

This tutorial shows you how to link to other methods of your request
handler.
"""

from cherrypy import cpg

class HelloWorld:

    def index(self):
        # Let's link to another method here.
        return 'We have an <a href="showMessage">important message</a> for you!'

    index.exposed = True


    def showMessage(self):
        # Here's the important message!
        return "Hello world!"

    showMessage.exposed = True

cpg.root = HelloWorld()
cpg.server.start(configFile = 'tutorial.conf')

"""
Tutorial 03 - Passing variables

This tutorial shows you how to pass GET/POST variables to methods.
"""

from cherrypy import cpg

class WelcomePage:

    def index(self):
        # Ask for the user's name.
        return '''
            <form action="greetUser" method="GET">
            What is your name?
            <input type="text" name="name" />
            <input type="submit" />
            </form>
        '''

    index.exposed = True


    def greetUser(self, name = None):
        # CherryPy passes all GET and POST variables as method parameters.
        # It doesn't make a difference where the variables come from, how
        # large their contents are, and so on.
        #
        # You can define default parameter values as usual. In this
        # example, the "name" parameter defaults to None so we can check
        # if a name was actually specified.

        if name:
            # Greet the user!
            return "Hey %s, what's up?" % name
        else:
            # No name was specified
            return 'Please enter your name <a href="./">here</a>.'

    greetUser.exposed = True


cpg.root = WelcomePage()
cpg.server.start(configFile = 'tutorial.conf')

There's a small security problem here: what if the name is "<a href='http://google.com/'>Andrew</a>"? The markup would be copied into the page as-is. This is an example of a cross-site scripting (XSS) attack. To make it harder to do an XSS attack you should quote untrustworthy strings using cgi.escape():
>>> import cgi
>>> cgi.escape("P&G")
'P&amp;G'
>>> cgi.escape("<b>hi!</b>")
'&lt;b&gt;hi!&lt;/b&gt;'
>>> 
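The listings here are for Python 2. In Python 3 the cgi.escape() function is gone (it was removed in Python 3.8) and html.escape() does the same job, additionally escaping quote characters by default. Here's a sketch of the greeting logic with the name escaped. The standalone function `greet_user` is my own stand-in for the greetUser method, without any CherryPy parts.

```python
import html

def greet_user(name=None):
    """Build the greeting, escaping the untrusted name first."""
    if name:
        # html.escape turns &, <, > and quotes into character references.
        return "Hey %s, what's up?" % html.escape(name)
    return 'Please enter your name <a href="./">here</a>.'

greeting = greet_user("<a href='http://google.com/'>Andrew</a>")
```

The resulting `greeting` contains no "<" characters from the input, so the browser shows the markup as text instead of rendering a link.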

At this point you know enough to develop a simple web application for sequence data. Here's a program to compute the length of the input sequence. It ignores any space or newline characters in the input.

from cherrypy import cpg

class BioWebApp:

    def index(self):
        # Ask for the sequence.
        return '''
            <form action="seqlength" method="GET">
            Sequence
            <input type="text" name="seq" />
            <input type="submit" />
            </form>
        '''

    index.exposed = True


    def seqlength(self, seq):
        n = len(seq) - seq.count(" ") - seq.count("\n")
        return "sequence length is %d" % n

    seqlength.exposed = True


cpg.root = BioWebApp()
cpg.server.start()

I want to paste in a FASTA file as input instead of a single sequence line. The file might be big and for various reasons it's better to use a POST for this instead of a GET. To handle this you need to learn about the cStringIO module. This lets you work with a string as if it is a file. (There's a StringIO module too but cStringIO is written in C and is faster.)

>>> import cStringIO
>>> f = cStringIO.StringIO("This is\na test\nof the\ncStringIO.")
>>> f.readline()
'This is\n'
>>> f.readline()
'a test\n'
>>> f.readline()
'of the\n'
>>> f.readline()
'cStringIO.'
>>> f.readline()
''
>>> f.seek(10)
>>> f.read(5)
'test\n'
>>> 
I'll use a StringIO to convert the input string into something I can parse using the fasta_reader module.
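The fasta_reader module itself isn't listed in these notes. As a rough idea of what such a reader could look like, here's a minimal sketch in Python 3, where io.StringIO takes the place of cStringIO. Everything in it is an assumption: I only kept the names the examples rely on (`read_fasta_records`, `.title`, `.sequence`), and the real module may behave differently.

```python
import io
from collections import namedtuple

# Hypothetical record type with the two attributes the examples use.
FastaRecord = namedtuple("FastaRecord", ["title", "sequence"])

def read_fasta_records(infile):
    """Yield one FastaRecord per '>' header found in a file-like object."""
    title, chunks = None, []
    for line in infile:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                # A new header finishes the previous record.
                yield FastaRecord(title, "".join(chunks))
            title, chunks = line[1:], []
        elif title is not None:
            # Sequence lines are joined with surrounding whitespace removed.
            chunks.append(line.strip())
    if title is not None:
        yield FastaRecord(title, "".join(chunks))

text = ">seq1 first example\nACGT\nAC\n>seq2\nGGG\n"
records = list(read_fasta_records(io.StringIO(text)))
```

Note that this sketch is a generator, so its result can only be looped over once unless you wrap it in list() as done on the last line.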

Here's the previous web application modified to use a POST instead of a GET, to make the input text area bigger, and to parse the input as a FASTA file.

from cherrypy import cpg
from cStringIO import StringIO

import fasta_reader

class BioWebApp:

    def index(self):
        # Ask for the sequence.
        return '''<form action="seqlength" method="POST">
Sequence (in FASTA format):<br />
<textarea name="seq" rows="10" cols="80"></textarea><br />
<input type="submit" />
</form>
        '''

    index.exposed = True

    def seqlength(self, seq):
        f = StringIO(seq)
        rec = fasta_reader.read_fasta_record(f)
        n = len(rec.sequence)
        return "sequence length is %d" % n

    seqlength.exposed = True


cpg.root = BioWebApp()
cpg.server.start()

In the next example I'll list the length of all of the fields in the input FASTA file. I'll change how I make the output. Instead of returning a single string I'll yield several strings. CherryPy has special code to handle generator functions. It iterates over the yielded strings to build the response. This is very handy because making a large string gets to be cumbersome. Doing things the CherryPy way makes the yield statement act almost like a print statement. Also, note that the FASTA title can contain almost arbitrary text so it needs to be escaped.


from cherrypy import cpg
from cStringIO import StringIO
import cgi

import fasta_reader

class BioWebApp:

    def index(self):
        # Ask for the sequence.
        return '''<form action="seqlength" method="POST">
Sequence:<br />
<textarea name="seq" rows="10" cols="80"></textarea><br />
<input type="submit" />
</form>
        '''

    index.exposed = True

    def seqlength(self, seq):
        f = StringIO(seq)
        rec = fasta_reader.read_fasta_record(f)
        n = len(rec.sequence)
        yield "Sequence length of <i>"
        yield cgi.escape(rec.title)
        yield "</i> is %d" % n

    seqlength.exposed = True


cpg.root = BioWebApp()
cpg.server.start()
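To see why yielding strings works, here's a small Python 3 sketch of how a framework might turn a handler's result into a response body. This is an illustration under my own assumptions, not CherryPy's code, and the names `seqlength_page` and `build_body` are made up.

```python
import html

def seqlength_page(title, length):
    """A handler written as a generator: each yield adds to the page."""
    yield "Sequence length of <i>"
    yield html.escape(title)
    yield "</i> is %d" % length

def build_body(handler_result):
    """Accept either a single string or an iterable of strings."""
    if isinstance(handler_result, str):
        return handler_result
    # Joining once at the end avoids building the page by repeated "+".
    return "".join(handler_result)

body = build_body(seqlength_page("gi|123 <test>", 42))
```

The handler never has to assemble the full page itself, and the escaped title ends up between the two fixed pieces of HTML.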

I can make a table listing the sequence length of each record like this

from cherrypy import cpg
from cStringIO import StringIO
import cgi

import fasta_reader

class BioWebApp:

    def index(self):
        # Ask for the sequence.
        return '''<form action="seqlength" method="POST">
Sequence:<br />
<textarea name="seq" rows="10" cols="80"></textarea><br />
<input type="submit" />
</form>
        '''

    index.exposed = True

    def seqlength(self, seq):
        f = StringIO(seq)
        # Define the table
        yield '<table border="1">'
        # Define the table headers
        yield '<tr><th>Title</th><th>Length</th></tr>\n'

        # data rows in the table
        for rec in fasta_reader.read_fasta_records(f):
            n = len(rec.sequence)
            yield "<tr><td>"
            yield cgi.escape(rec.title)
            yield "</td><td>"
            yield str(n)
            yield "</td></tr>\n"

        # Finish the table
        yield "</table>"

    seqlength.exposed = True


cpg.root = BioWebApp()
cpg.server.start()

One thing that makes web programming hard is that HTTP is stateless. The client connects to the server, makes a request, gets the response, and (effectively) closes the connection. If several clients may use the server then every time a client connects the server has to figure out which client it was and what it was doing the last time it was there.

It's like mail. There isn't a string connecting the mail to the person who sent it. Instead you need to look at the envelope (or the headers of an email) to figure out if it's from someone you know and if it's continuing one conversation or starting a new one. Humans are pretty good at that. Computers aren't.

There are two common ways to keep track of what's going on. One is by using a cookie. This is another bit of computer programmer jargon. From the FOLDOC definition:

2. <protocol> A handle, transaction ID, or other token of agreement between cooperating programs. "I give him a packet, he gives me back a cookie".

The ticket you get from a dry-cleaning shop is a perfect mundane example of a cookie; the only thing it's useful for is to relate a later transaction to this one (so you get the same clothes back).
or the Wikipedia definition.

Cookies work through headers in the HTTP request and response. Cookies are small bits of text, stored by host name or domain. If a client connects to a server and the server's name or domain is in the cookie jar then the client sends the cookie's text in the header. The server can ask the client to store a cookie for use the next time the client connects to the server. This is done through a header in the response.
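Here's a short sketch of both directions using Python 3's standard http.cookies module. The cookie name "dataset" and its value are made up for the example; nothing in these notes actually sets such a cookie.

```python
from http.cookies import SimpleCookie

# Server side: build the response header that asks the client
# to store a cookie.
outgoing = SimpleCookie()
outgoing["dataset"] = "14488"        # hypothetical cookie name and value
outgoing["dataset"]["path"] = "/"    # send it back for every path on this host
set_cookie_header = outgoing.output()

# Server side, next request: parse the Cookie header the client sent.
incoming = SimpleCookie()
incoming.load("dataset=14488; theme=dark")
dataset_id = incoming["dataset"].value
```

`set_cookie_header` is a complete "Set-Cookie: ..." header line for the response, and `dataset_id` is the string the server stored on the earlier visit.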

The other way is to pass the information around through the URL. Each client (or session or resource) is assigned a unique string. This string is passed around as part of the URL. For example, I might have a "uid" (for user-id) of 14488. The server can rewrite all of the URLs it gives me so they look like one of the following:
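One possible form carries the id as a query parameter. Here's a minimal Python 3 sketch of that kind of rewriting. The helper `add_uid` and the example paths are hypothetical, chosen only to illustrate the idea.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_uid(url, uid):
    """Return url with a uid query parameter added (or replaced)."""
    parts = urlsplit(url)
    # Keep the other parameters and drop any stale uid.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "uid"]
    query.append(("uid", str(uid)))
    return urlunsplit(parts._replace(query=urlencode(query)))

summary_url = add_uid("/summary", 14488)
excel_url = add_uid("/excel?format=tab", 14488)
```

Every link the server writes into a page would go through a function like this, so the id follows the user from one request to the next.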

There are tradeoffs between using cookies and using URLs for this information. The choice depends on what you want to do and what you are more comfortable with.

What I want to do is allow someone to send the FASTA file to the server and see the table of sequence sizes. I'll provide a "download to Excel" link to let the user get the results formatted as a tab-delimited file for Excel. Because two URLs need to have information about the upload I'll need to store information on the server. I'll do this by giving each data set a unique id, used in the URL parameter.

The web interface will work like this:

The tab output file doesn't contain any HTML so I'll change the Content-Type to "text/plain". To make the file automatically load into Excel I would need to change the content-type to something configured for that. I think that's supposed to be "application/vnd.ms-excel" and perhaps with a "Content-Disposition" header like "attachment;filename=seqlength.xls" but I don't have Excel handy so I can't test if this is the case.
from cherrypy import cpg
from cStringIO import StringIO
import cgi

import fasta_reader

datasets = {}

# WARNING: this is not thread-safe
# WARNING: this is not secure (the values are easy to guess)
def add_dataset(records):
    N = len(datasets)
    # The id needs to be a string
    id = str(N)
    datasets[id] = records
    return id

class BioWebApp:

    def index(self):
        # Ask for the sequence.
        return '''<form action="seqlength" method="POST">
Sequence:<br />
<textarea name="seq" rows="10" cols="80"></textarea><br />
<input type="submit" />
</form>
        '''

    index.exposed = True

    def seqlength(self, seq):
        f = StringIO(seq)
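        # Make a list so the records can be looped over more than once
        # (here for the table and later for the Excel download).
        records = list(fasta_reader.read_fasta_records(f))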
        id = add_dataset(records)

        yield '<table border="1">'
        yield '<tr><th>Title</th><th>Length</th></tr>\n'
        for rec in records:
            n = len(rec.sequence)
            yield "<tr><td>"
            yield cgi.escape(rec.title)
            yield "</td><td>"
            yield str(n)
            yield "</td></tr>\n"
        yield "</table><br />\n"
        yield '<a href="excel?id=%s">Download to Excel</a>' % (id,)

    seqlength.exposed = True


    def excel(self, id):
        records = datasets[id]
        cpg.response.headerMap["Content-Type"] = "text/plain"
        yield "id\tlength\n"
        for rec in records:
            # use the first word of the title; should be the id
            # BUG: what if the title is empty?
            # BUG: what if the title contains a tab?
            yield "%s\t%s\n" % (rec.title.split()[0], len(rec.sequence))

    excel.exposed = True

cpg.root = BioWebApp()
cpg.server.start()
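The two BUG comments in excel() can be handled with the standard csv module. Here's a Python 3 sketch under my own assumptions: the function name `records_to_tsv` is made up, and it takes plain (title, sequence) pairs instead of fasta_reader records so it can run on its own.

```python
import csv
import io

def records_to_tsv(records):
    """Format (title, sequence) pairs as tab-delimited text."""
    out = io.StringIO()
    # csv.writer quotes any field that happens to contain a tab or newline.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "length"])
    for title, sequence in records:
        words = title.split()
        # An empty title gives an empty id instead of an IndexError.
        first_word = words[0] if words else ""
        writer.writerow([first_word, len(sequence)])
    return out.getvalue()

tsv = records_to_tsv([("seq1 first example", "ACGT"), ("", "AC")])
```

Since split() breaks the title on any whitespace, tabs included, the first word can never contain a tab; the csv quoting is a safety net for any other field you might add later.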

I don't like this interface because it combines the file upload with information about the uploaded data set. That's okay, I can easily separate the two. I'll send the form data to the "/upload" URL (instead of "/seqlength"). That will read the data set. If that worked I'll redirect the client to the "/summary?id=ID" URL. (If there was a problem then CherryPy will display the Python stack trace.)

This code uses the cherrypy.lib.httptools.redirect function to tell CherryPy to tell the web client to look someplace else for the results of the POST. This is a common practice in web development.
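At the HTTP level a redirect is just another response with a "Location" header. Here's a Python 3 sketch of what such a response could look like. Which status code CherryPy's redirect function actually sends isn't shown in these notes; I used 303 ("See Other") because that's the code HTTP/1.1 defines for "fetch the result of this POST with a GET from another URL". The function name `redirect_response` is made up.

```python
def redirect_response(location):
    """Build the head of a minimal HTTP redirect response."""
    return (
        "HTTP/1.1 303 See Other\r\n"
        "Location: %s\r\n"
        "Content-Length: 0\r\n"
        "\r\n" % location
    )

response = redirect_response("/summary?id=0")
```

The browser reads the "Location" field and makes a new GET request for that URL, so reloading the summary page doesn't resubmit the form.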

I changed the target url for the form so you'll need to reload the main page and re-enter the sequence data to make sure everything works.


from cStringIO import StringIO
import cgi

from cherrypy import cpg
from cherrypy.lib import httptools

import fasta_reader

datasets = {}

# WARNING: this is not thread-safe
# WARNING: this is not secure (the values are easy to guess)
def add_dataset(records):
    N = len(datasets)
    # The id needs to be a string
    id = str(N)
    datasets[id] = records
    return id

class BioWebApp:

    def index(self):
        # Ask for the sequence.
        return '''<form action="upload" method="POST">
Sequence:<br />
<textarea name="seq" rows="10" cols="80"></textarea><br />
<input type="submit" />
</form>
        '''

    index.exposed = True

    def upload(self, seq):
        f = StringIO(seq)
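        # Make a list so the stored records can be looped over more than once
        # (by both the summary page and the Excel download).
        records = list(fasta_reader.read_fasta_records(f))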
        id = add_dataset(records)
        httptools.redirect("summary?id=%s" % id)

    upload.exposed = True

    def summary(self, id):
        records = datasets[id]
        
        yield '<table border="1">'
        yield '<tr><th>Title</th><th>Length</th></tr>\n'
        for rec in records:
            n = len(rec.sequence)
            yield "<tr><td>"
            yield cgi.escape(rec.title)
            yield "</td><td>"
            yield str(n)
            yield "</td></tr>\n"
        yield "</table><br />\n"
        yield '<a href="excel?id=%s">Download to Excel</a>' % (id,)


    summary.exposed = True


    def excel(self, id):
        records = datasets[id]
        cpg.response.headerMap["Content-Type"] = "text/plain"
        yield "id\tlength\n"
        for rec in records:
            # use the first word of the title; should be the id
            # BUG: what if the title is empty?
            # BUG: what if the title contains a tab?
            yield "%s\t%s\n" % (rec.title.split()[0], len(rec.sequence))

    excel.exposed = True

cpg.root = BioWebApp()
cpg.server.start()
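The two WARNING comments on add_dataset can be addressed with a lock and random ids. Here's a minimal Python 3 sketch using the standard secrets and threading modules. It's one way to do it, not the only one, and it still keeps everything in memory.

```python
import secrets
import threading

datasets = {}
_datasets_lock = threading.Lock()   # guards every change to datasets

def add_dataset(records):
    """Store records under a random, hard-to-guess id and return the id."""
    with _datasets_lock:
        while True:
            # 16 random bytes, encoded as a URL-safe string.
            dataset_id = secrets.token_urlsafe(16)
            if dataset_id not in datasets:
                break
        # Copy into a list so the data can be read more than once.
        datasets[dataset_id] = list(records)
        return dataset_id

first_id = add_dataset(["record 1"])
second_id = add_dataset(iter(["record 2", "record 3"]))
```

The ids are safe to put in a URL as-is, and someone who knows one id can no longer guess the next one by adding 1.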

In my BioWebApp I stored all of the records in memory. If the program runs for a long time or is used by a lot of people then the machine might run out of memory. If you quit and restart the program then the data disappears. If either are problems for you then you'll probably want to store the data in files or in a database. The choice depends on what you're doing with the data.

Even using those options you have to worry about what to do with old data. Because HTTP is stateless you don't know if someone's finished with the data. Some web applications have a "log out" or "release resources" button but people rarely use those. You may have to figure out what to do with old records that haven't been touched in several years. Delete? Or save in case the user bookmarked the file and expects to return to it some day? Or just buy bigger hard drives to store all the old data?
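As one illustration of the database option, here's a Python 3 sketch that keeps the records in SQLite and remembers when each data set was last used, so old ones can be purged. The table layout, the function names and the idea of purging by age are all my own assumptions, not a recommendation for every application.

```python
import sqlite3
import time

def open_store(path=":memory:"):
    """Open (and if needed create) the record store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " dataset_id TEXT, title TEXT, sequence TEXT, last_used REAL)"
    )
    return db

def save_dataset(db, dataset_id, records, now=None):
    """Store (title, sequence) pairs under the given id."""
    now = time.time() if now is None else now
    db.executemany(
        "INSERT INTO records VALUES (?, ?, ?, ?)",
        [(dataset_id, title, seq, now) for title, seq in records],
    )
    db.commit()

def load_dataset(db, dataset_id, now=None):
    """Return the stored pairs and mark the data set as recently used."""
    now = time.time() if now is None else now
    db.execute(
        "UPDATE records SET last_used = ? WHERE dataset_id = ?",
        (now, dataset_id),
    )
    db.commit()
    rows = db.execute(
        "SELECT title, sequence FROM records"
        " WHERE dataset_id = ? ORDER BY rowid",
        (dataset_id,),
    )
    return rows.fetchall()

def purge_older_than(db, max_age_seconds, now=None):
    """Delete records not used within max_age_seconds; return the count."""
    now = time.time() if now is None else now
    cur = db.execute(
        "DELETE FROM records WHERE last_used < ?", (now - max_age_seconds,)
    )
    db.commit()
    return cur.rowcount
```

Pass a filename instead of ":memory:" to keep the data across restarts. Whether to purge at all, and after how long, is the policy question raised above; the code only makes either choice possible.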



Copyright © 2001-2013 Andrew Dalke Scientific AB