Dalke Scientific Software: More science. Less time. Products

Editing the taxonomy database

Stating the problem

In the previous lecture I made a taxonomy browser with support for searches and viewing individual records. I'm going to modify the server to let anyone add aliases for a given term.

The user in this case is a taxonomy specialist somewhere in the world. To make the user more concrete I'll name him Jim. Jim received his PhD in biology 2 years ago and is doing a post-doc on coral reef fishes of Micronesia. The local fishermen use names in their own language for different species, and don't use the Latin names. Jim published a paper listing the different names and now he wants to add it to the database so others can find it more easily. Jim is not a computer programmer and would rather spend his time in the water than learning how to use a new computer program.

Given the user case here's the interaction scenario. Jim is working with Pomacentridae (locally known as damselfishes) and wants to note that Abudefduf luridus is also known as 'canary damsel' or 'canary damselfish.' He goes to the taxonomy server and searches for "Abudefduf luridus". From the search results he selects the species record and sees that the aliases don't yet exist so he adds both of them. He then wants to record that Stegastes planifrons is also known as "threespot damselfish."

The above paragraph is a functional description of what Jim wants to do, but it doesn't desribe all of the details of how he does it. You as the developer need to figure out a few things:

I can think of several solutions. The choice of which is right depends on the users. Because I expect that people like Jim will more often broswe the database and will want little or no training I'll modify the browser interface slightly to include the ability to add new aliases. Suppose instead I had a local curational team searching the literature and adding aliases. In that case a specialized entry application makes more sense.

Understanding what people want and how to make the human/computer interface work well is part of "CHI" (for "computer/human interface) and includes subfields like usability design and user-centered development. For some initial pointers read some of the stories and books by Bruce Tognazzini, Alan Cooper's "The Inmates are Running the Asylum", and my lecture from last year.

Updating the taxid detail page

Here's the look of the page showing details about a taxonomy record.

Start a new search

Abudefduf luridus

Taxon identifier: 142648
Genetic Code: Standard

Parent taxon: Abudefduf
Children: None

I'm going to modify it to

Start a new search

Abudefduf luridus

Taxon identifier: 142648
Genetic Code: Standard
Aliases: canary damsel, canary damselfish [edit aliases]

Parent taxon: Abudefduf
Children: None

and clicking on the "edit aliases" link will take me to a page which looks like this:

Edit aliases for Abudefduf luridus

Taxon identifier: 142648
Edit, erase entries or add to the following list of aliases
1.
2.
3.
4.
5.
6.
7.

Changing the data model

To make this work I'll need to store aliases in the database. The current database has two tables, "taxonomy" and "genetic_code". They look like this:
table: taxonomy
taxidscientific_namerankparent_idgenetic_code_idmitochondrial_genetic_code_id
1rootno rank 10
2Bacteriasuperkingdom31567110
6Azorhizobiumgenus35928110
7Azorhizobium caulinodansspecies110
9Buchnera aphidicolaspecies2199110
...
table: genetic_code
idname
0Unspecified
1Standard
2Vertebrate Mitochondrial
3Yeast Mitochondrial
...

I need a new table which relates an alias record to a taxonomy record. The alias table must have two fields: a tax_id field which is the primary key of the taxonomy record (meaning "this alias is an alias to that taxonomy record") and the text of the alias. SQLObject adds another requirement that every record have a unique primary key. (This is not a requirement of a relational database and if were using SQL directly we wouldn't need an primary identifier.)
table: alias
idtax_idalias
1142648canary damsel
2142648canary damselfish
3159713threespot damselfish
...

The easiest way to make the table is through the database administation program. For sqlite that's 'sqlite3'. The other database systems have similar interfaces. I'll load the database and show the current schema:

[~/nbn/taxonomy] % sqlite3 taxdata.sqlite
SQLite version 3.2.7
Enter ".help" for instructions
sqlite> .schema
CREATE TABLE genetic_code (
    id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE taxonomy (
    tax_id INTEGER PRIMARY KEY,
    scientific_name TEXT,
    rank TEXT,
    parent_id INT,
    genetic_code_id INT,
    mitochondrial_genetic_code_id INT
);
CREATE INDEX genetic_code_id_idx on taxonomy (genetic_code_id);
CREATE INDEX name_idx on genetic_code (name);
CREATE INDEX parent_id_idx on taxonomy (parent_id);
CREATE INDEX scientific_name_idx on taxonomy (scientific_name);
I then entered the following to create the new table
sqlite> CREATE TABLE alias (
   ...>     id INTEGER PRIMARY KEY,
   ...>     tax_id INTEGER,
   ...>     alias STRING);
sqlite> 
I could have done this by modifying the model.py file and using it to update the table but I find that working with the SQL level is easier to understand because there's one fewer layer of abstraction going on.

Because I want to work with the database through SQLObject I still need to modify my model.py file to add the 'Alias' table. The new definition is

class Alias(SQLObject):
    taxon = ForeignKey("Taxonomy", dbName="tax_id")
    alias = StringCol()
The default identifier name is 'id' so I don't need the sqlmeta statement to override that. I want to be able to say "alias.taxon" to get the taxonomy record for a given alias. SQLObject's by default assumes the relation will be stored in the "taxonomy_id" but it's actually stored in "tax_id", so I need to override that using the 'dbName' option.

I also want to go from an annotation to the list of aliases, so I'll change the Taxonomy SQLObject class definition to include a new field called "aliases". The new bit is in bold:

class Taxonomy(SQLObject):
    class sqlmeta:
        idName = "tax_id"

    scientific_name = StringCol()
    rank = StringCol()

    parent = ForeignKey("Taxonomy")
    children = MultipleJoin("Taxonomy", joinColumn="parent_id")

    genetic_code = ForeignKey("GeneticCode")
    mitochondrial_genetic_code = ForeignKey("GeneticCode")

    aliases = MultipleJoin("Alias", joinColumn="tax_id",
                           orderBy="alias")
The MultipleJoin means that ".alias" will contain a list of all Alias entries where the 'tax_id' of the alias is the same as the Taxonomy's primary identifier. Normally the results are in arbitrary order. I want the results to be sorted alphabetically on the "alias" field so it's easier to read the list of aliases. I do this with the "orderBy" parameter, which takes either the SQL column name for the field to sort on, or the special/magic Table.q.fieldName from SQLObject.

Changing the templates

I'm going to update the record details template to include the aliases as well as the hyperlink to the edit page. I haven't defined the edit page yet. Because I want to see some results I'll edit the database through the TurboGears shell and add "canary damsel" as an alias to taxonomy record 142648.

[~/nbn/taxonomy/TaxonomyServer] % tg-admin shell
Python 2.4.2 (#6, Apr 15 2006, 11:26:48) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from model import Taxonomy, Alias
>>> taxon = Taxonomy.get(142648)
>>> Alias(taxon=taxon, alias="canary damsel")
<Alias 1 taxonID=142648 alias='canary damsel'>
>>> import model
>>> model.hub.commit()
>>> 
I then quit the shell and restarted it, to prove to myself that the database actually saved my new record. After getting the record again I'll request all of its aliases:
>>> from model import Taxonomy
>>> Taxonomy.get(142648).aliases
[<Alias 1 taxonID=142648 alias='canary damsel'>]
>>> 
Well what do you know - it worked!

Now that I have data I'll edit "templates/details.kid". Where the old template had:

<P>
Taxon identifier: ${taxon.id}<br />
Genetic Code: ${taxon.genetic_code.name}<br />
</P>
I'll add a new bit at the end
<P>
Taxon identifier: ${taxon.id}<br />
Genetic Code: ${taxon.genetic_code.name}<br />
Aliases: <span py:strip="True" py:for="alias in taxon.aliases">${alias.alias}</span> [<a href="/edit_aliases?tax_id=${taxon.id}">edit aliases</a>]<br />
</P>
and try out taxon/142648.

That seems to work so I'll create the new edit interface. This will be a new URL and function called "edit_aliases" which takes the tax_id as its input parameter. Here's the new controller for it, which will be part of the Root class

    @expose(template="taxonomyserver.template.edit")
    def edit_aliases(self, tax_id):
        return dict(taxon=Taxonomy.get(tax_id))
It's very simple as all of the display logic is in the template. The template will give an edit field for each alias in the database. I'll add 5 extra fields just in case someone wants to add a several aliases at the same time. (This is part of the use case - Jim wants to add two aliases for the specific damselfish species.)

Following is the new template

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:py="http://purl.org/kid/ns#"
    py:extends="'master.kid'">

<head>
    <meta content="text/html; charset=UTF-8" http-equiv="content-type" py:replace="''"/>
    <title>Edit aliases for ${taxon.scientific_name}</title>
</head>

<body>
[<a href="/">Start a new search</a>|
<a href="/taxon/${taxon.id}">View taxon detail page</a>]
<h2>Edit aliases for ${taxon.scientific_name}</h2>
<P>
Taxon identifier: ${taxon.id}<br />
<form method="POST" action="update_aliases">
<input type="hidden" name="tax_id" value="${taxon.id}" />
Edit, erase entries or add to the following list of aliases<br />
<ol>
 <li py:for="alias in taxon.aliases"><input type="text" name="alias" value="${alias.alias}" /></li>
 <li><input type="text" name="alias" /></li>
 <li><input type="text" name="alias" /></li>
 <li><input type="text" name="alias" /></li>
 <li><input type="text" name="alias" /></li>
 <li><input type="text" name="alias" /></li>
</ol>
<button type="submit">Update database</button>
</form>
</P>


</body>
</html>

There are a few things to point out in it. The form will go to a new URI named "update_aliases". I haven't created it yet.

The form uses a POST action instead of a GET action. This is important! The distinction is built into the way the web works. A GET request is supposed to be idempotent. That's a fancy way of saying it shouldn't be used to edit or delete things. Why? It's used for fetching existing records and documents, and only that. The web specification says that GET requests may be cached along the way. For example if you use a caching proxy it may store the results of old GET request. When you ask for it again the proxy may return the old results and not go all the way to the remote server to get the original document. This means the server never sees the change.

If you are interested in knowing more about web proxies, take a look a Squid. (Google for "squid proxy".) I've wondered how useful they are for places here in South Africa where out-of-country bandwidth is low and even in-country bandwidth is expensive.

Notice that all of the form input elements use the same variable "name" of "alias"? The to-be-written update_aliases controller will get those as the Python function parameter "aliases". If one alias is given then the function will get a string and if more than one aliases is given then it will get a list of all aliases.

One last thing to notice is the special "hidden" field. The update_aliases controller needs the list of aliases and the taxonomy record to edit. The web is "stateless" meaning that there are no built-in mechanisms for the web browswer or server to know the status of the other. Any important state information must be passed about through each request and response document. There are three main ways to do this. One is through hidden fields in form elements. These will not be visible to the user but will be sent back to the server when the form is submitted. Here the hidden field stores the taxon identifier.

Updating the database - edit_aliases

The edit_aliases controller will work in a very dumb and obvious way. To update the database for a given taxon it will delete all of the existing aliases then create new entries for each of the inputs. When done, redirect the browser to the details page for the taxon. This will list the aliases and confirm to the user that the update took place.

Here are the two changes I made to the "controllers.py" file. The first was the extra import for the "Alias" class:

from model import GeneticCode, Taxonomy, Alias
and the second is the actual controller code (another method in the Root class):
    @expose()
    def update_aliases(self, tax_id, alias=[]):
        # one input alias -> a string
        # multiple input aliases -> a list of strings
        if isinstance(alias, basestring):
            aliases = [alias]
        else:
            aliases = alias
            
        taxon = Taxonomy.get(tax_id)
        # Update in two steps:
        # 1. Get rid of all old aliases for this record
        old_aliases = taxon.aliases
        for old_alias in old_aliases:
            old_alias.destroySelf()
        
        # 2. Create new aliases
        for alias in aliases:
            Alias(taxon=taxon, alias=alias)
        
        # Go to the details page
        redirect("/taxon/%s" % (taxon.id))

I tried it out and it seemed to work just fine. But after I added "canary damselfish" and went back to the detail page I saw a problem. The aliases line looked like this:

Aliases: canary damselcanary damselfish [edit aliases]
There is no space between the aliases! I'll change that so it's a comma separated list of aliases. The old code was
Aliases: <span py:strip="True" py:for="alias in taxon.aliases">${alias.alias}</span> [<a href="/edit_aliases?tax_id=${taxon.id}">edit aliases</a>]<br />
There are several ways to fix it. I decided to push everything into Python and use the 'join' method of strings
Aliases: ${", ".join([alias.alias for alias in taxon.aliases])} [<a href="/edit_aliases?tax_id=${taxon.id}">edit aliases</a>]<br />
I saved the template and reloaded the details page only to find the next problem. The detailed aliases output was
Aliases: canary damsel, canary damselfish, , , , [edit aliases]
It had a few blank records! These were the empty form elements passed into the "update_aliases" function. What I'll do is remove extra spaces from the input and skip adding empty strings to the database. I'll use a neat Python idiom for this which also fixes things like multiple spaces between words (eg, "homo  sapiens" instead of "homo sapiens"). Here's the new step 2 for the controller
        # 2. Create new aliases
        for alias in aliases:
            # Clean up whitespace problems
            new_alias = " ".join(alias.split())
            # only add non-blank aliases
            if new_alias:
                Alias(taxon=taxon, alias=alias)
The trick is to strip the string, which splits the string into a list of non-whitespace words, then join the string using a " ". The result is a single-space separated list of words from the original string.

Fixing the database was easy. I went to the edit page and pressed the "Update database" button. This removed all of the old aliases, including the multiple blank ones, and added only the non-blank ones back into the database. Which reminds me, I should add code to remove duplicates, just in case the same name was added more than once by accident. With that in place here's the final update_aliases code

    @expose()
    def update_aliases(self, tax_id, alias=[]):
        # one input alias -> a string
        # multiple input aliases -> a list of strings
        if isinstance(alias, basestring):
            aliases = [alias]
        else:
            aliases = alias
            
        taxon = Taxonomy.get(tax_id)
        # Update in two steps:
        # 1. Get rid of all old aliases for this record
        old_aliases = taxon.aliases
        for old_alias in old_aliases:
            old_alias.destroySelf()
        
        # 2. Create new aliases
        already_added = {}
        for alias in aliases:
            # Clean up whitespace problems
            new_alias = " ".join(alias.split())
            # only add non-blank aliases
            if new_alias and new_alias not in already_added:
                Alias(taxon=taxon, alias=alias)
                already_added[new_alias] = True
        
        # Go to the details page
        redirect("/taxon/%s" % (taxon.id))

Version control

Remember version control? This project is under version control. Now that things are stabilized I'll check the code into version control. I need to add the new "edit.kid" template and check in the changes I made to model.py, controllers.py and the details.kid.

[taxonomy/TaxonomyServer/taxonomyserver] % svn add templates/edit.kid
A         templates/edit.kid
[taxonomy/TaxonomyServer/taxonomyserver] % svn commit
Sending        taxonomyserver/controllers.py
Sending        taxonomyserver/model.py
Sending        taxonomyserver/templates/details.kid
Adding         taxonomyserver/templates/edit.kid
Transmitting file data ....
Committed revision 25.
[taxonomy/TaxonomyServer/taxonomyserver] % 

Testing

I'll test the alias editing component by running through my user scenario. Ideally Jim would do it but he couldn't make it. To make it realistic I'll remove all aliases from the database before doing the testing. Here's how to do using a SQL command (through the sqlite3 interface):

DELETE FROM alias;
If you only wanted to delete some of the aliases then add a selection query after a "WHERE", like this:
DELETE FROM alias WHERE id > 10;

Here's how to clear the table using SQLObject, in this case through the TurboGears shell

>>> model.Alias.clearTable()
>>> model.hub.commit()
>>> 

I'll walk through the scenario and ask Jim "Add the common names 'canary damsel' and 'canary damselfish' to the species record Abudefduf luridus. This is a functional description in biologist terms. The goal is to see how easily someone can use the interface without any hints as to how it works.

Once done I'll follow up with "now add 'threespot damselfish' to Stegastes planifrons." There's a somewhat hidden requirement as well which is "add 'damselfish' as an alias for Pomacentridae" so I'll ask for that one as well.

Once this works, try a few variations. I've thought of several ways to improve the pages:

Remember, paper prototypes are likely the quickest and easiest way to get initial feedback on any design.

How do you determing which designs are better than others? One good measure is the time needed to complete a task, so use a stopwatch. There are ways to estimate the input time for a page with Fitts' Law, and add a couple of seconds for every web page load both for load time and for the user to figure out what the page does.



Copyright © 2001-2013 Andrew Dalke Scientific AB