/home/writings/diary/archive/2003/09/04/URIs

URIs for bioinformatics records

I mentioned BitTorrent yesterday as a way to pass around large data sets. Most people don't need a complete copy of GenBank, the PDB, etc, in part because it's hard to use the raw data unless you have some non-trivial programming skills. Instead, people start with a sequence, BLAST it, get the matching records, and import those records into MacVector or whatever other tools they have at hand.

The fundamental data item in this view is the record, not the database. One question is: how do you get the record? With a BLAST search, you can toggle the ones you want to retrieve and then download them to a local file for import, and some tools can browse the remote site directly.

I want to make that easier, and better. Easier because the different tools each have their own way to find the data. Better because you should be able to have the system contact your local lab, university, or company copy of the file and, if that doesn't exist, look at, say, some place "close" to you, network-wise. Failing that, find it from the original data provider.
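
Here's a minimal sketch of that lookup order in Python. Everything in it is an assumption made for illustration: the source URLs are made up, and `demo_fetch` is a stand-in for a real network call so the example runs without a network.

```python
from typing import Callable, Iterable, Optional


def fetch_with_fallback(record_id: str,
                        sources: Iterable[str],
                        fetch: Callable[[str, str], Optional[str]]) -> str:
    """Try each source in order and return the first record found."""
    for source in sources:
        try:
            record = fetch(source, record_id)
        except OSError:
            # Source unreachable; move on to the next one.
            continue
        if record is not None:
            return record
    raise LookupError(f"{record_id!r} not found in any source")


# Hypothetical sources, nearest first: lab copy, nearby mirror, original provider.
SOURCES = [
    "http://lab.example.org",
    "http://mirror.example.net",
    "http://provider.example.com",
]


def demo_fetch(source: str, record_id: str) -> Optional[str]:
    # Stand-in for a network request: only the original provider has the record.
    fake_data = {"http://provider.example.com": {"2PLV": "HEADER    VIRUS"}}
    return fake_data.get(source, {}).get(record_id)


print(fetch_with_fallback("2PLV", SOURCES, demo_fetch))
```

The caller never needs to know which copy answered, which is the point.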

This is not a new idea. LSID offers one such scheme, and BioMOBY, EMBOSS, and SRS all have their own systems. But so far there isn't a consensus on an approach.

Suppose there is one. Let's say it looks like "pdb:2PLV" for the poliovirus structure, "swiss-id:100K_RAT" for a SWISS-PROT record, etc. (These proposals are easy to come up with, tricky because of details like versions and multiple, mixed sequence records like the PDB, and hard to get consensus on.) At some point, these names must get turned into a way to get the actual record.
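
To make that last step concrete, here's a rough sketch of a resolver that turns such a name into a URL. The URL templates and the `records.example.org` host are invented for this example; no data provider has agreed to them, and the sketch ignores the version and multi-record details mentioned above.

```python
from urllib.parse import quote

# Hypothetical URL templates, one per namespace.
URL_TEMPLATES = {
    "pdb": "http://records.example.org/pdb/{id}",
    "swiss-id": "http://records.example.org/swiss-id/{id}",
}


def resolve(name: str) -> str:
    """Turn a name like 'pdb:2PLV' into a URL using the templates above."""
    namespace, sep, identifier = name.partition(":")
    if not sep or not identifier:
        raise ValueError(f"expected 'namespace:identifier', got {name!r}")
    try:
        template = URL_TEMPLATES[namespace]
    except KeyError:
        raise ValueError(f"unknown namespace {namespace!r}") from None
    # Percent-encode the identifier so odd characters can't break the URL.
    return template.format(id=quote(identifier, safe=""))


print(resolve("pdb:2PLV"))
print(resolve("swiss-id:100K_RAT"))
```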

Suppose those are actual URLs. Real entities you can point a browser to and view. Then you can take advantage of the whole framework developed around URLs. For example, you can use a cache system like Squid, so that multiple requests to the same record are sped up. These can be layered, so you point to your group's cache, which points to the geographically close cache, which points to the original data.
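
The layering idea can be sketched in a few lines of Python. This is a toy in-memory model, not how Squid is implemented: a real HTTP cache also honors expiry and validation headers, which this ignores. The `CacheLayer` class, the `origin` function, and the URL are all made up for the example.

```python
class CacheLayer:
    """A cache that asks its upstream on a miss and remembers the answer."""

    def __init__(self, name, upstream):
        self.name = name
        self.upstream = upstream  # callable: url -> record text
        self.store = {}
        self.hits = 0

    def get(self, url):
        if url in self.store:
            self.hits += 1
            return self.store[url]
        record = self.upstream(url)
        self.store[url] = record
        return record


origin_requests = []


def origin(url):
    # Stand-in for the original data provider; records each request it sees.
    origin_requests.append(url)
    return f"record for {url}"


# Group cache -> regional cache -> original data.
regional = CacheLayer("regional", origin)
group = CacheLayer("group", regional.get)

url = "http://records.example.org/pdb/2PLV"  # hypothetical URL
group.get(url)
group.get(url)
print(len(origin_requests), group.hits)
```

The second request is answered by the group cache, so the provider only sees one request.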

Another possibility URIs provide is content negotiation, sometimes shortened to "conneg." When your browser connects to a web site, it tells the server what it can accept. Usually it says "give me anything you've got", but it could give preferences, e.g., to return the Spanish version of a document if there is one, or to choose PNG over GIF.
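
On the server side, that means reading the `Accept` header and picking the best representation on hand. Here's a simplified sketch of the selection logic. It handles `q` values and wildcards but skips other media type parameters, so treat it as an illustration of the idea rather than a complete implementation of the HTTP rules. The function names are my own.

```python
def parse_accept(header):
    """Return (media_range, q) pairs from an Accept header value."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_range, q))
    return prefs


def specificity(media_range):
    if media_range == "*/*":
        return 0
    return 1 if media_range.endswith("/*") else 2


def matches(media_range, media_type):
    if media_range == "*/*":
        return True
    range_type, _, range_sub = media_range.partition("/")
    mtype, _, msub = media_type.partition("/")
    return range_type == mtype and range_sub in ("*", msub)


def quality(prefs, media_type):
    """q value from the most specific range that matches media_type."""
    best_spec, best_q = -1, 0.0
    for media_range, q in prefs:
        if matches(media_range, media_type) and specificity(media_range) > best_spec:
            best_spec, best_q = specificity(media_range), q
    return best_q


def choose(header, available):
    """Pick the available type the client rates highest, or None."""
    prefs = parse_accept(header)
    best, best_q = None, 0.0
    for media_type in available:  # ties go to the server's first choice
        q = quality(prefs, media_type)
        if q > best_q:
            best, best_q = media_type, q
    return best


print(choose("image/png, image/gif;q=0.5", ["image/gif", "image/png"]))
```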

For sequence data, it might be useful to be able to ask the server for the record in a given format: "give me 100K_RAT in FASTA format." Even more interesting, ask it for an RDF dump of the associations the database knows about for that record. (I've heard of difficulties with conneg, e.g., a GET, edit, PUT cycle becomes more complicated server-side. An alternative is to provide links to the other available formats.)
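
From the client side, such a request might look like the sketch below, using Python's standard `urllib.request`. Two things here are assumptions: the URL is hypothetical, and `text/x-fasta` is a made-up media type, since as far as I know there is no registered one for FASTA. The request is only built, not sent, because there is no such server.

```python
import urllib.request


def build_request(url, media_types):
    """Build a GET request that states which formats the client prefers."""
    accept = ", ".join(media_types)
    return urllib.request.Request(url, headers={"Accept": accept})


# Prefer FASTA, but accept plain text at a lower preference.
req = build_request(
    "http://records.example.org/swiss-id/100K_RAT",  # hypothetical URL
    ["text/x-fasta", "text/plain;q=0.5"],            # text/x-fasta is an assumption
)
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))

# With a real server, urllib.request.urlopen(req) would send it.
```

Asking for the RDF dump would be the same request with `application/rdf+xml` in the list.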

Finally, HTTP offers lots of ways to talk with the server, not just the GET and POST people usually think about. For example, there's a PUT, used to add a new record, and a DELETE for removing a resource. This approach is called REST, for "Representational State Transfer". This starts to offer a way to annotate sequence records. But this is a topic for another time.
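
Just to show the shape of it, here's how PUT and DELETE requests could be built with `urllib.request`. The annotation URL layout is purely hypothetical, and the requests are constructed but never sent.

```python
import urllib.request


def put_record(url, body, content_type="text/plain"):
    """Build a PUT request that stores body at url."""
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": content_type},
    )


def delete_record(url):
    """Build a DELETE request that removes the resource at url."""
    return urllib.request.Request(url, method="DELETE")


# Hypothetical URL for an annotation attached to a sequence record.
annotation_url = "http://records.example.org/swiss-id/100K_RAT/annotations/1"

put_req = put_record(annotation_url, "Looks like a ubiquitin ligase.")
del_req = delete_record(annotation_url)
print(put_req.get_method(), del_req.get_method())
```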

Sadly, this is hard to implement. It's experimental, with little obvious initial benefit, which means the major data providers likely won't want to implement it. So you'll need to set up the primary server yourself, with a copy of everyone else's data. And you'll need to figure out the naming issues. And convince people to use it.

Wouldn't it be fun to try? :) If you are interested in funding us for this, let us know!


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB