Based on an email to Lincoln Stein, dated March 20, 2003
Distributing bioinformatics data sets using BitTorrent
I've been thinking about ways to distribute bioinformatics data. At the Singapore biohackathon, Elia mentioned that EBI distributes about a petabyte of data a year, which got them in a bit of trouble with the UK academic network people. It's easy to see why; they have about 123 GB of data, and about half of that is updated every couple of months. That's 20,000 full downloads a year to get a petabyte, or roughly 500 sites which have a local mirror.
Other domains also want to distribute large data files. For example, suppose you want to distribute a por^H^H^Hhome video or a Linux distribution. If only a few people download it, there's no problem. But suppose your movie becomes the next fad (like hamsterdance). If your outgoing connection has limited bandwidth, and you can't afford to pay for a bigger pipe or a caching service like Akamai, then your network will collapse under the load.
Collaborative mirrors help somewhat. You don't need to go to redhat.com to download the latest copy of RedHat's OS, so long as you know where the mirrors can be found.
Recently I came across BitTorrent, which is a different way to solve the problem. It distributes the serving to everyone who downloads the data. When a client connects to the central tracker for a file, it gets a list of all servers which know about the file. It can then ask one or more of the servers for different parts of the file. What makes BitTorrent useful is that each client also becomes a server for the data it downloaded. If a lot of people download data then the number of available servers automatically increases, and no single site becomes overly saturated with download requests. (You do need one client, called the seed, which provides the original, complete data.)
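To make the "every client becomes a server" idea concrete, here is a toy simulation. It is only a sketch under simplifying assumptions of my own: time moves in rounds, each peer uploads at most one piece and downloads at most one piece per round, and pieces are picked at random. The name simulate_swarm and its parameters are hypothetical, and this is not how the real client schedules transfers.

```python
import random


def simulate_swarm(num_pieces, num_downloaders, rng_seed=0):
    """Toy round-based swarm; returns (rounds, uploads per peer).

    Peer 0 is the seed and starts with every piece. All other peers
    start empty and can serve any piece they held at the start of a round.
    """
    rng = random.Random(rng_seed)
    all_pieces = set(range(num_pieces))
    have = [set(all_pieces)] + [set() for _ in range(num_downloaders)]
    uploads = [0] * len(have)
    rounds = 0
    while any(h != all_pieces for h in have):
        rounds += 1
        snapshot = [set(h) for h in have]  # holdings at the start of the round
        busy = set()                       # peers that already uploaded this round
        order = list(range(1, len(have)))
        rng.shuffle(order)
        for peer in order:
            missing = all_pieces - snapshot[peer]
            # Every (source, piece) pair this peer could fetch right now.
            offers = [(src, piece)
                      for src in range(len(have))
                      if src != peer and src not in busy
                      for piece in sorted(snapshot[src] & missing)]
            if not offers:
                continue
            src, piece = rng.choice(offers)
            have[peer].add(piece)
            uploads[src] += 1
            busy.add(src)
    return rounds, uploads


if __name__ == "__main__":
    rounds, uploads = simulate_swarm(num_pieces=16, num_downloaders=8)
    print("rounds:", rounds)
    print("pieces uploaded by the seed:", uploads[0])
    print("pieces uploaded by everyone else:", sum(uploads[1:]))
```

In this model the seed can never upload more than one piece per round, so once a few downloaders hold pieces the rest of the transfers have to come from the downloaders themselves.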
The really clever part of BitTorrent is that it works in the adversarial environment of the Internet. Some peers may just want to download and not upload, while others may provide false data, perhaps even data infected with a virus. BitTorrent solves the first using economics, with an algorithm that ties download performance to upload performance, so those who upload more get more download bandwidth. The second is solved with cryptographically strong checksums: the .torrent metadata file, which a client fetches before it contacts the tracker, carries a SHA-1 hash for each piece of the file, so a client can reject any piece that fails verification.
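Both defences are small enough to sketch. The code below is an illustration under my own assumptions, not BitTorrent's actual implementation: choose_unchoked is a much simplified stand-in for the real choking algorithm, the piece length is tiny so the example is readable, and all function names are hypothetical. The one detail taken from the protocol is that pieces are checked against SHA-1 hashes.

```python
import hashlib

PIECE_LENGTH = 4  # assumption: unrealistically small, for illustration only


def piece_hashes(data, piece_length=PIECE_LENGTH):
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]


def verify_piece(index, piece, expected_hashes):
    """Accept a downloaded piece only if its SHA-1 matches the published one."""
    return hashlib.sha1(piece).digest() == expected_hashes[index]


def choose_unchoked(received_from, slots=4):
    """Pick who we upload to: the peers that have sent us the most data.

    received_from maps a peer name to the number of bytes it recently
    sent us. Peers that upload nothing end up at the back of the queue.
    """
    ranked = sorted(received_from, key=lambda peer: received_from[peer],
                    reverse=True)
    return ranked[:slots]


if __name__ == "__main__":
    original = b"ACGTACGTTTGA"
    hashes = piece_hashes(original)  # what the publisher would distribute
    print(verify_piece(0, b"ACGT", hashes))  # honest piece
    print(verify_piece(1, b"ACGA", hashes))  # tampered piece
    print(choose_unchoked({"a": 10, "b": 500, "c": 0, "d": 120}, slots=2))
```

Because every piece is checked on its own, a client can throw away one bad piece and fetch it again from another peer instead of restarting the whole download.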
BitTorrent seems like an excellent way to distribute GenBank, EMBL, PDB, and the other large bioinformatics data sets. It extends the current mirroring system so that anyone can join in. Right now, if you are in Japan you can go to DDBJ to get a local copy of GenBank, but what do you do in South Africa? Where's a good local mirror? With BitTorrent, if SANBI joins the GenBank swarm then others in South Africa automatically start getting better download performance. And if a backhoe should dig through their connection to the outside world, then the system gracefully degrades to use other sites.
What's especially nice is that BitTorrent doesn't require a large first step. Anyone can start up a tracker and provide the initial seed. There's no need to have a joint international meeting of the top data providers to decide upon this.
So give it a go and let me know. It's an easy publication. Just give me an acknowledgement. ;).
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2010 Dalke Scientific Software, LLC.