CPU power vs. bandwidth
I posted this idea a couple years ago on the biopython mailing list. (Can't find it in the archive after a very cursory search.)
I'm a consultant, working for myself. When I started I only had dial-up access to the Internet. I tried to get DSL but "there weren't enough lines in the central office." This is in part because I live in Santa Fe, New Mexico, which isn't known for its telecommunications infrastructure. Lots of land, few people, fewer with money, and, excepting Santa Fe, not known for its high-tech companies. (For a city of under 70,000 people, Santa Fe has a surprising number of chemical informatics and complex systems companies.) We are likely within a few miles of the main Internet connection for Los Alamos National Laboratory, including Internet2, but locals can't simply tap into that line; the feds might frown on it if you tried.
I bought a house a few years ago, and when I first looked at it I was glad to see it already had DSL. Bandwidth is much nicer now, though still not as good as the 10 Mbit/sec I could get at school a decade ago. By comparison, the laptop I used to type this essay is at least an order of magnitude faster than the high-end server machines of that time.
I know all the stories about there being a lot of dark fiber in the main backbone because communications technology is improving faster than the demands on it, but let's face it, getting that bandwidth to the edge of the network is hard. Maybe not for a US national center or large company, but certainly for a small consulting house in Santa Fe, a startup in Lake City, Florida, or a research group in Cape Town, South Africa. Many startups in the late 1990s flopped because they expected otherwise.
Disk space, CPU power, bioinformatics data, and bandwidth are all increasing exponentially. The doubling periods I've heard are 1 year, 18 months, and 2 years for the first three. The last is hard to pin down. Gilder's Law states it's 6 months, but that's for fiber-optic capacity, not the last mile. Call it every 4 years, based on my personal experience.
Let's assume for now this trend continues. At some point (I get about 10 years) it will be impossible to download all of the new biological data over a 10 Mbit/second link. As a matter of personal philosophy, I want small groups and even a single researcher to have effective use of the data. My extrapolation says that unless we change things, only large or well-financed groups will be able to work on certain types of research problems, simply because of bandwidth limitations.
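To make that extrapolation concrete, here is a back-of-the-envelope sketch in Python. The starting data size, doubling period, and link speed below are illustrative assumptions I picked for the sketch, not measurements, so plug in your own numbers; with a few terabytes of data doubling every 2 years and a fixed 10 Mbit/second link, the crossover lands about a decade out.

    # Back-of-the-envelope: when does the yearly growth in data exceed
    # what a fixed last-mile link can transfer in a year?
    # All starting values are assumptions for illustration only.
    LINK_MBIT_PER_SEC = 10.0       # last-mile link, assumed to stay fixed
    DATA_TBYTES = 5.0              # assumed current size of the mirrored data
    DATA_DOUBLING_YEARS = 2.0      # assumed doubling period for the data

    SECONDS_PER_YEAR = 365 * 24 * 3600
    link_tbytes_per_year = LINK_MBIT_PER_SEC * 1e6 / 8 * SECONDS_PER_YEAR / 1e12

    year, total = 0, DATA_TBYTES
    while True:
        year += 1
        new_total = total * 2 ** (1.0 / DATA_DOUBLING_YEARS)
        new_data = new_total - total    # data added during this year
        total = new_total
        if new_data > link_tbytes_per_year:
            print("After about %d years the new data (%.0f TB/year) exceeds"
                  " the link's capacity (%.0f TB/year)."
                  % (year, new_data, link_tbytes_per_year))
            break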
What can be done to change that? I have several ideas. I've expounded on BitTorrent, but that helps the server and not so much the client. I'll talk about proxies and smart fetching in the future, but I have another idea which is more intriguing: User Mode Linux.
I mean that as a synecdoche for any sort of virtual machine. Once upon a time, in the simpler days of the Internet, everyone was friendly and it was easy to get accounts on different machines. For example, the group I was in during grad school had some special compute hardware for doing molecular dynamics. They published a paper describing the system and included a user name and password for the machine so anyone could try it out themselves.
It would be nice if everyone was nice. Think about what you could do if you could log on to a machine at NCBI with all the data on a local disk or database, and with lots of scratch space and CPU power. You could very easily make your own BLAST-able subset of GenBank, or create a specialized index of some properties you found interesting. Only the results of a search would need to be sent back to you, and those will almost certainly be much smaller than the whole database. And when NCBI updates the data there's no real bandwidth problem, since they can propagate it over a local high-speed network.
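For instance, a "BLAST-able subset" could be as small as the following sketch, run on the provider's machine where the flat files already live. It assumes Biopython and NCBI's makeblastdb program are installed there; the input filename and the organism filter are hypothetical placeholders.

    # Sketch: extract GenBank records of interest and turn them into a
    # local BLAST database. Assumes Biopython and NCBI's makeblastdb are
    # installed; "gbpri1.seq" and the organism filter are placeholders.
    import subprocess
    from Bio import SeqIO

    wanted = (rec for rec in SeqIO.parse("gbpri1.seq", "genbank")
              if "Homo sapiens" in rec.annotations.get("organism", ""))
    count = SeqIO.write(wanted, "subset.fasta", "fasta")
    print("Wrote %d records" % count)

    # Index the subset so blastn can search it.
    subprocess.check_call(["makeblastdb", "-in", "subset.fasta",
                           "-dbtype", "nucl", "-out", "subset"])

Only the hits from searching that subset would ever have to cross the slow link back to you.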
Not everyone is nice. People hijack others' computers for distributed denial of service attacks, for sending junk email, for spying, and simply for fun. For my idea to work there must be some way to let foreign code operate safely on a local machine, with severe restrictions on what it can do. One option is Java's security sandbox, but that requires everything be written in Java, and there is too much existing code in Perl, C, C++, Python, Fortran and other languages for that to be a real solution. Another is a chroot'ed jail, but it's hard to get a rich Unix environment working that way. (E.g., how do you set up a cron job?)
Instead, what if NCBI/Ensembl/whoever ran a virtual operating system independent of the main OS, and let each user have root access to an instance of that virtual OS? The main OS could be in charge of resource limits (network connections, disk space, CPU time) but otherwise leave the users free to install pretty much anything. Want to upgrade to the CVS version of Python? Go ahead! Want to tweak the system install of BLAST? Not a problem.
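In a real setup those limits would be enforced on the virtual OS instance itself; the Python sketch below only shows the flavor of the host-side policy, using per-process Unix limits from the standard resource module. The job script name is hypothetical.

    # Sketch: cap an untrusted job's CPU time and output size before
    # running it. A virtual-OS host would enforce limits like these on
    # the whole instance; this per-process version just shows the idea.
    import resource
    import subprocess

    def set_limits():
        # At most 60 CPU-seconds and 100 MB of created files.
        resource.setrlimit(resource.RLIMIT_CPU, (60, 60))
        resource.setrlimit(resource.RLIMIT_FSIZE,
                           (100 * 1024 * 1024, 100 * 1024 * 1024))

    # "user_job.py" stands in for whatever program the remote user submits.
    subprocess.check_call(["python", "user_job.py"], preexec_fn=set_limits)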
That gets the program closer to the data, but to be useful the compute provider also needs to organize the data for consistent access. At the start each site will have documents describing the filesystem layout, database schemas, and internal services. For portability, people will develop a naming scheme and translation layer so a program can fetch the data it needs from anywhere. This could happen even without my proposal; the O|B|F made an attempt during the hackathon. But I think currently the need isn't pressing enough.
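I don't know what that naming scheme would look like, but here is one hypothetical shape for the translation layer, just to show how little a portable program would need to know. Every dataset name and path in it is invented.

    # Hypothetical translation layer: map a logical dataset name to
    # wherever this particular site happens to keep it.
    # All names and paths below are invented for illustration.
    SITE_LAYOUT = {
        "genbank/primate": "/data/genbank/flat/gbpri",
        "swissprot":       "/data/uniprot/sprot.dat",
    }

    def locate(dataset_name):
        """Return the site-local location of a logically named dataset."""
        try:
            return SITE_LAYOUT[dataset_name]
        except KeyError:
            raise KeyError("this site does not carry %r" % dataset_name)

    # A portable program asks by logical name and never hard-codes paths.
    print(locate("swissprot"))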
There I go again, deep into the details. The overall idea is to make it easier for people to create and run data intensive programs. The administrative work is handled by the compute service provider and the intermediate bandwidth limitations have much less impact.
It's not likely to happen though. If there's anonymous access then it'll be used by people to find Mersenne primes or crack passwords. Instead the compute providers will need to issue accounts. A virtual OS does provide more defense-in-depth and lets users reconfigure the OS to fit personal needs. I just don't see those as being big enough advantages to warrant the administrative overhead, at least for the next 5 years.
Topics to think about:
- Is the idea of using virtual OSes a distraction? What does it add that a secure OS like FreeBSD can't provide? (Support for software that really, really wants to be installed under /usr, like binary RPMs?)
- Who now has bandwidth problems? Is it really getting worse?
- Will people really want all the world's sequence data?
- At some point the sequence data will stop growing exponentially and bend into an S-curve; after all, eventually we will have sequenced every organism on the planet. When will the inflection point be?
- What software developer can't set up a local database mirror? (Students and those using people-friendly languages like Python?)
- Maybe we just need an easier way to mirror data.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2013 Andrew Dalke Scientific AB