
Python in Bioinformatics and Chemical Informatics

PyCon 2004 / Andrew Dalke / Dalke Scientific Software LLC

Hello, my name is Andrew Dalke. I'm a software consultant for the computational life sciences. That's a cumbersome phrase, but it's the best I've been able to come up with. What it means is I help design and develop software systems which help computational chemists and biologists do their research.

I'm going to tell you a story. Well, several stories all intertwined. One is the story of high-level languages in the computational life sciences. Another is the story of my career and my interests in this field. A third is information about some of the available Python resources, and I'll probably include a few other subplots to keep things interesting.

As with all autobiographical stories, parts of this one likely are not true. I haven't done the research and fact checking to back everything up. But I hope it's at least entertaining and perhaps enlightening.

In the beginning there was Fortran. First developed in the 1950s, it became the predominant programming language in scientific computing. It was one of the first high-level languages, as compared to macro assemblers, and it was designed for scientists to do numerics, fast.

What I mean by "high-level language" is that it was the first language to let scientists with relatively little training in computers do science using a computer. And it was a wild success. One of the accounts I read was of one of the original FORTRAN developers giving a scientist a manual, and within a short time that scientist had written a working program.

Fortran is still widely used for some types of scientific programming, which surprises software people from nearly every other field. Not only is there nearly 50 years of working library code, but it still makes for fast code. I chuckle every time I see a paper talking about how Java approaches the performance of C++, since I remember papers 10 years ago talking about how, with expression templates, C++ performance could approach that of FORTRAN.

It's hard to say just exactly when computer programs started playing a role in understanding chemistry. Before the general purpose computer, back when you rewired boards, it was hard to tell the difference between the program and the computer. It's also hard to tell without digging up a lot of old papers because people are usually interested in the science, and not the software which enables the science.

It's also hard to tell because chemistry is a very broad field, with many sub-disciplines. The subfield now known as computational chemistry started in the late 1950s with people like Pople developing self-consistent molecular orbital theory, which uses a computer to find an internally consistent solution to the quantum description of a molecule. There was a large enough group of people that QCPE (the Quantum Chemistry Program Exchange) started in 1962 as a way for people to share software for quantum chemistry.

I mentioned high-level languages before. A language is used to communicate, and most of the time people think of a high-level programming language as a way for a person to communicate with a computer. It's more than that. A language is a way for people to communicate with other people. In the case of QCPE this was Fortran (Fortran II actually, and something called the Fortran Assembler Program). It's the earliest example I know of people sharing the results of their programming work, and it's hard to imagine that happening without a common high-level language.

QCPE is still around. You can go to their site and download source code, mostly Fortran but a few other languages as well. It's a bit strange, though, in that they charge based on the number of lines in the program, with the most expensive category being 10,000 lines or more.

Another early subfield to use computers was crystallography. This uses various chemical tricks to make a crystal out of the protein, DNA, or whatever else is being studied. The crystal is placed in an X-ray beam, which causes the X-rays to scatter in a pattern specific to the crystal structure. By reversing the process computationally (basically a Fourier transform) the original structure can be found. That's easy to say but harder in practice. Even if the structure is well defined and easy to crystallize, there usually isn't enough data in the scattering to know where all the atoms are. Instead, crystallographers combine that data with constraints given by the chemistry and found by experiments to develop an initial model of the structure, then use numerical methods, structure visualization, and intuition to refine the model.

The programs for doing this started in the mid-1960s. One early Fortran program, ORTEP, from 1965, draws a simple molecular structure along with the uncertainties in position. It is still available to this day. I think it's pretty cool that software older than I am is still being used. Either that or it says something about the slow progress in this field ;)

The refinement process is complicated. There are many ways to do it, based both on the structure being studied and on the person doing the work. The term used in science is "protocol," but you know it as "algorithm." Rather than code each protocol as a new Fortran program, perhaps using an extensive software library, several projects in the late 1970s and early 1980s decided to embed an interactive domain-specific language. The most important of these are CHARMm (1983) and X-PLOR (shortly thereafter). (I say "are" hesitantly since I know X-PLOR has been succeeded by CNS.)

These scripting languages made for an environment that was easier to use. Just a few lines of scripting could replace many lines of Fortran, and the interactivity meant researchers could work with the data more easily and flexibly. In addition, these programs came with source code, so you could add your own features if you wanted to. This is not the same as open source; at best you could distribute changes to others who had a license.

Structure visualization also started in the 1960s. The first work was by Levinthal, using an oscilloscope as the display! Computer graphics required expensive, specialized hardware for decades, so the programming language used depended on the computer system. By the late 1970s there were a few widely used visualization programs: GRIP, started in 1971 and written in PL/I; Frodo, from 1978; and Bilder, from 1980.

Of these, Frodo had the most impact, in part because it was widely shared. People would use it at one site, like it, work on it, and bring a copy of the source with them elsewhere. (On tape, naturally.) The software branched because different sites were doing different research or had different visualization hardware. Without the internet there was no good way to merge the branches. Plus, many people in this field are good enough to write code for themselves but know it isn't good enough for others. This hasn't changed.

Frodo ended up highly fragmented. Besides the normal problems with fragmentation that you know about, e.g., time wasted redoing someone else's work, fragmentation is more worrisome for science because it makes it harder to reproduce someone else's results. You may need the specific version used by the author, and it may be hard to track that down. It's made worse because people often forget to list exactly which version of the software was used, preferring the canonical literature reference.

Towards the end of the 1980s Alwyn Jones worked on reconciling the different threads and produced O, written in C. It's still being improved and is one of the most widely used programs for crystallography. Still, if you want, you can download TurboFrodo.

As you might have inferred, visualization is a bit different from the other scientific programs. For a long time it was dismissed as "making pretty pictures," akin to what a draftsman does -- and for the most part it still is. It's relatively hard to publish, especially as new features are added, and so people who work on visualization software are less competitive in the publish-or-perish environment of academic life.

The scientific examples I gave so far were modeling oriented. Loosely speaking, that means applying techniques from physical chemistry to understand how a molecular structure works and interacts. Given a compound, do something with it. Another subfield is chemical informatics, which helps find the compound in the first place. For example: find all published chemical data about a given compound, or list all compounds in the in-house corporate collection (of already-synthesized drugs) which have a given substructure or are similar to some other compound.

The field of chemical informatics is quite old. It is closely tied to chemical notation, which started with Berzelius nearly 200 years ago. The late 1800s saw the rise of paper-based information systems, which moved to machines in the 1920s and to computers in the 1950s after the Univac became available. The very earliest programs I dug up were in assembly for the Univac. I haven't yet found much about the languages used in the 1960s and 1970s, but I suspect it was Fortran and PL/I.

I've mentioned PL/I before but didn't talk about its impact on chemistry. That's because I don't know what it was. I suspect it was used where C would be used a decade or so later, as a fast, portable language that could handle complex data structures, as compared to Fortran IV. See, from a computer science perspective, molecular modeling isn't all that complex. Most of the programs are structured like this: a list of atom types, a list of bond types, a list of coordinates, some data table lookups, and code to solve some linear algebra or numerical integration problem.
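
To make that concrete, here's a minimal sketch of that layout in Python. The atom names, coordinates, and bonds are invented for illustration; a real program would read them from a file before getting to the numerical part.

    import math

    # Invented data in the layout described above: parallel lists of
    # atom types and coordinates, plus a bond list of atom-index pairs.
    atom_types = ["N", "CA", "C", "O"]
    coordinates = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0),
                   (2.0, 1.4, 0.0), (3.2, 1.5, 0.1)]
    bonds = [(0, 1), (1, 2), (2, 3)]

    # The "numeric" part: compute each bond length.
    for i, j in bonds:
        (x1, y1, z1), (x2, y2, z2) = coordinates[i], coordinates[j]
        length = math.sqrt((x2-x1)**2 + (y2-y1)**2 + (z2-z1)**2)
        print("%s-%s bond: %.2f" % (atom_types[i], atom_types[j], length))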

Databases are different, because they handle complex record types and deal with graph searches, bitmap tests and other non-numeric algorithms. Visualization code also handles complex data and has its own set of non-numeric algorithms, and it has to manipulate a lot of user-defined state. Fortran is not the right programming language for this.

C started being used in the early 1980s, in large part because of DEC and the affordable minicomputer. This allowed a single research group to have its own computer. Because more people had computers, it became possible for people to start commercializing software.

That's not to say there weren't commercial software systems before then, but they were rare. The ones I knew about were time-share based. For example, in the 1960s ISI (the Institute for Scientific Information, a for-profit company) indexed the most popular chemistry journals and provided chemistry and literature searches. In part because of their early use of computers to help enter and maintain all the chemistry data, they kept up to date with the literature. By comparison, CAS (the Chemical Abstracts Service) was about three years behind.

In the 1960s ISI could mail the results of the searches to their customers, acting as a clipping service (or early form of RSS), and in 1969 they supported ICI's (Imperial Chemical Industries) CROSSBOW (Computerized Retrieval of Organic Structures Based on Wiswesser). Written in a mixture of COBOL and assembly, it let people do queries against their server. In the 1970s they and other companies provided more powerful clients, with support for graphical input. But for the most part these were service based commercial systems and not ones distributed as products.

The first application suites started in the 1980s, merging capabilities from many different subfields (molecular modeling, chemical informatics, and others) into a GUI-based end-user program. The first one I know of was Tripos's Sybyl, which started in 1979 but had a major rewrite in the mid-1980s, when it was converted from Fortran to C. Similar programs included Quanta (1984) and Insight (1984).

For this talk, Tripos is the most interesting. It included SPL, the Sybyl Programming Language, as part of the rewrite into C. One of the main developers used to work at Bell Labs, where he learned about embedding application-specific languages. They figured it was a good idea for prototyping and expected they would rewrite the prototype for production. It turns out that SPL was fast enough and good enough that more and more of Sybyl was written in SPL rather than C. This meant the company could get more done with fewer people, which let them stay competitive. Sybyl is still one of the leading modeling packages, in part because of the flexibility that comes from including SPL.

So far I've talked a lot about chemistry software but not about biology software. That's partially because I know the chemistry software field much better than the biology one, so I can weave a more cohesive story. It's also because the history of chemistry software is much better documented, because it's an older field, and that in turn is because chemistry is much more mathematically oriented than most biology.

Still, there are a few interesting observations about how high-level languages are used in biology software that differ from chemistry. One is GCG, also called the Wisconsin Package. It is a set of interoperable programs for doing DNA and protein sequence analysis. Each program does one thing well, and the output from one program can be used as the input for another. This lets scientists develop new programs using a shell language (under Unix or VMS). Shell languages are neat because they let you experiment with things interactively and because they let you mix and match programs written in different languages.

Okay, I've taken you up through history to the early 1990s. This is a good breaking point for two reasons: it's about when the modern high-level languages (like Python) were released, and 1992 is when I first started in this field.

In retrospect, my interest in high-level languages started early. In high school I wrote a BASIC interpreter in BASIC, and in college my larger programs usually included an interpreter of some sort. The lab I worked in did image analysis of liquid crystal flows. The image tools were a set of Unix utilities, so I learned shell scripting and grew to enjoy awk.

I worked a bit on molecular dynamics in 1992 and in grad school joined a molecular dynamics group in 1993. We developed a parallel molecular dynamics program called NAMD. It was written in C++ in part because we didn't know (or want to know) FORTRAN and in part because the harder part of distributed computing is data management, not calculations.

We also developed a visualization program called VRChem, which became VMD. This was also written in C++, but it had its own simple scripting language for doing command-oriented tasks like "load a molecule" or "rotate around the x axis by 5 degrees". I had some ideas on making a new language for molecular modeling, but I'm one who tends to use other packages rather than make my own, so I looked around a bit first.

I used Perl 4 a lot for CGI scripts and as a replacement for awk. I loved working in it for simple scripting, but it couldn't handle complex data structures and it was hard to embed. I looked at Python but heeded people's cautionary statements about indentation (boy, was I wrong). Tcl's syntax was a powerful superset of the one we had already, so it was a more natural fit.

Tcl turned out to be a great fit, especially once I figured out how to make atom selection "objects" and added support for people to display their own graphics. This let the researchers do complicated analyses of their structure data and display the results along with the structure itself. Some of the software they wrote was nearly 1,000 lines long and quite impressive in what it displayed.

We weren't the only ones to use Tcl. Several other visualization packages included an embedded Tcl, including Cerius2, a commercial system. (Though I don't like the way they did atom selections. :)

Tcl wasn't as popular with the software developers. Oh, it was great for file parsing, but it was slow, didn't have namespaces (this was in the pre-8.0 days), and didn't really handle complex data structures -- we had to use a third-party package to support real dictionaries! This meant there were two classes of users: "researchers" who did things in Tcl, and "programmers" who did things in C++. I wanted a language which was more scalable than that.

I was reintroduced to Python by Jim Phillips in, I believe, 1996. After reading through the documentation I decided that that was the language I wanted to use. It had all sorts of useful things for programmers -- real data structures, a large number of libraries, support for C extensions -- and looked simple enough for computational scientists. But it would take a couple of years and two changes of field before I got a chance to use it.

I worked for a while at a company called MAG ("Molecular Applications Group"). I helped develop a web-based bioinformatics server in 1997/1998. We wanted to sell this as a product, which meant we had to do it in Perl, because that's what bioinformatics people used.

Why Perl? Think back to the early days of the web, now more than 10 years ago. Interactive web pages could only be done with CGI. The old NCSA server (yoohoo.ncsa.uiuc.edu) had a dozen examples of writing CGI programs in different languages and the easiest to figure out was Perl. Indeed, that's how I learned about Perl.

There was a third-party package called cgi-lib.pl from Steven Brenner, and nearly everyone doing CGI scripts in Perl used this library. What's interesting is that Steven is in bioinformatics. Just as interesting, when Perl 5 came out, Lincoln Stein developed a replacement package for cgi-lib based on Perl 5's new module system. And Lincoln is also in bioinformatics.

The most popular CGI libraries for Perl were both developed by bioinformatics people! I think the main reason why is that bioinformatics has a long history of sharing data, going back even to the 1920s with the early work on Drosophila. Sharing was needed to do good science. It took a lot of work to figure out the genome, and no one lab could do it all or come up with all the ideas. It's also hard to make money from sequence data: we have the human genome now, which means that with a good target and half a billion dollars of research we might have a new drug.

So nearly all of the bioinformatics data and even most of the software is freely available, funded in large part by taxes and research grants. Compare this to the pharmaceutical datasets used in chemical informatics where the tie between the data and making money is much clearer. In that field only about 5% of the data is public. The rest is in proprietary databases, either commercial or in-house and very few programs are freely available.

Perl specifically was picked because bioinformatics got its start in the 1980s managing the early sequence databases, which were frequently developed on Suns -- cheap machines that came with the Sybase database. Many of the early people had Unix experience, which fits quite naturally with Perl.

I also think Perl worked because of its strong support for strings, string editing, and regular expressions. These are core parts of bioinformatics, where sequences are represented as strings. Early bioinformatics work didn't need complex data structures like trees, so Perl's limitations weren't that big a problem.
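
To show the flavor of that kind of work, here's a small illustrative snippet in Python. The sequence is made up, though GAATTC really is the EcoRI restriction site.

    import re

    seq = "ATGGAATTCCTAGGAATTCAA"   # a made-up DNA sequence

    # Regular expression search: find EcoRI restriction sites (GAATTC).
    for m in re.finditer("GAATTC", seq):
        print("EcoRI site at position %d" % m.start())

    # String editing: compute the reverse complement.
    complement = str.maketrans("ACGT", "TGCA")
    print(seq.translate(complement)[::-1])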

(See also Lincoln Stein's How Perl Saved the Human Genome Project)

Steven started Bioperl in the mid-1990s as a way for people to share their Perl programs. Perl 5 came out in 1994 with support for modules and complex data types, and within a few years of that Bioperl became a set of Perl libraries for doing bioinformatics. Bioperl had its first meeting (where the different developers got together) in 1998, and Biopython joined them in 1999, which was the basis for the Open Bioinformatics Foundation.

Remember Python? This talk is about Python. We left off with me working on a bioinformatics server written in Perl, because that's what our potential customers were using. I wrote something like 8,000 lines of Perl over 6 months -- clean code with a lot of functionality. But I had a hard time explaining the result to non-programmers, and even to programmers. I managed to convince Jeff Chang, a co-worker, that Python was a better language than Perl. He was working on an in-house project where he got to try it out. He agreed with me and has stayed with Python since then.

At the Objects in Bioinformatics conference in 1999 I held a birds-of-a-feather meeting on high-level languages in bioinformatics. At it, Jeff and I decided to start Biopython along the same lines as Bioperl. Rather than duplicate effort, we asked the Bioperl people if we could work with them and use their servers. They agreed, and the BioJava folk followed soon afterwards.

Together we decided to start the Open Bioinformatics Foundation, a non-profit company to manage the computer systems needed by the different groups. We also have a meeting every year, BOSC, which is a satellite conference of ISMB, the big yearly bioinformatics conference. The OBF mostly hosts infrastructure projects: libraries for the different languages, standards like XML formats and CORBA object models, and an SQL schema. At about the same time, Jeff Bizzaro started Bioinformatics.org, which is more along the lines of Sourceforge and is mostly aimed at applications and small packages.

So if you're looking for packages for bioinformatics you should try out those two sites first.

Biopython, Bioperl, etc. are libraries. This style of programming is not new. The earliest commercial success I know of is the Daylight toolkit, a set of libraries for chemical informatics which started some 20 years ago. After I left MAG I worked for a while at Bioreason, which developed data mining tools for analyzing high-throughput chemical screens. We used Daylight as our basis for understanding the chemistry. It was convenient in part because Daylight and Bioreason are both located in Santa Fe, so it was easy to get help and report bugs.

Daylight sells toolkits with C and Fortran bindings. For several years I had wanted to use Python, so I looked into ways to make Daylight's C API accessible from Python. Luckily, Dave Beazley wrote SWIG and Roger Critchlow wrote DaySWIG. But the result still felt like C. I instead wrote PyDaylight, a thick wrapper over the toolkit that feels like Python. It handles garbage collection, iterators, attributes, and exceptions, all in a very object-oriented fashion.
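
To show what "feels like Python" means, here is a toy sketch of the thin-versus-thick distinction. The dt_* functions are stand-ins for SWIG-generated C bindings; they and the Molecule class are hypothetical illustrations, not the actual Daylight or PyDaylight APIs.

    # Hypothetical stand-in for a SWIG-level binding: returns an opaque
    # handle (here just a dict) or None on failure, C style.
    def dt_smilin(smiles):
        if not smiles:
            return None
        return {"smiles": smiles,
                "atoms": [c for c in smiles if c.isalpha()]}

    def dt_dealloc(handle):
        pass   # a real binding would free the C-side object here

    class Molecule:
        """Thick wrapper: attributes and exceptions instead of
        handles and status codes."""
        def __init__(self, smiles):
            self._handle = dt_smilin(smiles)
            if self._handle is None:
                raise ValueError("could not parse SMILES: %r" % smiles)

        def __len__(self):   # len(mol) gives the atom count
            return len(self._handle["atoms"])

        def __del__(self):   # ties C-side cleanup to Python's GC
            dt_dealloc(self._handle)

    mol = Molecule("c1ccccc1")   # benzene
    print("%d atoms" % len(mol))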

This let us do our work a lot faster and let more people in the company develop new code with fewer errors. We released PyDaylight under the LGPL in 1998 and presented it formally in 1999. I wrote a Dr. Dobb's article about it too. I'm rather happy about that program because it's the basis of the consulting work I've been doing for the last 4 years.

Vertex Pharmaceuticals was one of my first clients. I had convinced Pat Walters there to take a look at Python, especially with PyDaylight. He was a big Perl programmer but decided to switch over to Python and now Vertex is a big Python (and Java) shop.

PyDaylight depends on a commercial toolkit. Brian Kelly, who worked with me at Bioreason, has developed FROWNS, a free version of the PyDaylight API.

There are also Python toolkits for molecular modeling. The earliest is MMTK, from Konrad Hinsen, which is mostly for macromolecules.

Another company in Santa Fe, OpenEye, developed first OELib (now Open Babel) and then OEChem, an extensive set of libraries for modeling with both small-molecule and large-molecule support, e.g., for doing docking. Bob Tolbert developed Python bindings for OELib, which persuaded OpenEye to hire him and have him develop the Python bindings for OEChem. I like to think I helped with that persuasion.

As for visualization, there are many choices. VMD now also supports Python, and UCSF's Chimera is written in a combination of Python and C++, as is Warren DeLano's PyMol. PyMol is currently the best open source molecular modeling package. Michel Sanner also has a complete set of data-flow tools for molecular visualization. (BTW, Michel has been pushing for a common Pythonic API for molecules and ran a conference on this topic last fall.)

There's also PyQuante for doing quantum chemistry; I've heard about in-house work at Tripos writing programs with Python; and I know various pharmas and biotechs are using Python for in-house work.

So, in contrast to bioinformatics, where Perl is king, Python is well on the road to being the high-level programming language for computational chemistry.

There are a few other high-level languages for the life sciences which I'll mention. Lisp is used at a few places. EcoCyc is one, mostly I think because it was started by someone who's been using Lisp for decades. Like I said earlier, the programming for these fields is mostly boring in a computer science sense. The problems almost never need the cool macro solutions that set Lisp apart from most other languages.

There are also some people doing Smalltalk, including a bioSqueak. Baylor College of Medicine is the only place I know using Smalltalk.

Klotho used Prolog as a declarative language for molecular structure and assembly.

And there's SVL, the Scientific Vector Language used by CCG's MOE. It's a non-OO imperative language with special scalar types like molecule as well as support for higher-dimensional arrays.

SVL points out some of the things that Python does not do well. I and others would really like a built-in multidimensional vector data type, including C-speed numerical operations. I hope Numeric/numarray becomes that some day, but I know it will still be a while.
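
As a small example of what I mean (assuming the Numeric package is installed; the coordinates are made up):

    import Numeric

    # One row per atom; whole-array arithmetic runs at C speed instead
    # of in a Python loop.
    xyz = Numeric.array([[0.0, 0.0, 0.0],
                         [1.5, 0.0, 0.0],
                         [2.0, 1.4, 0.0]])

    center = Numeric.sum(xyz) / len(xyz)   # geometric center (sum over axis 0)
    print(xyz - center)                    # translate all atoms at once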

A good built-in numeric package, including statistics, would make Python more useful for gene expression analysis. Currently many of those analyses are done in R.

Python doesn't take real advantage of multiple processors, because of the global interpreter lock. I know about some ways around it but they are kludges.

I want some way to run Python code in a sandbox, so a researcher can download and run someone else's analysis code from the web without worrying about the evil it might do. PyPy should help with this.

One of the biggest complaints I get about Python is that it doesn't have a good standard cross-platform GUI framework. That's the main reason people still use Java. I don't understand that one, because the complaint is that Tkinter doesn't use native widgets, but I thought Java didn't either. But I'm not a GUI person.

And finally, something that Warren first explained to me is that Python doesn't make for a good shell or command language. "ls -l *.txt" is lengthy to write in Python, and "rotate x 5 degrees" is easier to explain than "viewMatrix.rotateX(5)". His PyMol has a dual-mode interface, similar to the interactive Python used by SciPy, which can distinguish between a shell-like command and a Python statement. The fundamental issue here is that basic Tcl is still easier for a few things than basic Python. I would like to experiment some to see how much of a problem this is.
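
Here's a toy sketch of the dual-mode idea: treat a line as a command if its first word is a known command name, and otherwise hand it to Python. This is just an illustration of the approach, not how PyMol actually implements it.

    # Hypothetical command table; "rotate x 5 degrees" maps to
    # rotate("x", "5", "degrees").
    def rotate(axis, angle, unit="degrees"):
        print("rotating %s by %s %s" % (axis, angle, unit))

    commands = {"rotate": rotate}

    def interpret(line, namespace):
        words = line.split()
        if words and words[0] in commands:
            commands[words[0]](*words[1:])   # shell-like command
        else:
            exec(line, namespace)            # plain Python statement

    ns = {}
    interpret("rotate x 5 degrees", ns)
    interpret("angle = 90", ns)              # ordinary Python assignment
    interpret("rotate y %(angle)s" % ns, ns)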

There you have it: a summary of the last 50 years of high-level languages in the computational life sciences and how Python fits into that history. You got a lot of information in the last 30 minutes; I hope you found it interesting, or at least useful.

Any questions?



Copyright © 2001-2010 Dalke Scientific Software, LLC.