/home/writings/diary/archive/2003/09/06/RSS

RSS for bioinformatics

One of the reasons I wrote PyRSS2Gen was to experiment with RSS for data collection in bioinformatics. Last year I came across PubCrawler, which periodically searches PubMed and GenBank and emails you a summary of new matches to your searches. It's a nice idea, in part because managing that data yourself is error-prone.

The trend these days is to make that data available through RSS. With a good RSS client this should be better than email because it can accumulate all the entries over time, and the trackbacks would let you make comments about the hits, potentially sharing them with others. (This is all theory - I haven't used a high-end RSS client.)

During that time I also found a site which did some RSS feeds for PubMed searches, but I can't find it now. I did come across HubMed and my.PubMed which do have RSS feeds. (I tested both to find one of my papers; search for "dalke" with a refinement of "tcl". I found HubMed the easier of the two. It wasn't obvious how to refine a search in my.PubMed.)

In theory, a lot of searches could have RSS front-ends. What about a BLAST job run every week, where the RSS feed tells you about the new matches? What about an annotation system where you can comment on regions of a sequence and let others know about it? (DAS does some of that, but I would like it to integrate with other non-biology tools. I think it's close, and something to consider for DAS2.) And how do PIE's editing features fit into all this?
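As a rough sketch of the weekly-search idea: assuming each hit has a stable identifier (a PubMed ID, say, or a sequence accession), the job only needs to remember which identifiers it has already reported. The function name `find_new_hits` and the JSON state file are hypothetical, made up for illustration; they are not part of PubCrawler, BLAST, or any other existing tool.

```python
import json
from pathlib import Path


def find_new_hits(current_ids, state_path):
    """Return the ids not reported by an earlier run, and record them as seen.

    current_ids: identifiers from this run's search, in search order.
    state_path: hypothetical JSON file holding the ids already reported.
    """
    state_file = Path(state_path)
    if state_file.exists():
        seen = set(json.loads(state_file.read_text()))
    else:
        seen = set()

    # Keep the search order, and skip anything already reported
    # (including duplicates within this run).
    new_ids = []
    for hit_id in current_ids:
        if hit_id not in seen:
            seen.add(hit_id)
            new_ids.append(hit_id)

    state_file.write_text(json.dumps(sorted(seen)))
    return new_ids
```

Each run would pass in the identifiers from that week's search, and only the returned ones would become new feed entries. A real system would probably also want to notice when an existing hit's score or annotation changes, which this sketch ignores.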

There are a few prerequisites to doing this. The first is a way to automate PubMed, GenBank, BLAST, and other searches. Biopython, bioperl and the other Bio* projects all do this to some extent, though I think the EUtils code we contributed to Biopython is the most powerful. The second is support for RSS generation; it's not a hard task, but there are still a lot of incorrectly formatted RSS feeds, so we developed PyRSS2Gen.
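To give a feel for the RSS generation step, here is a minimal sketch that turns a list of search hits into an RSS 2.0 document. It deliberately uses only the Python standard library (`xml.etree.ElementTree`) rather than PyRSS2Gen, so it runs without installing anything. The `build_rss` function, the dictionary keys, and the example.org URLs are all assumptions for illustration.

```python
import datetime
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items, now=None):
    """Build an RSS 2.0 document as a string.

    items: list of dicts with "title" and "link" keys (a hypothetical
    layout for search hits, not one defined by any library).
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    # RSS 2.0 dates use the RFC 822 style that email.utils produces.
    ET.SubElement(channel, "lastBuildDate").text = format_datetime(now)

    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # Reuse the link as the guid so clients can tell entries apart.
        ET.SubElement(node, "guid").text = item["link"]

    # ElementTree escapes characters such as "&" and "<" in the text.
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    hits = [{"title": "Hit 1: tcl & molecular graphics",
             "link": "https://example.org/hits/1"}]
    print(build_rss("New search hits", "https://example.org/feed",
                    "Weekly search results", hits))
```

Building the feed through an XML API instead of string concatenation is what avoids most of the malformed feeds mentioned above, since escaping is handled for you.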

The third is time and money, since there are too many interesting things to work on and too many bills to pay. And the last is access to end-users, which is essential to know that what we're doing makes sense in the real world.

All of our clients for the last few years have been chemists, not biologists. Chemists also do searches, but it's a bit different than in biology. There isn't anywhere near as much public data for chemistry as there is for biology. There is ACD and the other commercial databases, but very few are freely available, and I'm told that those databases are only a small fraction of the data locked up in the various pharmas and other chemistry companies. And outside of conferences it's rare for people to talk to people in other companies about their research. Even at conferences it's often highly vetted by the lawyers.

This means most of the data systems are local, with a larger diversity of servers. Any software must know how to integrate with Thor/Merlin, Isis, Unity, local Oracle schemas, and whatever else might be hanging around. Since relatively little new data comes into the company compared to in-house generated data, it's often easier for one researcher to talk to the person doing the experiment instead of going through a computer system. Only in the large pharmas will you start approximating the problems that RSS and PubCrawler resolve.

This is another project we would enjoy working on, so if you are interested in funding us, let us know!


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB