Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2003/09/29/Numberings

Sequence Numberings

Spent some time helping Bill with one of his projects. He needed a bit of coding done to make BLAST output into a form importable into UCSC's genome browser. Wouldn't have been too hard except he has no local network, a windows machine w/o Python installed, and only dialup access to the rest of the world, including where the data was. I must have spent half my time waiting for the cursor to catch up to my typing.

Besides the tedium of using emacs over a dialup, we kept running into an annoying complication in bioinformatics: sequence numberings. Suppose you have the sequence CCAAAC. What is the subsequence range matching AAA? There are three common answers:

You can probably guess from my explanation that I prefer the computer science numbering, at least for the innards of a computer program. It makes the math so much easier, and since most software is written by non-biologists, it means I can call them directly instead of doing an extra conversion step. But when I need to count things manually, I agree with the biologists and count things starting at 1 and including the last number in the range.

The right solution is to write software which uses computer science numbering internally does conversion from other numbering systems in the I/O layers -- including the GUIs for the biologists. Anything else causes problems.

I was helping Bill with a simple programming task - convert from one alignment format to another. BLAST uses biological numbering. Biopython's BLAST parsing code keeps the numbers as it finds them. But the output format doesn't document which of the three numberings it uses. The only way to figure it out is to load the file into a viewer and do a manual check. It used the computational biology ordering.

There's one more wrinkle, complementary sequences. One strand of DNA is the reverse complement of the others (so ATCG is the revese complement of CGAT). Suppose the protein is aligned to the negative strand DNA; how is the range on the negative strand represented? Are the start and end positions swapped? Are they in forward ordering but with a flag to say "these are on the reverse strand"? Are the positions recomputed from the other end (so x goes to N-x, where N is the length of the sequence)?

Three numbering systems... Three different ways to handle reverse complements... There are nine different ways to represent ranges! No wonder it's so annoying. And it won't be going away any time soon.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB