[ previous | newer ]     /home/writings/diary/archive/2003/09/29/Numberings

## Sequence Numberings

Spent some time helping Bill with one of his projects. He needed a bit of coding done to make BLAST output into a form importable into UCSC's genome browser. Wouldn't have been too hard except he has no local network, a windows machine w/o Python installed, and only dialup access to the rest of the world, including where the data was. I must have spent half my time waiting for the cursor to catch up to my typing.

Besides the tedium of using emacs over a dialup, we kept running into an annoying complication in bioinformatics: sequence numberings. Suppose you have the sequence CCAAAC. What is the subsequence range matching AAA? There are three common answers:

• biological numbering - 3 to 5
Position 1 is the first character in the sequence, and the range i to j starts at i and includes j.

This is favored by biologists and most of the rest of the world. All of the common formats, including GenBank, SWISS-PROT, or PDB, use this range definition.

• computational biology numbering - 3 to 6
Position 1 is the first character in the sequence, and the range i to j starts at i and goes up to but does not include j.

Once you start working with biological numbering for a bit you realize there are a lot of +1 and -1 terms. Forgetting the term creates a fencepost error in programming vernacular. For example, suppose a PROSITE pattern matches residues 5 to 10; how long is the match? It's 10-5+1==6 not 5 which is the first thought many have. Suppose one block of an alignment goes from 10 to 15 and the next goes from 20 to 25; how long is the gap? It's 20-15-1==4 and not 5.

These fencepost errors are reduced by using a range definition which does not include the last number in the range. If the PROSITE goes from residues 5 to 10 then the range is given as 5 to 11, and the length is 11-5==6, which is the obvious answer. In the alignment example, the two blocks are given as (10,16) and (20, 26) and the gap size between them is 20-16==4.

I called this computational biology numbering because I couldn't think of a better name. It's used in the Life Sciences CORBA spec and a few other projects. The people who use it tend to be scientists (chemists or biologists) who become software developers or programmers who are told they need to have sequence numbers start at 1.

• computer science numbering - 2 to 5
Position 0 is the first character in the sequence, and the range i to j starts at i and goes up to but does not include j.

Some tricky problems remain in computational biology numbering. Suppose you have an alignment from a protein to a DNA sequence; no gaps, and the first residues of the protein is aligned to the first three bases in the DNA. What's the range for the three DNA bases which correspond to the fourth protein residue?

In biological numbering, the protein is at position 4 and the DNA range is 10 to 12. The equation for the DNA start position is 3*n-2 and for the end position is 3*n. Not to hard to figure out, but it does take some thought.

In computational biology numbering the range is 10 to 13, and the formulas are 3*n-2 and 3*n+1. Still requires a bit of thought, which increases the odds of making an error.

The problem is because the numberings start from 1. Suppose it starts from 0. This is what most computer scientists and some mathematicians and physicists use, even to the point of labelling the first chapter in a book "Chapter 0." These are the sort of people who get a thrill that European elevators use 0 for the ground floor and are dismayed that US elevators use 1 instead.

Using computer science numbering, the protein residue is at position 3 (because position 0 is the first character), and the corresponding DNA range is 9 to 12. The formula is 3*x to 3*(x+1) and very easy to remember.

In fact, the equation 3*n+2 used for the start position of the other two numbering can be rewritten 3*(n-1)+1, which basically moves the protein's 1 numbering to 0 numbering, multiplies by 3, and moves back to the DNA's 1 numbering.

You can probably guess from my explanation that I prefer the computer science numbering, at least for the innards of a computer program. It makes the math so much easier, and since most software is written by non-biologists, it means I can call them directly instead of doing an extra conversion step. But when I need to count things manually, I agree with the biologists and count things starting at 1 and including the last number in the range.

The right solution is to write software which uses computer science numbering internally does conversion from other numbering systems in the I/O layers -- including the GUIs for the biologists. Anything else causes problems.

I was helping Bill with a simple programming task - convert from one alignment format to another. BLAST uses biological numbering. Biopython's BLAST parsing code keeps the numbers as it finds them. But the output format doesn't document which of the three numberings it uses. The only way to figure it out is to load the file into a viewer and do a manual check. It used the computational biology ordering.

There's one more wrinkle, complementary sequences. One strand of DNA is the reverse complement of the others (so ATCG is the revese complement of CGAT). Suppose the protein is aligned to the negative strand DNA; how is the range on the negative strand represented? Are the start and end positions swapped? Are they in forward ordering but with a flag to say "these are on the reverse strand"? Are the positions recomputed from the other end (so x goes to N-x, where N is the length of the sequence)?

Three numbering systems... Three different ways to handle reverse complements... There are nine different ways to represent ranges! No wonder it's so annoying. And it won't be going away any time soon.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me