Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2010/11/13/ontologies_and_algorithms

Ontologies of Algorithm Implementations

Last week I was at the 2010 German Conference on Chemoinformatics. Janna Hastings presented some of the chemistry ontology work they are doing at EBI.

I have difficulties with ontologies. Part of it is a knee-jerk reaction to deep hierarchical classification schemes, which I don't understand well enough to be able to explain fully here. Nature is messy, and definitions have soft, vague edges. The ontologies I've seen which were more than a controlled vocabulary seem to want sharp edges, especially if the goal is to apply some sort of machine reasoning or first-order predicate logic to the result. (But remember, I'm not going to go into this, only show you my colors.)

One of the ontology terms she brought up was "results computed by algorithm X." I have more concrete arguments against modeling things this way.

Algorithms are not implementations

I've been working on a pair of fingerprint file formats. An optional metadata field contains the fingerprint type. I thought of defining that "MACCS" is used for all programs which implement the MACCS keys but ran into a few problems.

The biggest is that OpenBabel 2.3.0 and OEChem 1.7.4 contain bugs in their respective implementations. Briefly, the OpenBabel pattern definitions used 5-membered rings for the 4-membered ring field, 6- for the 5-, and so on, while in OEChem the bit for "has more than 1 component" turned out to be "has more than 0 components."

Both groups have been informed. The OpenBabel patterns are fixed in version control and I assume the OEChem ones will be fixed in the next release. But I don't know if other errors remain. I do know there are structures where they give different fingerprints.

That's where I decided to call their respective outputs "OpenBabel-MACCS/1" and "OpenEye-MACCS/1", replacing "1" with "2" once the new forms are available.

Therefore, it's important to track not just the algorithm (or algorithms) in use, but also the programs which implemented the algorithm.

More than one algorithm is involved

RDKit implements one of the MACCS bits as the SMARTS pattern"c:n". This is an aromatic carbon connected to an aromatic nitrogen by an aromatic bond. But what does "aromatic" mean?

The Daylight aromaticity model looks for the "smallest set of smallest rings (SSSR). If the number of π electrons in the ring == 2 mod 4 then it's a ring. The Daylight toolkit automatically determines aromaticity on every molecular structure it processes.

A number of tookits (OpenBabel and CDK) follow this approach, but still others do not. For example, OpenEye does not force a specific aromaticity model. The input structure is presumed to be in the right form and you are free to repercieve aromativity if you want, either with your own function or via one of the 5 aromaticity model perceivers included with OEChem.

In the table from the above link you can easily see cases where a "C-N" is not aromatic in the MDL, Tripos, and MMFF definitions, but is aromatic in Daylight's and OpenEye's models. Thus, the pattern "c:n" will be 1 for some models, and 0 for others.

In other words, multiple algorithm implementations are required in order to reproduce a certain output. Even knowing the program version (say, "OEChem 1.7.0") isn't enough, since the user can decide which aromaticity model to use.

More than one algorithm is involved, part 2

The problem isn't limited to the MACCS fingerprints. Even if the molecular perception was indentical, the hash fingerprints can still give different results. Many of these use the hash to seed a (pseudo) random number generator, then use the first few (1 to "several") outputs to populate a fingerprint.

Different programs use different RNGs. RDKit uses the one from Boost. Boost 1.40 fixed a bug which caused the RNG to produce non-uniform distributions. As a downstream consequence, RDKit fingerprints generated different hash fingerprints.

This is an example of just how hard it might be to track down the different factors which go into making a data set. RDKit doesn't even have a direct way to figure out which version of Boost it was compiled with.

My solution in the fingerprint format is a metadata field modeled on the user agent field in HTTP. Fingerprint generation tools can include multiple versioned implementation names so as to (hopefully) characterize what they did.

What is the algorithm?

I was the primary VMD developer in the mid-90s. People wanted 3D structure alignment, so I implemented the Kabsch algorithm from Acta Crystallographica (1976). It worked, excepting a few cases which I couldn't figure out.

Someone pointed out to me that Kabsch wrote a followup article in 1978. As I dimly recall, the original algorithm would sometimes generate a mirror image. The updated paper correctly handled that case.

Now consider the Indigo cheminformatics toolkit. It implements the Kabsch algorithm but the reference it gives is only to the 1976 paper. Given the quality of their work, I'll assume it includes the correction, but which does it use? (I couldn't find "Kabsch" mentioned in their code so I don't know.)

Or look at the Wikipedia page and see how Cealign implements "the Kabsch algorithm." I looked in the source code and found:

# @SUMMARY: -- QKabsch.py.  A python implementation of the optimal superposition
#     of two sets of vectors as proposed by Kabsch 1976 & 1978.

So when someone says "the Kabsch algorithm", do they mean the 1976 paper or the 1978 paper? Until I updated the Wikipedia page, there was no mention of the 1978 correction, and the Indigo example lends strength to my argument that people consider the 1976 paper as being "the reference" but the implementations include the 1978 correction.

You can easily see how different variations (what if the correction fixed a minor point? and was written by someone else?) might make things more complex. I recall learning some math theorem which goes by 3 letters (the surname initials of the 3 main developers) in every country except England, where an Englishman is the 4th name. But these are hypotheticals and I wanted to stay in the concrete.

Which algorithms are relevant?

If someone uses Python's string.find() as part of a text search, should they reference its "BMHBNFS" algorithm? If someone does a Tanimoto similarity, should they reference HAKMEM 169 or how does one refer to the similar method of Anderson, with bug fixes and improvements by others?

(For a bioinformatics example, if one does a pairwise sequence alignment, does one quote say it's Needleman-Wunsch algorithm with its O(n*m) time and space or Hirschberg's algorithm at O(n*m) time but only O(min(n, m)) memory. Wikipedia calls the latter a divide and conquer version of the Needleman-Wunsch algorithm. Both algorithms give the same answer, but one is more tractable.)

I point to these because they are efficiency improvements, either in space or time. Are those factors which go into the decision of how to reference an algorithm? Should one include all of the performance tweaks that goes into a piece of code? What about the unpublished ones? The answer is, of course, "sometimes." Let me know when you figure out an objective way to decide which algorithms to list.

How does an ontology try to capture this? If the ontology is complex enough to handle this, is it still easy enough for the simple cases? Once there's an agreed upon ontology for this, will there be a new special case which doesn't fit into the existing ontology?

I don't know, and I don't care enough to work on the ontologies. I prefer at best well-defined (and nearly non-hierarchical) controlled vocabularies along with enough background information so that people can figure out how the bits and pieces go together.


(P.S. There really is no strong point or conclusion to this post. It's details that Janna asked me to post so it would be externally referenceable as part of their ontology discussion.)

Updated: On 14 November 2010 I added a couple of paragraphs concerning algorithms which produce the same results but do so in less time or less space.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2020 Andrew Dalke Scientific AB