
Testing hard algorithms

I developed software to find a maximum common subgraph (MCS) given a set of molecules represented as a chemical graph. It's called fmcs. My previous three essays were about the background of the MCS problem, introducing fmcs, and an example of when MCS is used.

What I didn't describe was the mental effort it took to develop this program. This is the second time I've written code to find the multiple structure MCS, and both times it took a couple of months and put my head in a very strange place. You would think the second time is easier, but it means that I spent more time adding features and doing things that my first version couldn't begin to handle. (Did someone just mutter "second system effect"? Pshaw!)

This time I also had a better understanding of the development process, and I think I know why it's so much harder than most of the software I develop. Back in the late 1990s I almost never wrote automated tests, and never wrote an extensive test suite. Nowadays I'm pretty thorough, and use coverage-guided techniques to get good, extensive tests through some semi-permanent API layer that will allow me to refactor the internals without changing the tests.

Think about the problem for a moment. How would you write an algorithm to find the largest common subgraph between two graphs?

Take your time...

You probably came up with a graph-walking algorithm which tries different match pairings and uses backtracking or lazy evaluation to search all of graph space. The more mathematically inclined might have converted this into a maximum clique algorithm.

In both cases there are a lot of arbitrary decisions to make. Which pairings should you investigate first? Are there times when you can prune the search space?
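To make that concrete, here is a deliberately naive sketch of the backtracking approach - not the fmcs algorithm, just the general idea. Graphs are plain adjacency dicts with a label per node (think atom symbols); the search grows a partial atom pairing one pair at a time and backtracks when no consistent extension exists.

    # A naive sketch of backtracking MCS between two labeled graphs.
    # Not the fmcs code; it is exponentially redundant, but it shows where
    # the arbitrary choices come from: which unmapped pair to try next, and
    # when to give up on a branch.
    def common_subgraph_size(g1, labels1, g2, labels2):
        best = [0]

        def extend(mapping):
            best[0] = max(best[0], len(mapping))
            used1, used2 = set(mapping), set(mapping.values())
            for a in g1:
                if a in used1:
                    continue
                for b in g2:
                    if b in used2 or labels1[a] != labels2[b]:
                        continue
                    # The new pair must agree with every mapped pair:
                    # a-x bonded exactly when b-y bonded.
                    if all((x in g1[a]) == (y in g2[b])
                           for x, y in mapping.items()):
                        mapping[a] = b
                        extend(mapping)
                        del mapping[a]

        extend({})
        return best[0]

    # propane vs. isobutane as tiny hand-built graphs: the answer is 3
    g1 = {0: {1}, 1: {0, 2}, 2: {1}}
    g2 = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}
    labels = {i: "C" for i in range(4)}
    print(common_subgraph_size(g1, labels, g2, labels))

The maximum clique formulation encodes the same thing: its nodes are exactly those atom pairings, its edges are that consistency check, and the answer is the largest clique.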

There are a lot of heuristics, and some of them are only triggered under unusual circumstances, after a long chain of combinatorial possibilities. Outside of a few minor components, I couldn't figure out how to write unit tests for the code.

I ended up pushing most of the testing into validation testing of the complete software, which meant I wrote some 1,000+ lines of code without strong testing. Moreover, I used a new approach, so the algorithm I was working on doesn't have the track record of the standard clique-based or backtracking approaches.

So the stress I faced was a combination of not knowing if the algorithm would work, and not being able to develop enough test cases to provide some validation during the development.

As a rule of thumb, it's easiest to fix bugs which are caught early. Unit tests and evolutionary prototyping are two of the techniques that people use to tighten the feedback loop between specification, implementation, and validation. I think my stress level is proportional to the size of the feedback loop.

What testing did I do?

I mean, I did have tests during development. I came up with a few examples by hand, and I did a substructure search of a large data set and verified that the MCS code found that substructure. But I know that's not enough tests. I know this because after six weeks of development and over 1,000 lines of code, I spent another two weeks doing a large amount of post-development testing, and found several bugs.
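That substructure check is easy to express as an end-to-end test. Here's the shape of it, where find_mcs is a stand-in for whatever MCS entry point is being tested (not the real fmcs API) and is assumed to return a SMARTS string:

    # Sketch of the substructure-based validation idea, using RDKit.
    from rdkit import Chem

    def check_known_substructure(smiles_list, substructure_smarts, find_mcs):
        """Every input contains the given substructure, so the reported MCS
        must match every input and be at least that large."""
        query = Chem.MolFromSmarts(substructure_smarts)
        mols = [Chem.MolFromSmiles(s) for s in smiles_list]
        assert all(mol.HasSubstructMatch(query) for mol in mols), "bad test setup"

        mcs = Chem.MolFromSmarts(find_mcs(mols))   # assumed to return SMARTS

        # The reported MCS must itself match every input ...
        assert all(mol.HasSubstructMatch(mcs) for mol in mols)
        # ... and cannot be smaller than a substructure known to be common.
        assert mcs.GetNumAtoms() >= query.GetNumAtoms()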

I did most of my tests based on the ChEMBL data: random pairs of structures; the k=2, k=10, and k=100 nearest neighbors; and the k<=100 neighbors with similarities of at least 0.95, 0.9, and 0.8. At the end I also did tests based on the ChEBI structure ontology.

What bugs did I find? I think it's educational to characterize a few of them.

Typo caught by an assertion check

One of the simplest bugs was a poorly formatted string: I used "%d" when I should have used "%%%d". A bit of jargon for those in the know: I generate SMARTS strings for the intermediate subgraphs. If there are more than 9 open rings then the syntax goes from a single-digit closure number to a closure number like "%10". I forgot to include the '%' for that case, probably because a '%' was already there in the format string.
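For the curious, the helper looked something like this. This is a reconstruction of the kind of function involved, not the actual fmcs code:

    # SMARTS/SMILES ring closures 1-9 are written as a bare digit; closures
    # 10 and up need a leading '%', as in "%10".
    def add_closure(smarts, closure):
        if closure < 10:
            return smarts + "%d" % closure
        return smarts + "%%%d" % closure   # the forgotten '%' ("%d" alone gives "10")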

This wasn't triggered by my random-pairs test or my various similarity-search-based tests. Only when I ran through the ChEBI data did I get an assertion failure, when RDKit refused to accept my SMARTS string. That was the first time I had a SMARTS with 10 open rings.

As it happens, this error could have been caught by one style of unit testing. It's a four-line function which takes a string and a number. I didn't test it because I feel that testing directly against internal functions inhibits refactoring; I prefer to test against higher-level, more stable APIs.

It's usually easy to come up with high-level test cases which trigger a certain function call. But not in this case. The MCS search algorithm, while deterministic, uses so many arbitrarily defined values that I couldn't on my own come up with a test case with at least 10 open ring closures. And even if I did, a new search heuristic might change things so that only 7 open ring closures were needed.

I felt that my validation testing and the assertion check would be enough to identify if there was a problem. A low-level unit test might have helped, especially as I still don't have a specific test for that failure.
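Such a low-level test would only be a few lines. A sketch, using the hypothetical add_closure helper from above:

    import unittest

    # Stand-in for the internal helper; a real test would import it from the
    # module under test instead of redefining it here.
    def add_closure(smarts, closure):
        return smarts + ("%d" % closure if closure < 10 else "%%%d" % closure)

    class TestClosureFormatting(unittest.TestCase):
        def test_single_digit_closure(self):
            self.assertEqual(add_closure("C1CC", 2), "C1CC2")

        def test_two_digit_closure_gets_percent(self):
            # The case the assertion failure eventually uncovered.
            self.assertEqual(add_closure("C", 10), "C%10")

    if __name__ == "__main__":
        unittest.main()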

Cross-validation testing

The first MCS papers are about as old as I am. Some of the other implementations are available at no cost and with no prohibitions on using them to develop a new MCS algorithm. (Some commercial companies don't like you using their software to write new software which competes with it, or even using their software for benchmark comparisons.)

I tested pairs of structures against SMSD, and I did more extensive tests against Indigo's MCS implementation.

This is cross-validation testing. It's a relatively rare technique because the cost of producing multiple identical implementations usually isn't worth the benefits. Even here the results aren't exactly identical because of differences in how the toolkits perceive chemistry, and more specifically, aromaticity. I ended up spending a lot of investigation time staring at cases with different answers and trying to figure out if it was a chemistry problem or an MCS algorithm implementation problem.
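The harness itself is simple; the time goes into inspecting the disagreements. A sketch, where find_mcs_a and find_mcs_b are placeholders for the two implementations being compared and are assumed to return the MCS atom count for a pair:

    import random

    # Cross-validation sketch: compare two MCS implementations on random
    # pairs and report the disagreements for manual inspection.
    def cross_validate(mols, n_pairs, find_mcs_a, find_mcs_b, report):
        for _ in range(n_pairs):
            mol1, mol2 = random.sample(mols, 2)
            size_a = find_mcs_a(mol1, mol2)
            size_b = find_mcs_b(mol1, mol2)
            if size_a != size_b:
                # A disagreement is not automatically a bug; aromaticity and
                # other chemistry perception differences have to be ruled out.
                report(mol1, mol2, size_a, size_b)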

I found that SMSD had a bug in one of its options, which I reported. The code had been fixed internally but not pushed to the outside world. Its default mode and fmcs matched quite well, except for a couple of chemistry differences. The new version is out; I need to test it again.

The only problem I found in the Indigo code was a part of their setup code which didn't check the timeout. That's also been fixed after I reported it.

The cross-validation with Indigo found problems in my code. For example, I was often getting smaller MCSes than Indigo did. After looking at them, I figured out that my code didn't correctly handle the case where a molecule was fragmented after a bond was removed because its bond type wasn't in all of the structures.

My code usually got the right answer when using highly similar structures, for obvious reasons. It was only the random pair testing where the problem really stood out.

It's not easy to build a specific test case for this error. I don't know which of the two fragments will be first - it depends on so many arbitrary decisions which could change as the software is improved.
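The shape of the fix, though, is straightforward: once bonds with unshared bond types are removed, the remaining graph may be disconnected, and every connected component has to be kept as a candidate seed rather than just the first one found. A sketch of the component enumeration (plain Python, not the fmcs internals):

    # adjacency: {atom index: set of neighbor indices} after bond removal.
    def connected_components(adjacency):
        seen = set()
        components = []
        for start in adjacency:
            if start in seen:
                continue
            stack, component = [start], set()
            while stack:
                atom = stack.pop()
                if atom in component:
                    continue
                component.add(atom)
                stack.extend(adjacency[atom] - component)
            seen |= component
            components.append(component)
        return components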

Bad heuristics

Most of the implementation contains heuristics. There's a branch and bound search, there's subgraph canonicalization to reduce the number of substructure tests, and so on. Each of these is supposed to help make the code faster.

One of the tests takes the current subgraph and the list of "outgoing" bonds to see how much of the remainder of the graph is accessible for growth, using that subgraph as a seed. It took a couple of days to think of the algorithm, write the code, and get it working. I then did the timing tests, only to find out it was 1% faster.

After looking at the algorithm a lot, I finally realized that I had forgotten to exclude additional bonds from consideration. With two changed lines, the performance was 30% faster.
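In outline, the bound looks something like this - a sketch of the idea, not the fmcs code - with the kind of check that was missing marked in the comment:

    # Upper bound on how many more bonds a seed could still gain, found by
    # walking outward from the seed's outgoing bonds. bond_neighbors maps
    # each bond to the set of bonds sharing an atom with it.
    def growth_bound(seed_bonds, outgoing_bonds, bond_neighbors, excluded_bonds):
        reachable = set()
        stack = list(outgoing_bonds)
        while stack:
            bond = stack.pop()
            if bond in reachable or bond in seed_bonds or bond in excluded_bonds:
                continue        # the "excluded_bonds" test is the kind of two-line fix
            reachable.add(bond)
            stack.extend(bond_neighbors[bond])
        return len(reachable)

    # Branch and bound: if len(seed_bonds) + growth_bound(...) cannot beat the
    # best size found so far, this seed can be pruned without further growth.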

I think I spent two weeks running test upon test. I tried random pairs of compounds pulled from ChEMBL, I tried k-nearest neighbors, and I tried all similar structures (up to k=100) at different similarity thresholds. Even then, I still found a bug when I switched over to exploring the ChEBI structure ontology. If unit testing co-exists with development, and integration tests are downstream from that, then cross-validation is solidly in the "verification" stage. The MCS problem is NP-hard, and there are a huge number of heuristics, some of them only triggered under unusual circumstances.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me


