[ previous | newer ]     /home/writings/diary/archive/2012/05/21/testing_hard_algorithms

## Testing hard algorithms

I developed software to find a maximum common subgraph (MCS) given a set of molecules represented as a chemical graph. It's called fmcs. My previous three essays were about the background of the MCS problem, introducing fmcs, and an example of when MCS is used.

What I didn't describe was the mental effort it took to develop this program. This is the second time I've written code to find the multiple structure MCS, and both times it took a couple of months and put my head in a very strange place. You would think the second time is easier, but it means that I spent more time adding features and doing things that my first version couldn't begin to handle. (Did someone just mutter "second system effect"? Pshaw!)

This time too I had a better understanding of the development process. I think I know why it's so much harder than most of the software I develop.

In this essay, I reflect on some of the reasons why this is a hard problem to test and I consider how unit tests, or any other incremental-based testing approach, are not well-suited to a certain class of complex algorithm development. While unit tests can provide a basic sanity check during development, they fail to provide the test coverage which one might expect from applying them to algorithmically simpler modules. Further, development methodologies built primarily on unit tests, like test-first style test-driven development, aren't that applicable. Instead, other methods, like system testing and a deep understanding of the algorithm and possible problems, are required. This strengthens my view, explored in Problems with TDD that "TDD is a weak method for developing those tests of confidence."

### How would you implement an MCS algorithm?

Think about the problem for a moment. How would you write an algorithm to find the largest common subgraph between two graphs?

Take your time. This really is a hard problem. I dug up some of the early papers from the Journal of Computer Documentation. People ended up simplifying the problem by, for example, not supporting rings.

Finished thinking?

You probably came up with a graph walking algorithm which tries different match pairings and uses backtracking or lazy evaluation to search all of graph space. The more mathemtically inlined might have converted this into a maximum clique algorithm.

In both cases there are a lot of arbitrary decisions to make. Which pairings should you investigate first? Are there times when you cann prune the search space? Did you make any assumptions about the nature of the chemical graph (eg, that it's topologically a planar graph)?

### How would you test your MCS algorithm?

Back in the late 1990s, I almost never wrote automated tests, and never wrote an extensive test suite. Nowadays I'm pretty thorough, and use coverage-guided techniques to get good, extensive tests through some semi-permanent API layer that will allow me to refactor the internals without changing the tests.

I couldn't do that here.

There are a lot of heuristics, and some of them are only triggered under unusual circumstances, after a bunch of combinitorial possibilities. Outside of a few minor components, I couldn't figure out how to write unit tests for the code.

I ended up pushing most of the testing into validation testing of the complete system, which meant I wrote some 1,000+ lines of code without strong testing. Moreover, I used a new approach to the MCS problem, so the algorithm I was working on doesn't have the track record of the standard clique-based or backtracking approaches.

So stress factors included not knowing if the algorithm would work, and not being able to develop enough test cases to provide good validation during development.

As a rule of thumb, it's easiest to fix bugs which are caught early. Unit tests and evolutionary prototyping are two of the techniques that people use to tighten the feedback loop between specification, implementation, and valiation. I think another stress factor is propotional to the size of the feedback loop.

## What testing did I do?

I mean, I did have tests during development. I came up with a few examples by hand, I did a substructure search of a large data set and I verified that the MCS code found that substructure, but I know that's not enough tests. I know this because after six weeks of development and over 1000 lines of code, I spent another three weeks doing a large amount of post-development testing, and found several bugs. In the process of writing this essay I also found that four days of that development work ended up making things slower, so I'll have to remove it.

I did most of my tests based on the ChEMBL data: 10,000 random pairs of structures, the k=2, k=10, and k=100 nearest neighbors, and the k<=100 neighbors with similarities of at least 0.95, 0.9, and 0.8. I also did, at the end, tests based on the ChEBI structure ontology. There were easily 20,000 different machine-generated test cases, although in most cases I didn't know the expected results beforehand.

## What bugs did I find?

What bugs did I find? I think it's educational to characterize a few of them.

### Typo caught by an assertion check

One of the simplest bugs was a poorly formatted string. I used "%d" when I should have used ""%%%d". A bit of jargon for those in the know; I generate SMARTS strings for the intermedate subgraphs. If there are more than 9 open rings then the syntax goes from a single digit closure number to a closure number like "%10". I forgot to include the '%' for that case, and probably because the '%' was already there for the format string.

This wasn't triggered by my random-pairs test nor my various similarity-search based tests. Only when I ran through the ChEBI data, did I get an assertion failure when RDKit refused to accept my SMARTS string. That was the first time where I had a SMARTS with 10 unclosed rings.

As it happens, this error could have been caught by a unit testing, as some people practice unit tests. It's a four line function which takes a string and a number. Testing it is trivial. I didn't test it because I feel that testing directly against internal functions inhibits refactoring. I prefer to test against higher-level, more stable APIs.

My view is that it's usually easy to come up with high-level test cases which trigger a certain function call. But not in this case. The MCS search algorithm, while deterministic, uses so many arbitrarily defined values that I couldn't on my own come up with a test case with at least 10 open ring closures. And even if I did, a new search heuristic might change things so that only, say, 7 open ring closures were needed.

I felt that my system testing and the assertion check would be enough to identify if there was a problem, and it did. A low-level unit test might have helped, especially as I still don't have a specific test for that failure.

I think the right thing to do is add that failure as a specific test case, and use monkey-patching to insert wrong code for a repeat of that test case. The first one tests that the code is correct, and the second tests that the test case is still exercising the code under test.

### Cross-validation testing

The first MCS papers are about as old as I am. Many people have written implementions, although relatively few are are available to me both at no cost and with no prohibitions on using it to develop a new MCS algorithm. (Some commercial companies don't like people using their software to write new software which is competitive to it, or even to use their software for benchmark comparisons.)

I tested pairs of structures against SMSD and I did more extenstive tests against Indigo's MCS implementation.

This is cross-validation testing. It's a relatively rare technique because the cost of producing multiple identical implementations usually isn't worth the benefits. Even here the results aren't exactly identical because of differences in how the toolkits perceive chemistry, and more specifically, aromaticity. I ended up spending a lot of investigation time staring at cases with different answers and trying to figure out if it was a chemistry problem or an MCS algorithm implementation problem.

I found the SMSD had a bug in one of its options, which I reported back to the author. The code had been fixed internally but not pushed to the outside world. Its default mode and fmcs matched quite well, except for a couple of chemistry differences. The new version is out now - I need to test it again.

The only problem I found in the Indigo code was a part of their setup code which didn't check the timeout. That's also been fixed after I reported it.

The cross-validation with Indigo found problems in my code. For example, I was often getting smaller MCSes then Indigo. After looking at them, I figured out that my code didn't correctly handle the case when a molecule was fragmented after a bond was removed because its bondtype wasn't in all of the structures.

Why didn't my hand-written test cases find it? None of them had a case where there was a bond in the "middle" of a structure chosen as the reference structure, and where the MCS was not in the first fragment.

My code usually got the right answer when using highly similar structures, for obvious reasons. It was only the random pair testing where the problem really stood out.

Could I create a simple unit test for this error? Perhaps, but it's not easy. I don't know which of the two fragments will be first - it depends on so many arbitrary decisions which could change as the algorithm is improved. The only test I can think of for this was to generate a diverse set of tests, make sure some fail if the code isn't implemented correctly, record the results, and make sure that future tests never find a worse (or better?) test case.

### Bad heuristic to determine the maximum possible subgraph growth

Most of the MCS implementation is heuristics. There's a branch and bound search, there's subgraph canonicalization to reduce the number of substructure tests, and so on. Each of these is supposed to help make the code faster.

One of the tests takes the current subgraph and the list of "outgoing" bonds to see how much of the remainder of the graph is accessible for growth from the current subgraph. The rest of the molecule might not be accessible because it's on another fragment, but it also might not be accessible due to an earlier decision to exclude certain parts of the molecule from future growth. (My algorithm tests all subgraphs which include a given bond, then all subgraph which don't include the given bond.)

It took a couple of days to think of the algorithm, write the code, and get it working. I then did the timing tests to find out it was 1% faster, to get the same answers.

Does that mean my code worked? I only had a suspicion that the new algorithm should be faster. Perhaps the overhead of searching for the accessible atoms was too costly?

After looking at my implementation - a lot - I finally realized that I told the algorithm to exclude the subgraph from consideration for growth but had forgotten to also exclude the set of previously excluded bonds from consideration. With two changed lines, the overall performance doubled for the random-pairs case. Most of the time it's faster and sometimes its slower, so it takes an aggregate of tests to measure this correctly.

I don't think this heuristic could have been written as a unit test. No, I take that back. It could have, but it would have required some careful thought to set up. Not only was this part of the code not "built for test", but setting up the right conditions requires a mental understanding of the problem which I know I didn't have when I was doing the development.

BTW, I am not saying that unit tests can't be used to measure performance. Some algorithms have simple timing characteristics where it's easy to say that the code must complete by a given time, or that it must be at least twice as fast as a reference implementation. Occasionally these tests will hiccup due to unusual loads on the test machine, but not usually enough to be a problem.

Indeed, the fmcs code has some unit tests for the timeout code. I found a pair of structures which takes over 10 seconds to find the MCS, I set the timeout to 0.1 second, and assert that no more than 0.5 seconds elapsed during the function call. (I don't require a high precision for this code.) I do worry though that future changes to the MCS search code might speed things up by a faster of 20. On the other hand, the effect of broken timeout code is very obvious when running the validation suite, so I'm not going to worry about it.

As is widely acknowledged, this stretches the idea of a "unit test." Unit tests are supposed to be fast; preferably several hundred or more per second. The goal is that these tests run often, perhaps even after every save. Stick in a few 0.1 second timeouts into the system and it bogs everything down, which discourages people from running the tests so often.

But let's go back to this new code. On average, over a large number of tests, the performance is 50% faster. What's a good test case? Can I pick one structure or a small set of structures? Does the speedup occur only when there are large molecules? Only for molecules with lots of internal symmetry? Only for those which are easily fragmented?

Only now, when writing this essay, did I find good test cases. When I use a 10 second timeout using the k=10 nearest neighbors tests, I get 8 timeouts in the first 32 records using the old algorithm, and only 1 timeout using the new algorithm. The time for the very first test goes from over a minute (I finally gave up) using the old algorithm to 0.08 seconds using the new one.

Obviously (in retrospect) the right solution is to pull out a couple of those from my validation suite and turn them into test cases.

I never would have come up with these test cases before implementing the new code - up until ten minutes ago I thought that the new code only added a factor of two the code, not possible factors of 1,000!

## Canonicalization: an in-depth analysis

I'm not finished with testing. I haven't fully characterized all the parts of the implementation. For that matter, there are heuristics I can still add. Here I'll describe what I did to analyze subgraph SMARTS canonicalization. I conclude by finding that while it works, it slows down the code and several days of development work ought to be removed.

My MCS algorithm enumerates subgraph of one structure, converts that into a SMARTS string, and tests if the pattern exists in other structures.

There are many alternate ways to make a SMARTS string from a subgraph. Given the linear chain of C O N, there's the reasonable CON or NOC variations, the more confusing O(C)N and O(N)C, and crazy variations like C1.N1O and O%987.N7.C%98. It's easy to make the non-insane versions be generating a spanning tree. The question is where to start the search and how to handle branches.

I believe that there are many duplicate patterns in a query. For example, a benzene ring will generate many "ccc" queries. I can minimize the number of substructure matches by caching earlier SMARTS match results. But caching doesn't work if sometimes I get a CON and sometimes I get a NOC. The solution is to generate a "canonical" SMARTS for the subgraph - a unique representation for all of the variations.

I implemented the CANGEN algorithm from SMILES. 2. Algorithm for generation of unique SMILES notation by Weininger, Weininger, and Weininger J. Chem. Inf. Comput. Sci., 1989, 29 (2), pp 97-101 in order to choose a canonical SMARTS. (But see Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples, Neglur, Grossman, and Liu, 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), La Jolla.)

Canonicalization is notoriously hard to get right. It too has a lot of tricky corner cases. For example, in the late 1990s Daylight tracked down a reported bug in their canonicalization implementation. They algorithm requires a stable sort algorithm, but they used an unstable sort. Their own internal testing never found the problem - it was a customer who found the Solaris and IRIX boxes occasionaly returned different values.

There could be similar bugs in my canonicalization algorithm. I don't know. The usual solution is to generate large numbers of test cases, randomize the order of the atoms and bonds, and verify that all of them are the same. Here too it may be that only certain rare topologies, or rare bond patterns, triggers an error. I'm not convinced that I could come up with the right set of test cases for it.

What makes it harder is that the underlying search algorithm doesn't need a canonical SMARTS, so a canonicalization failure doesn't ever show up in the output. I could have an infrequent bug and not notice it! Canonicalization is only there for performance reasons, but my implementation is in Python, while the substructure matching is in C++, so it might even be that the canonicalization time might be more than the time saved.

I can make that a bit more nuanced. Canonical SMARTS generation has three phases: 1) initial ranking, 2) canonicalization/tie-breaking, and 3) SMARTS generation. I can disable the canonicalization step and get a "semi-canonical" SMARTS (my initial ranking is more like a circular fingerprint of diameter 3 than the atom characteristic from Weininger et al.). I can also disable caching. Both of these are doable with only a couple of changes to the code.

This leads to three cases: canonical SMARTS with caching, semi-canonical SMARTS with caching, and semi-canonical SMARTS without caching. (There should be a fourth step, which is arbitrary SMARTS, but that requires more extensive changes.)

What I want to understand most is the effect of canonicalization on my code. Again, I have to resort to aggregate timings and statistics across a set of benchmarks. I used the ChEMBL-13 data set and generated various benchmarks. One is based on computing the MCS between 10,000 pairs selected at random, another contains the k=2, k=10, and k=100 nearest neighbors of randomly chosen fingerprints (timing 500 data sets each time), and the last is the total of 500 tests of the k<=100 compounds with Tanimoto score of at least 0.95 to randomly selected fingerprints. My results (times are reproducible to within a few percent) include the number of unique SMARTS (canonical or non-canonical) generated and the total number of substructure tests ("SS") which were carried out.

Random pair timings# unique SMARTS# SS tests
no cacheTotal: 10000/467.3s (21.4/s) Complete: 9997/437.3s (22.9/s) Incomplete: 3/30.0s9216661339292
semi-canonicalTotal: 10000/421.3s (23.7/s) Complete: 9997/391.3s (25.6/s) Incomplete: 3/30.0s921666921666
canonicalTotal: 10000/442.1s (22.6/s) Complete: 9997/412.0s (24.3/s) Incomplete: 3/30.0s709636709636
k=2 nearest neigbhors# unique SMARTS# SS tests
no cacheTotal: 500/287.3s (1.7/s) Complete: 484/127.1s (3.8/s) Incomplete: 16/160.2s264057320238
semi-canonicalTotal: 500/276.7s (1.8/s) Complete: 484/116.6s (4.2/s) Incomplete: 16/160.2s264057264057
canonicalTotal: 500/298.6s (1.7/s) Complete: 483/128.4s (3.8/s) Incomplete: 17/170.2s186682186682
k=10 nearest neigbhors# unique SMARTS# SS tests
no cacheTotal: 500/520.5s (1.0/s) Complete: 471/230.0s (2.0/s) Incomplete: 29/290.5s4476483525563
semi-canonicalTotal: 500/490.1s (1.0/s) Complete: 472/209.5s (2.3/s) Incomplete: 28/280.6s4476482872040
canonicalTotal: 500/509.3s (1.0/s) Complete: 471/218.7s (2.2/s) Incomplete: 29/290.6s3300102109332
k=100 nearest neigbhors# unique SMARTS# SS tests
no cacheTotal: 500/414.4s (1.2/s) Complete: 486/271.4s (1.8/s) Incomplete: 14/143.0s1287819932263
semi-canonicalTotal: 500/363.5s (1.4/s) Complete: 487/230.8s (2.1/s) Incomplete: 13/132.7s1287817456877
canonicalTotal: 500/361.8s (1.4/s) Complete: 488/239.5s (2.0/s) Incomplete: 12/122.4s1076486196764
k<=100 at or above Tanimoto threshold of 0.95# unique SMARTS# SS tests
no cacheTotal: 500/642.1s (0.8/s) Complete: 458/220.3s (2.1/s) Incomplete: 42/421.9s4686614388074
semi-canonicalTotal: 500/624.5s (0.8/s) Complete: 461/232.8s (2.0/s) Incomplete: 39/391.7s4686613828813
canonicalTotal: 500/640.9s (0.8/s) Complete: 460/239.2s (1.9/s) Incomplete: 40/401.8s3648023222366

Whew! That's a lot of data to throw at you. Please believe me when I say that it took two days to generate correctly. BTW, "complete" means that the MCS search algorithm went to completion, the "incomplete" means that it timed out - here after 10 seconds - and gave a partial solution.

What does this tell me? Obviously, caching is good. As expected, the "semi-canonical" solution is always better than the "no cache" case, and the canonicalization always reduces the number of substructures and substructure tests. This means the canonicalization code is doing something right.

Unfortunately, it seems like canonicalization has a big performance impact. In the pairwise tests I do 922,000 canonicalizations in Python to save 212,000 substructure tests, and I lose 21 seconds in doing so. For the k=100 nearest neighbor benchmark, which is the only one where the canonicalization code saves time, I do 129,000 canonicalizations to save 1,260,000 comparisons, and gain about 2 seconds. This suggests that each canonicalization takes as long as 10 substructure matches, and that RDKit can do 60,000 SMARTS matches per second. A quick checks using one SMARTS gives 100,000 SMARTS matches per second, which helps validate this estimate.

This means I should take out the canonicalization until it can be implemented in C++. Granted, there may be a bug in the code, like there was with an earlier hueristic. But there would have to be an order of magnitude performance increase for this to be effective, and I don't think that's likely.

## Unit tests aren't enough to drive development

Which leads me back to my thesis. It would have been possible to develop a few more unit tests for the canonicalization code. There would have been some extra scaffolding in order to do that, but that's a minor cost. I suspect that many of the tests would be very implementation-dependent, and tied to specific internal function calls. I don't like this because it means that a re-implementation of this internal component in C++, which submerges many of the internal helper functions and places them out of reach of Python-based unit tests, would cause the unit tests to be un-runnable. And I am certain that those unit tests would not be rigorous enough to be confident that the code was working correctly.

Think about the question "should I write this code in Python or C++?". It's a development-time question. Test Driven Development (TDD) is a development methodology which uses unit tests to help make development-time decisions. I think TDD is most helpful when used to establish the minimum requirements for a project.

I don't see how TDD is helpful here. I can't think of any unit tests which would guide this development decision. Sometimes you can write a "spike solution" (also called a "prototype"):

A spike solution is a very simple program to explore potential solutions. Build the spike to only addresses the problem under examination and ignore all other concerns.
This sounds good, but what development style should you use to develop the spike solution? Spikes are supposed to be throw-away code, but I can't figure out how something other than a complete, tested implementation of the canonical algorithm or the remaining growth heuristics would have been useful. After all, a two line change in the latter went from "working but 1% faster" to "working and 50% faster", and I didn't even know if there was going to give a speedup in the first place.

## If not TDD, what methodology do I use?

Read Peter Seibel's Unit testing in Coders at Work. The author interviewed various famous programmers and learned more about how they tested their software. You may also want to read the related comments on Hacker News.

Seibel recounts that Joshua Bloch "wrote a monstrous 'basher'" which found "that occasionally, just occasionally, the basher would fail its consistency check." Bloch then wrote tests to pin down the failure, eventually finding the fault in a system mutex. Note that under normal unit test development you assume that your underlying components are correct, so this isn't something you would normally code for.

Seibel also reports that others assemble a large amount of test cases and uses that to check their code. This is of course the technique I used for the MCS problem.

It feels as trite as saying that the secret to losing weight is diet and exercise, but my method for programming is to understand the problem, implement it, and test it until you are confident that it's good enough. That knowledge of how to do that comes from experience, practice, reflection, and discussion.

For those who scoff and call this "big design up front," I merely point you to earlier parts of this essay. When I started this code I had a rough idea that it would work. I had previously implemented substructure enumeration, and the Weininger et al. canonicalization algorithm and SMARTS generation, so I knew that the components were possible, but I didn't know how they went together. Instead, I let the code development itself guide me. I thought some, I implemented code, I "ran the code in my head", I reflected on what the tricky parts were, I thought about how to make them more clear, and I tried various was to improve the code.

That was a lot to keep in my head, I wasn't sure if the result would be fast enough, and I had no way good way to test its usefulness until most of the code was in place. No wonder it was mentally taxing!

I believe tests - including unit tests implemented during development - are important. I don't believe that unit tests are good enough to guide complex algorithm development. However, most programming (over 95%?) is not complex algorithm development; complicated? yes, but not complex.

I once jokingly said that TDD is not useful if there's two or more embedded for loops. That's a bit of a simplification, but not far from my feelings. Some classes of problems, like complex algorithm development and security analysis, require a different attitude towards programming, emphasizing "what can go wrong?" and "how can I improve my confidence that the code is working correctly?" It's my view that this doubt-based philosophical attitude is missing from most discussions of software development practices.

And perhaps as fundamental, when I have to be that critical of my own work and probe it for failures and be open to the possibility of nasty gotchas lurking in the deep dark corners, then some of that doubt passes over into my personal life. I empathize with the idea of not living in that "strange place" all the time, and I can see why this isn't a common attitude.

## Comments

This essay is meant to be a thoughtful reflection on the difficulties of programming, using the MCS problem to provide specific structure. If you want to leave comments about the MCS portion then use the MCS comment site. Otherwise, leave a comment on testing hard algorithms.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2013 Andrew Dalke Scientific AB