
Faster fingerprint substructure tests

At OpenEye's CUP conference last week one of the presenters mentioned using fingerprints as a way to screen out compounds for their search algorithm. That reminded me of an algorithm I developed a few years ago for speeding up fingerprint searches. It's very simple and I didn't want to do the research to see if it was publishable. I did get Daylight to try it out and they reported a 10-20% performance increase in their tests. I formally declared that this idea is in the public domain and that I assert no rights over the code whatsoever. I suspect Daylight's using it. After you read this you can too. Please let me know if you use it and whether it makes a performance difference.

In chemical informatics a fingerprint is a characteristic bit string for a compound. There are several ways to generate fingerprints; Mesa has summarized them, so I refer you there. These fingerprints have the property that if a compound has some chemical property "X" then one or more bits of the bit string are turned on. If two compounds have many similar properties, and those properties are captured by the fingerprint scheme, then their fingerprints will be similar. The converse may not be true, depending on the scheme.

This lets fingerprints be used as one way to test the similarity of two compounds. Generate their fingerprints and compare the number of on bits in common, the number of off bits in common, and the number of bits on in one fingerprint but not the other. These counts can be combined in several ways, giving rise to different similarity measures, including Euclidean, Tversky and Tanimoto.
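
For example, the Tanimoto similarity is the number of on bits the two fingerprints have in common divided by the number of bits on in either one. Here's an untested sketch of that calculation, assuming the fingerprints are stored as arrays of 32-bit words (the layout used later in this essay) and using the GCC/Clang __builtin_popcount extension for the bit counting:

double tanimoto(unsigned int Nw,
                const unsigned int *FPA, const unsigned int *FPB) {
  unsigned int j, common = 0, either = 0;
  for (j = 0; j < Nw; j++) {
    common += __builtin_popcount(FPA[j] & FPB[j]);  /* on bits in both */
    either += __builtin_popcount(FPA[j] | FPB[j]);  /* on bits in either */
  }
  if (either == 0) return 1.0;  /* both fingerprints empty; call them identical */
  return (double) common / either;
}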

Fingerprints are also used as a filter in doing superstructure and substructure tests. Without loss of generality, assume a substructure is given as a query to find all the compounds which are superstructures of that substructure. By design the on bits of the fingerprint of the substructure must also be on bits of any superstructure. Computers are wicked fast at doing bit tests so this can quickly winnow out obvious rejects. Some systems will use several screens like this, based on different fingerprint schemes, to remove even more compounds from consideration. The remaining candidates are then tested using more precise but slower algorithms. The final test is a subgraph isomorphism test.

Let FPA be the fingerprint of the substructure, also called the query fingerprint. Let FPi be the fingerprint of compound i, where Nfp is the total number of fingerprints and 0<=i<Nfp. Then the fingerprint test function looks something like this in Python:

NOTE: None of this example code was tested.

def is_subset(FPA, FPi):
  """True if and only if all on bits of FPA are on bits of FPi"""
  return (FPA & FPi) == FPA
The & is the bitwise-and operator in Python and C. It isn't used that often in Python. This test works because the result of (FPA & FPi) can have no more bits set than FPA does, and if FPA has an on bit where FPi has an off bit then the bitwise-and has an off bit there, making the result differ from FPA. For example, if FPA = 0b0110 and FPi = 0b1110 then FPA & FPi is 0b0110, which equals FPA, so the test passes; with FPi = 0b1010 the result is 0b0010, which does not equal FPA, so the test fails.

If you're interested in performance you'll want to do the test at the C level. Here's the roughly equivalent code. It needs a few more variables than the Python code. A fingerprint contains N bits, and in the following N will always be a multiple of the word size. Assume the word size is 32 bits; then the number of words is Nw = N/32. All arrays are word aligned.

int is_subset(unsigned int Nw,
              unsigned int *FPA,
              unsigned int *FPi) {
  int j;
  for (j=0; j<Nw; j++) {
    if ((FPA[j] & FPi[j]) != FPA[j]) {
      return 0;
    }
  }
  return 1;
}
Note that because this deals with a word at a time, the algorithm may be able to return 0 (for false) early, without testing all the words in the fingerprint. Compare that to Python, where the full bitwise-and is computed and stored in a temporary variable before doing the equality test.

(For even more performance you could write the function as

int is_subset(unsigned int Nw,
              unsigned int *FPA,
              unsigned int *FPi) {
  int j;
  for (j=Nw; j>0; j--) {
    if ((*FPA & *FPi) != *FPA) {
      return 0;
    }
    FPA++; FPi++;
  }
  return 1;
}
but for this essay I'm aiming more for clarity than performance.)

The full screen to identify fingerprints which pass the test will look like this. I'll assume that the fingerprints to search are arranged in one memory block so that the first fingerprint is at FP0, the second at FP0+Nw, the third at FP0+Nw*2, etc.

void find_possible_compounds(unsigned int Nw,
                             unsigned int *FPA,
                             unsigned int Nfp, unsigned int *FP0) {
  unsigned int *FPi = FP0;
  for (int i=0; i<Nfp; i++) {
    if (is_subset(Nw, FPA, FPi)) {
      /* fingerprint i passes the test; do something here */
    }
    FPi += Nw;
  }
}
For even better performance we could inline the is_subset call and reverse the direction of the for-loop to test against 0 instead of Nfp, though I assume that last hand-written optimization doesn't make a difference with modern compilers.
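
Here's an untested sketch of what that inlined, reversed-loop version might look like; the function name is just for illustration:

void find_possible_compounds_inlined(unsigned int Nw,
                                     unsigned int *FPA,
                                     unsigned int Nfp, unsigned int *FP0) {
  unsigned int *FPi = FP0;
  unsigned int i, j;
  for (i = Nfp; i > 0; i--) {      /* count down so the loop tests against 0 */
    for (j = 0; j < Nw; j++) {
      if ((FPA[j] & FPi[j]) != FPA[j]) {
        break;                     /* this word filtered the fingerprint out */
      }
    }
    if (j == Nw) {
      /* fingerprint passes the test; do something here */
    }
    FPi += Nw;
  }
}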

We can make this faster.

It turns out that some of the words in the fingerprint are better at filtering than other words. The 3rd word might be the best at filtering, then the 4th, then the 1st then the 2nd. If we know the order then we can modify the is_subset function to test the best words first, like this:

int is_subset(unsigned int Nw, unsigned int *order,
              unsigned int *FPA,
              unsigned int *FPi) {

  int j;
  for (j=0; j<Nw; j++) {
    if ((FPA[order[j]] & FPi[order[j]]) != FPA[order[j]]) {
      return 0;
    }
  }
  return 1;
}

(I learned this trick from Roger Sayle, who presented it at MUG 2000.)

In theory this will do fewer word subset tests, but the code ends up slower than the original code. The extra order[j] lookup takes about the same amount of time as the subset test, so this effectively doubles each test. Rarely is one word discriminatory enough to justify this extra overhead.

To make this algorithm work and be faster than the original you have to remove the extra layer of indirection. You need to compile the order into the is_subset function. For example, you could write the following to a file

int is_subset2301(unsigned int *FPA,
                  unsigned int *FPi) {
  return (((FPA[2] & FPi[2]) == FPA[2]) &&
          ((FPA[3] & FPi[3]) == FPA[3]) &&
          ((FPA[0] & FPi[0]) == FPA[0]) &&
          ((FPA[1] & FPi[1]) == FPA[1]));
}
then use the C compiler to turn it into a shared library. Load the library and call this optimized function for the test. Or you could write your own in-memory assembler and avoid the overhead of needing a working compiler.

(My solution was to use the C compiler. Roger did the in-memory assembler. He's hard core that way. :))
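
Here's an untested sketch of the shared library approach, using the POSIX dlopen/dlsym calls. The library and function names are only for illustration, and I'm assuming the generated file was already compiled with something like "cc -shared -fPIC -O2 is_subset2301.c -o is_subset2301.so":

#include <stdio.h>
#include <dlfcn.h>

typedef int (*subset_fn)(unsigned int *FPA, unsigned int *FPi);

subset_fn load_subset_test(const char *libname, const char *funcname) {
  void *handle = dlopen(libname, RTLD_NOW);
  if (handle == NULL) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return NULL;
  }
  /* look up the order-specific test function compiled into the library */
  return (subset_fn) dlsym(handle, funcname);
}

/* e.g. subset_fn test = load_subset_test("./is_subset2301.so", "is_subset2301"); */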

That done, what's the order to use? It's data dependent, so one option is to order the words such that those with fewer on bits are tested first (accumulating counts over all the fingerprints in the search set). Hopefully these will be more likely to filter than the other words. But suppose the sparsest word is the one which indicates features like "contains uranium", "contains more than 30 rings" and "is cubane". Few drug-like compounds contain these features so that word will mostly contain off bits. But the same is true for the query. This ordering slows down the search because most of the subset tests for that word will end up passing.

Real life isn't that perverse. Roger Sayle tried the above for Daylight's fingerprint scheme and his test set, and he found a 5-10% performance boost. (I need to double-check that number.)
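
For concreteness, here's an untested sketch of one way to build that static ordering: count the on bits at each word position over all the search fingerprints, then sort the word indices so the sparsest words come first. The file-scope count array and __builtin_popcount are assumptions made to keep the example short:

#include <stdlib.h>

static unsigned long long *word_counts;  /* on-bit count per word position */

static int by_fewest_bits(const void *a, const void *b) {
  unsigned int wa = *(const unsigned int *) a;
  unsigned int wb = *(const unsigned int *) b;
  if (word_counts[wa] < word_counts[wb]) return -1;
  if (word_counts[wa] > word_counts[wb]) return 1;
  return 0;
}

void build_word_order(unsigned int Nw, unsigned int Nfp,
                      const unsigned int *FP0, unsigned int *order) {
  unsigned int i, j;
  word_counts = calloc(Nw, sizeof(unsigned long long));
  for (i = 0; i < Nfp; i++) {
    for (j = 0; j < Nw; j++) {
      word_counts[j] += __builtin_popcount(FP0[i*Nw + j]);
    }
  }
  for (j = 0; j < Nw; j++) {
    order[j] = j;   /* start with the identity order, then sort by sparseness */
  }
  qsort(order, Nw, sizeof(unsigned int), by_fewest_bits);
  free(word_counts);
}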

We can make this faster.

It would be better if the ordering could dynamically adapt to the query and the search fingerprints. I came up with one tweak that's surprisingly easy. Take a look at the code before reading the explanation:

void find_possible_compounds(unsigned int Nw,
                             unsigned int *FPA,
                             unsigned int Nfp, unsigned int *FP0) {
  unsigned int *FPi = FP0;
  unsigned int best_word = 0;

  for (int i=0; i<Nfp; i++) {
    if ((FPA[best_word] & FPi[best_word]) == FPA[best_word]) {
      /* best_word failed to filter fingerprint i, so do the full test */
      if (is_subset(Nw, FPA, FPi)) {
        /* fingerprint i passes the test; do something here */
      }
      /* and assume a different word will be a better filter next time */
      best_word = (best_word + 1) % Nw;
    }
    FPi += Nw;
  }
}

This starts by assuming that the first word is a good filter. It tests that word and, if the word filters the fingerprint out, goes on to the next search fingerprint. If the word fails to filter then it does the full fingerprint test as usual. It then assumes the first word is no longer a good filter and advances to use the second word as the best filter.

If a word is a good filter then this algorithm will preferentially use it as the best test word. Words which are bad filters are quickly skipped. So this algorithm dynamically adjusts to use the best word more often, at the cost of one extra word test. Even that could be eliminated by only testing words 1, 2, 3, ... when best_word == 0, only testing words 0, 2, 3, ... when best_word == 1, etc. I wouldn't do this though, because having that much code is ugly and clumsy for large fingerprints.

One nice feature of the algorithm is that it doesn't require any analysis of the data set. Another is that it adapts to local variation in how well each of the words works as a filter. As Jack Delany pointed out, many data sets contain local order, e.g., the steroids are located together. It might be that the 2nd word is a better filter in that range while the 5th word is a better filter in the rest of the data set.

Of course, if everything is a superstructure of the query structure then the performance will be slower than with the original code, since it needs to do a bit more work. I considered other options besides simply stepping to the next word. I could keep counts of how many times a word was a good filter and use that as a weighting to decide when it's no longer a good filter. But all the ones I could think of added extra overhead, and I didn't get a good feel that the results would be that much better. I tried to model it mathematically but my queuing theory knowledge is rather poor.

Daylight tested my algorithm and found a performance improvement of about 10-20% on their test sets. (I need to double-check that number.) I formally contributed the code to the public domain and I think they added it to both Merlin and the Oracle cartridge.

You are also free to use it, and you don't have to pay me anything or give me credit for it. If you want to be picky about the license, I'm using the Creative Commons definition of public domain.

I would like it if you dropped me a note saying that you tested it out, whether it improved your search performance, and any other observations you might have.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB