Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2020/10/07/intersection_popcount

Even faster in-memory search with intersection popcount

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

  1. Simple FPS fingerprint similarity search: variations on a theme
  2. Simple k-NN FPS Tanimoto and cosine similarity search
  3. Simple in-memory ChEMBL similarity search
  4. Simple BitBound ChEMBL similarity search
  5. Using RDKit BulkTanimotoSimilarity
  6. Faster in-memory ChEMBL search by using more C
  7. Faster BitBound ChEMBL search by using more C
  8. Even faster in-memory search with intersection popcount
  9. Cache and reuse popcount-sorted ChEMBL fingerprints

This is part of a series of essays I started to write a week ago where I use a few different approaches to implement cheminformatics fingerprint similarity search.

My initial file scan implementation took 2 seconds per query. I then switched to a two-stage in-memory search with a load step and a search step. Loading the data (an initialization cost) took 4.6 seconds and per-query search took about 0.5 seconds, making it the more effective option once there are more than about three queries.

Two days ago I sped up in-memory search by moving more of the search into C. This had even higher initialization overhead, at about 7 seconds, but could process 14 queries per second (70 ms/query). Yesterday I re-implemented last week's BitBound algorithm to use more C code. With the threshold of 0.7 I've been using for benchmarking, the new load time was nearly 9 seconds but search performance increased to almost 60 queries per second.

In this essay I'll improve the search performance by reducing the number of popcount calculations needed during the Tanimoto calculation.

If you want to jump right to the code:

  1. Download and uncompress chembl_27.fps.gz.
  2. Download the updated popc.py.
  3. Run python popc.py to (re)generate the _popc module.
  4. Download chembl_bitbound_search_in_C_v2.py.
  5. Run python chembl_bitbound_search_in_C_v2.py --times to do the default caffeine search and see the internal timings.
  6. Run python chembl_bitbound_search_in_C_v2.py --help to see the command-line help, then try your own queries, including with different thresholds.

threshold=0.35

BitBound search speeds up similarity search by skipping popcount bins which cannot possibly contain a match. The higher the threshold, the more bins can be skipped, and the greater the speedup. My benchmarking used a threshold of 0.7 solely so the caffeine search of ChEMBL returns two matches, and not for any deeper reason.

In practice, different fingerprint types have different thresholds where the Tanimoto similarity becomes chemically significant. For the ECFP-like fingerprints (including RDKit's Morgan fingerprint) this can be as low as 0.35 or so. I'll benchmark my full in-memory search program against my BitBound implementation at that more chemically relevant threshold:

% python chembl_in_memory_search_in_C.py  --times --queries wikipedia_1000.fps --threshold 0.35 | sort > old_sorted.out
load: 8.65 search: 69.32 #queries: 994 #/sec: 14.34
% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.fps --threshold 0.35 | sort > new_sorted.out
load: 9.33 search: 43.21 #queries: 994 #/sec: 23.01
% cmp old_sorted.out new_sorted.out
% fgrep -c '*' old_sorted.out
88
% wc -l old_sorted.out
  559225 old_sorted.out
% python -c  "print((559225-88)/994)"
562.5120724346076

(The 88 * lines correspond to queries with no hits.)

With a threshold of 0.35 there are on average 562 hits per query, and the BitBound algorithm is only 60% faster, not the 4x faster we saw with a threshold of 0.7.

Reducing the number of popcounts

Most of the search time is spent in C, in this part of threshold_tanimoto_search() defined in popc.c:

    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        score = byte_tanimoto_256(query_fp, target_fps + target_idx*256);
        if (score >= threshold) {

and most of that time is spent in this part of byte_tanimoto_256():

    for (int i=0; i<num_words; i++) {
        intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
        union_popcount += __builtin_popcountll(fp1_64[i] | fp2_64[i]);
    }

I can speed that up by reducing the number of popcount calculations.

And I can do that with one key observation. If we know that popcount(fp1) = A and popcount(fp2) = B, and we compute C = popcount(fp1 & fp2), then Tanimoto(fp1, fp2) = C / (A + B - C). Since A + B counts the shared bits twice, subtracting C in the denominator gives exactly the union popcount without double-counting.
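A quick Python sketch (toy 8-bit values, not real fingerprints) checks the identity:

```python
# Verify Tanimoto(fp1, fp2) = C / (A + B - C) on two example bit patterns.
fp1 = 0b10110110
fp2 = 0b11100011

A = bin(fp1).count("1")        # popcount(fp1) = 5
B = bin(fp2).count("1")        # popcount(fp2) = 5
C = bin(fp1 & fp2).count("1")  # intersection popcount = 3
U = bin(fp1 | fp2).count("1")  # union popcount = 7

assert U == A + B - C          # the identity behind the optimization
print(C / (A + B - C))         # 0.42857..., same as C / U
```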

Finally, my BitBound implementation already computes the target fingerprint popcounts (to put each fingerprint into the right bin, based on its popcount) and the query fingerprint popcount (in order to get the BitBound limits), so I already have A and B. I just need some way to compute C.

Intersection popcount

I added a new function definition in popc.py, byte_intersect_256(). It computes the intersection popcount between two 2048-bit/256-byte fingerprints. It's a simplification of byte_tanimoto_256() so I think I can simply show it without explanation:

static int
byte_intersect_256(const unsigned char *fp1, const unsigned char *fp2) {
    int num_words = 2048 / 64;
    int intersect_popcount = 0;

    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    const uint64_t *fp1_64 = (const uint64_t *) fp1, *fp2_64 = (const uint64_t *) fp2;

    for (int i=0; i<num_words; i++) {
        intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
    }
    return intersect_popcount;
}
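For comparison, a pure-Python equivalent (a hypothetical helper, not part of popc.py) computes the same value by treating each fingerprint as one big integer:

```python
def byte_intersect(fp1: bytes, fp2: bytes) -> int:
    """Pure-Python intersection popcount of two equal-length byte strings."""
    n1 = int.from_bytes(fp1, "big")
    n2 = int.from_bytes(fp2, "big")
    return bin(n1 & n2).count("1")

# Two 256-byte test fingerprints with a known overlap:
fp1 = b"\xff" * 16 + b"\x00" * 240   # first 128 bits set
fp2 = b"\x0f" * 32 + b"\x00" * 224   # low nibble of the first 32 bytes set
print(byte_intersect(fp1, fp2))      # 16 overlapping bytes * 4 bits = 64
```

It's far slower than the C version, but handy for checking results.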

Similarity search using the intersection popcount

I'll write a new C function, threshold_bin_tanimoto_search(), defined in the updated popc.py, to do the search. If we know the query_popcount and target_popcount then the main search code is:

double total_popcount = (double) (query_popcount + target_popcount);
  ...
/* Do the search */ 
for (target_idx = 0; target_idx<num_targets; target_idx++) {
    intersect_popcount = byte_intersect_256(query_fp, target_fps + target_idx*256);
    score = intersect_popcount / (total_popcount - intersect_popcount);
    if (score >= threshold) {
        hit_indices[num_hits] = target_idx;
        hit_scores[num_hits] = score;
        num_hits++;
    }
}

Declaring total_popcount as a double not only pre-computes the sum (A+B) once, it also ensures the A+B-C step produces a double, so the division is an int / double and yields the desired floating-point score. If I left the sum as an integer, then C's rules say an integer divided by an integer is an integer, and the score would truncate to 0 or 1.
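That truncation is easy to mimic with Python's // operator. Using the toy values from earlier (C = 3, A + B = 7), the integer version reports a score of 0 instead of 0.75:

```python
intersect_popcount = 3   # C
total_popcount = 7       # A + B

# What C computes if total_popcount stays an int: truncating division.
print(intersect_popcount // (total_popcount - intersect_popcount))        # 0

# What we want: the sum held as a double before subtracting.
print(intersect_popcount / (float(total_popcount) - intersect_popcount))  # 0.75
```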

However, take a second look at that division. If both query_popcount and target_popcount are 0 then score = 0/0. I'll define Tanimoto(0, 0) to be 0, but I still need to handle that case myself instead of letting IEEE 754 math return a NaN or worse.

I don't need to handle that case in the main loop because I can check for zero popcounts before the loop starts. While I'm at it, if either popcount is 0 then the Tanimoto must also be 0, so depending on the threshold either all fingerprints in the bin will match or none of them will.

That test looks like:

/* Handle edge cases if one or both of the fingerprints has no bits set. */
if ((threshold > 0.0) && ((query_popcount == 0) || (target_popcount == 0))) {
    /* Nothing can match: Tanimoto(0, x) = 0 < threshold */
    return 0;
} else if ((threshold <= 0.0) && (total_popcount == 0.0)) {
    /* Everything will match: Tanimoto(0, 0) = 0 >= threshold */
    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        hit_indices[target_idx] = target_idx;
        hit_scores[target_idx] = 0.0;
    }
    return num_targets;
}
/* Do the search */
    ...
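Putting the edge-case test and the main loop together, here's a pure-Python sketch of the per-bin search (hypothetical function name and pure-Python popcounts; the real work happens in the C version):

```python
def threshold_bin_search(query_fp, query_popcount, threshold,
                         target_fps, target_popcount):
    """Search one popcount bin; return a list of (index, score) pairs."""
    total_popcount = query_popcount + target_popcount
    # Edge cases when one or both fingerprints have no bits set.
    if threshold > 0.0 and (query_popcount == 0 or target_popcount == 0):
        return []  # Tanimoto(0, x) = 0 < threshold: nothing can match
    if threshold <= 0.0 and total_popcount == 0:
        return [(i, 0.0) for i in range(len(target_fps))]  # everything matches
    # Main loop: one intersection popcount per target.
    q = int.from_bytes(query_fp, "big")
    hits = []
    for i, fp in enumerate(target_fps):
        C = bin(q & int.from_bytes(fp, "big")).count("1")
        score = C / (total_popcount - C)
        if score >= threshold:
            hits.append((i, score))
    return hits

# Toy bin of popcount-4 fingerprints, searched with a popcount-4 query.
print(threshold_bin_search(b"\x0f", 4, 0.5, [b"\x0f", b"\xf0"], 4))  # [(0, 1.0)]
```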

Changes at the Python level

Finally, I need to update the Python code to handle the new C function. Here's the relevant code from yesterday's chembl_bitbound_search_in_C.py:

for target_ids, target_fingerprints in target_popcount_bins[min_popcount:max_popcount+1]:
    # Do the Tanimoto search. 
    num_hits = threshold_tanimoto_search(
        query_fp, threshold,
        len(target_ids), target_fingerprints,
        hit_indices, hit_scores)

The target popcount is implicit in the Python iterator but I need an explicit integer. I decided to iterate over indices, which changed the code slightly.

for target_popcount in range(min_popcount, max_popcount+1):
    target_ids, target_fingerprints = target_popcount_bins[target_popcount]
    # Do the Tanimoto search. 
    num_hits = threshold_bin_tanimoto_search(
        query_fp, query_popcount, threshold,
        len(target_ids), target_fingerprints, target_popcount,
        hit_indices, hit_scores)

(A perhaps more Pythonic solution is to store the popcount as part of each bin in target_popcount_bins, so instead of a 2-element (list of ids, fingerprint block) it's a 3-element (popcount, list of ids, fingerprint block).)
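That alternative might look like the following sketch, with dummy bins standing in for the real id lists and fingerprint blocks:

```python
# Dummy 2-element bins indexed by popcount: (list of ids, fingerprint block).
old_bins = [(["CHEMBL_A"], b"\x00"), (["CHEMBL_B"], b"\x01"), (["CHEMBL_C"], b"\x03")]

# Store the popcount inside each bin, making it a 3-element tuple.
target_popcount_bins = [(popcount, ids, fps)
                        for popcount, (ids, fps) in enumerate(old_bins)]

# The slice-based iteration still works, and the popcount comes along for free.
min_popcount, max_popcount = 1, 2
for target_popcount, target_ids, target_fps in \
        target_popcount_bins[min_popcount:max_popcount+1]:
    print(target_popcount, target_ids)
```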

And that's it. The final program is chembl_bitbound_search_in_C_v2.py. You'll need to download and run the updated popc.py to generate a new _popc module with the new C functions.

Verification and timings

First, a quick check with caffeine as the query:

% python chembl_bitbound_search_in_C.py --times
No query specified, using caffeine.
caffeine	CHEMBL113	1.000
caffeine	CHEMBL1232048	0.710
load: 8.81 search: 0.01 #queries: 1 #/sec: 84.53
% python chembl_bitbound_search_in_C_v2.py --times
No query specified, using caffeine.
caffeine	CHEMBL113	1.000
caffeine	CHEMBL1232048	0.710
load: 8.89 search: 0.01 #queries: 1 #/sec: 119.54

Looks good so far, and with some speedup. Now with 1,000 Wikipedia SMILES:

% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi --threshold 0.7 | sort > old_sorted.out
load: 11.33 search: 17.93 #queries: 994 #/sec: 55.44
% python chembl_bitbound_search_in_C_v2.py --times --queries wikipedia_1000.smi --threshold 0.7 | sort > new_sorted.out
load: 10.80 search: 11.36 #queries: 994 #/sec: 87.52
% cmp old_sorted.out new_sorted.out
Identical results (whew), and about a 40% reduction in search time.

What about at a threshold of 0.35?

% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi --threshold 0.35 > /dev/null
load: 11.87 search: 49.19 #queries: 994 #/sec: 20.21
% python chembl_bitbound_search_in_C_v2.py --times --queries wikipedia_1000.smi --threshold 0.35 > /dev/null
load: 10.56 search: 28.26 #queries: 994 #/sec: 35.17 

Also about 40% faster.

Intersection popcount speeds up full in-memory search too

Incidentally, if you go into chembl_bitbound_search_in_C_v2.py and change the line:

if threshold > 0.0: 

to

if 0 and threshold > 0.0:

then that disables the BitBound limits and the implementation reverts to a full in-memory search, but with the performance benefit of computing fewer popcounts during the search. I did that, and the timings for a threshold of 0.35 are:

load: 9.30 search: 39.45 #queries: 994 #/sec: 25.20

The full in-memory search using byte_tanimoto_256() took 49.19 seconds, so switching to byte_intersect_256() makes the full search about 20% faster.

What's left?

If you've made it this far, and understood what I've done, then congratulations!

One of my goals in writing this series of similarity search implementations was to get you to the point where you could understand how similarity search implementations work, try it out for yourself, and see that it's really not that hard, and even rather fun.

We know there are ways to speed up the code still further because chemfp's simsearch is about twice as fast:

% simsearch --times --queries wikipedia_1000.smi --threshold 0.35 chembl_27.fps > /dev/null
open 5.83 read 0.25 search 15.39 output 0.67 total 22.15

However, the big, easy gains are gone. (Unless you have access to a machine with an AVX-512 popcount instruction!) What's left is an improved search method to reduce the number of divisions needed, an AVX2 popcount implementation, memory prefetching, and a bunch of standard optimization techniques. I mentioned them all in my paper The chemfp project.

If you're the type of person who really enjoys this sort of problem - have at it! But if you just want fast cheminformatics similarity search, why not try out chemfp?

Plus, source code licensing is available, if you really want to dig into the details of how chemfp works.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB