Andrew Dalke's writings

Are 32-bit floats enough for your Tanimoto scores?

Mon, 15 Feb 2021 12:00:00 GMT

Chemfp uses 64-bit IEEE 754 floating point (a.k.a. double or binary64, or float64) to represent cheminformatics fingerprint similarity scores.

Some fingerprint search systems instead use 32-bit IEEE 754 floating point (a.k.a. single or binary32 or float32), usually to reduce the amount of memory needed, and sometimes also for the higher performance of 32-bit floating point operations.

I chose to represent everything in binary64 for a simple reason: chemfp is designed to work with Python. Python's native float data type is binary64, and I find that mixing 32-bit and 64-bit floats results in subtle conversion problems that are frustrating to figure out.

(Consider that the binary32 representation closest to 0.6 has a value slightly larger than 0.6, while the corresponding binary64 representation's value is slightly smaller than 0.6.)

Other than possible conversion problems, is there any reason to NOT use a binary32 value?

In this essay I'll show that binary32 isn't able to distinguish some Tanimoto scores for fingerprints which are 4142 bits long, or longer, though in practice this is essentially never a concern.

Real-world Tanimotos scores and binary32

As a brief background, most fingerprints are binary bitstrings of length around 1,000 bits. Let ‖fp‖ be the number of on-bits in a fingerprint bitstring. The most common way to compare two fingerprints is the Jaccard similarity, which in cheminformatics is referred to as the Tanimoto. It is computed as ‖fp₁&fp₂‖/‖fp₁|fp₂‖, and ranges between 0.0 and 1.0. In chemfp, 0/0 = 0.0.

A simple examination shows that 64 bits is overkill. I can't recall ever reading a paper using over 2¹⁴ = 16,384 bits. This means the numerator and denominator of every Tanimoto requires at most 14 bits, so should easily fit in 32-bits.

This seems like a binary32 value should work. However, some of those bits are used to store values that aren't needed for Tanimoto scores. The sign bit isn't needed because Tanimoto scores can never be negative. A further 8 bits are reserved for 254 possible exponents, but only 7 or so of those exponents are needed.

In the worst case, we only have 24 bits of significand precision, which gives 12 bits each for the numerator and denominator, corresponding to a maximum fingerprint length of 4096 bits.

But surely we can do a bit better than 4096 because the values under 0.1 use a different exponent. Perhaps we can scale up to 4096 / 0.9 = 4551 bits? Furthmore, some of those Tanimoto ratios are not in reduced form, that is, 1/2 = 2/4 = 4/8 = 0.5. Perhaps that gives some more breathing room?

When is 32-bit IEEE 754 floats not enough?

What is the smallest fingerprint length with two distinct possible scores that cannot be distinguished using binary32? The brute-force solution is to enumerate all ratios and report when there are duplicates, as in the following:

# Find the smallest fingerprint length where two different Tanimoto
# scores have the same 32-bit float representation.

import struct

# Convert the value to the 4-byte/32-bit representation.
pack_f32 = struct.Struct("f").pack

# Mapping from packed_score (float-as-bytes) to (value, numerator, denominator)
seen_packed_scores = {}

# Try each ratio
for denominator in range(1, 5000):
    for numerator in range(1, denominator):
        # This uses Python's native 64-bit float
        score = numerator / denominator

        # Get the 4-byte representation of the score
        packed_score = pack_f32(score)

        # Check if that packed score was seen before
        prev = seen_packed_scores.get(packed_score, None)
        if prev is None:
            # Not seen, so record it in case I see it again.
            seen_packed_scores[packed_score] = (score, numerator, denominator)
            continue

        # The packed score can be the same if the scores are the same.
        # (4/8 and 5/10 give the same scores)
        prev_score, prev_numerator, prev_denominator = prev
        if score != prev_score:
            print(f"{score} = {numerator}/{denominator} and {prev_numerator}/{prev_denominator}")
            raise SystemExit(1)

print("No duplicates found.")

This takes about 10 seconds before it reports:

0.5111057460164172 = 2117/4142 and 2094/4097

That means that binary32 cannot distinguish between the Tanimoto scores 2117/4142 and 2094/4097, so cannot be used to represent all possible Tanimoto scores for fingerprints that are 4142 bits long or longer.

Don't worry!

On the other hand, do you really need to represent all possible Tanimoto scores? Most people don't work with fingerprints beyond about 4096 bits. Most fingerprints are sparse, with an on-bit density of less than 10%. If the fingerprint type is 8192 bits long and has a maximum bit density of 20% then the Tanimoto denominator is at most 3277 bits, which is within the expected limit.

I honestly don't think that real-world work will run into this issue.

To be sure, it is possible to get Tanimoto scores in this range. The densest fingerprints I know are the RDKit fingerprints. The following program uses chemfp's interface to RDKit to find the ChEMBL records with a successively increasing number of bits set, using 16384-bit fingerprints:

import chemfp
from chemfp import bitops
fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=16384")

max_num_bits = 0 
for id, fp in fptype.read_molecule_fingerprints("chembl_27.sdf.gz", errors="ignore"):
  num_bits = bitops.byte_popcount(fp)
  if num_bits > max_num_bits:
    print(id, num_bits)
    max_num_bits = num_bits

The output until I stopped in manually is:

CHEMBL153534 819
CHEMBL440060 1244
CHEMBL440245 1290
CHEMBL440249 2656
CHEMBL504077 3546
CHEMBL501923 3600
CHEMBL501238 3648
CHEMBL441925 3783
CHEMBL439119 5981
CHEMBL438241 5999
CHEMBL1207820 6007
CHEMBL411986 6036
CHEMBL263314 6069
CHEMBL268814 6124
CHEMBL404974 6141
CHEMBL405910 6196
CHEMBL410076 7382
^C

A spot-check with CHEMBL411986 and CHEMBL1207820, using the SQLite3 file from the ChEMBL distribution to get the corresponding SMILES, is:

import sqlite3
import chemfp
from chemfp import bitops

db = sqlite3.connect("chembl_27.db")
(smiles1, id1), (smiles2, id2) = db.execute("""
   SELECT canonical_smiles, chembl_id FROM compound_structures, chembl_id_lookup
    WHERE chembl_id_lookup. chembl_id IN ("CHEMBL411986", "CHEMBL1207820")
      AND chembl_id_lookup.entity_id = compound_structures.molregno""")

fptype= chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=16384")
fp1 = fptype.parse_molecule_fingerprint(smiles1, "smistring")
fp2 = fptype.parse_molecule_fingerprint(smiles2, "smistring")
print(f"{id1} has {bitops.byte_popcount(fp1)} bits set.") 
print(f"{id2} has {bitops.byte_popcount(fp2)} bits set.")
print(f"intersection has {bitops.byte_intersect_popcount(fp1, fp2)} bits set.")
print(f"union has {bitops.byte_popcount(bitops.byte_union(fp1, fp2))} bits set.")
print(f"Tanimoto({id1}, {id2}) = {bitops.byte_tanimoto(fp1, fp2)}.")

The above program reports:

CHEMBL1207820 has 6007 bits set.
CHEMBL411986 has 6036 bits set.
intersection has 5744 bits set.
union has 6299 bits set.
Tanimoto(CHEMBL1207820, CHEMBL411986) = 0.9118907763137006.

This confirms there may be an issues with some fingerprints. Even with 8192-bit fingerprints the Tanimoto is still 4767/5115 which is in the range that cause problems.

However, even supposing I could find a real-world pair of Tanimotos where the actual Tanimoto scores were different but the binary32 scores were the same, the real question is, does that make a difference?

I think not. After all, fingerprints do not give an accurate enough representation of a molecule to justify four digits of precision.

So, don't worry! (Or use a tool like chemfp which doesn't use binary32 floats. ;) ).

Farey sequence

The Python program was easy to write, but a bit slow, and it used a lot of memory to store all of the distinct values. That's why I called it the brute-force solution.

To be sure, it only took a few seconds, which was more than fast enough to answer the question posed. But what if performance was more critical?

The usual solutions are to switch to PyPy (a faster implementation of the Python language), or to switch to a language like C which usually has less performance overhead.

However, C has no built-in dictionary type, making a direct translation difficult. I could use C++ or other more sophisticated language. Instead, I'll switch algorithms.

Twenty-some-odd years ago, John MacCuish pointed out to me that the possible Tanimoto scores for a given length N were the same as the terms in the Farey sequence for N. This was part of the paper he co-authored describing issues with ties in proximity and clustering compounds.

The Farey sequence for N is the ordered sequence of distinct values between 0 and 1 where each value is a rational with the denominator at most N. Farey gave a simple algorithm for how to generate the reduced rationals that make up that sequence. The following is a shortened version of the Python example in the Wikipedia entry, along with its output for N=4:

def farey_sequence(n):
    (a, b, c, d) = (0, 1, 1, n)
    yield (0, 1)
    while c <= n:
        k = (n + b) // d
        (a, b, c, d) = (c, d, k * c - a, k * d - b)
        yield a, b


>>> list(farey_sequence(4))
[(0, 1), (1, 4), (1, 3), (1, 2), (2, 3), (3, 4), (1, 1)]

That's pretty easy to write in C. The following, farey_check.c, takes a single command-line value N and reports the first pair of ratios which cannot be distinguished using a C float (which is a binary32):

/* Call this "farey_check.c" */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int a, b, c, d;
  int next_a, next_b, next_c, next_d;
  int k;
  long n;
  float prev_score, score;
  char *endptr;

  if (argc != 2) {
    fprintf(stderr, "Usage: %s N\n", argv[0]);
    exit(1);
  }
  n = strtol(argv[1], &endptr, 10);
  if (*endptr != '\0' || n < 1 || n > 10000) {
    fprintf(stderr, "N must be an integer between 1 and 10,000\n");
    exit(1);
  }
  /* Modified from https://en.wikipedia.org/wiki/Farey_sequence#Next_term */

  prev_score = 0.0;
  a=0, b=1, c=1, d=n;

  while (c <= n) {
    /* Get the next value in the Farey sequence */
    k = (n + b) / d;
    
    next_a = c;
    next_b = d;
    next_c = k * c - a;
    next_d = k * d - b;

    /* Compute the ratio */
    score = ((float) next_a) / next_b;
    
    /* Check if a 32-bit float can distinguish this value */
    /* from the previous. If not, replace the value and its ratios. */
    if (score == prev_score) {
      printf("%f = %d/%d and %d/%d\n", score, a, b, next_a, next_b);
      break;
    }

    prev_score = score;
    a = next_a;
    b = next_b;
    c = next_c;
    d = next_d;
  }
}

I'll compile and run it:

% cc -O2 farey_check.c -o farey_check
% ./farey_check 5000
0.500100 = 2499/4997 and 2498/4995
% time ./farey_check 5000
0.500100 = 2499/4997 and 2498/4995
0.056u 0.002s 0:00.06 83.3%	0+0k 0+0io 0pf+0w

It took about 60 milliseconds to find a failure case.

However, that ratio uses larger denominators than the one I found earlier, which was:

0.5111057460164172 = 2117/4142 and 2094/4097

The problem is that the farey_check program finds a failure case for the given value of N, rather than the minimum value of N which has a failure case.

Find minimum N using a Farey sequence

The obvious approach is to test every value of N. That took about 80 seconds to run. While C is significantly faster than Python, the values in the Farey sequence for N+1 includes all of the terms in the Farey sequence for N, which really slows things down.

Instead, I'll use a binary search.

BEWARE! Binary search is notoriously tricky to implement correctly. What I usually do is either use Python's bisect module directly (which contains several binary search variants), or look to its implementation for guidance. Even then, be aware that one of the steps, noted below, may overflow a C integer.

Here's the full search code, which does a binary search between 1 and 10,000 to find the minimim value of N with a failure case, and to report that error case:

#include <stdio.h>
#include <stdlib.h>

/* Return 1 if two values have the same 32-bit float representation, else return 0. */
/* Modified from https://en.wikipedia.org/wiki/Farey_sequence#Next_term */
int farey_check(int n, int *a1, int *b1, int *a2, int *b2) {
  int a, b, c, d;
  int next_a, next_b, next_c, next_d;
  int k;
  float score, prev_score;

  prev_score = 0.0;
  a=0, b=1, c=1, d=n;

  while (c <= n) {
    /* Get the next value in the Farey sequence */
    k = (n + b) / d;
    
    next_a = c;
    next_b = d;
    next_c = k * c - a;
    next_d = k * d - b;

    /* Compute the ratio */
    score = ((float) next_a) / next_b;
    
    /* Check if a 32-bit float can distinguish this value from */
    /* the previous. If not, record the ratios and return 1. */
    if (score == prev_score) {
      *a1 = a;
      *b1 = b;
      *a2 = next_a;
      *b2 = next_b;
      return 1;
    }

    prev_score = score;
    a = next_a;
    b = next_b;
    c = next_c;
    d = next_d;
  }
  /* No failure case found. */
  return 0;
}
  

int main(int argc, char *argv[]) {
  int lo = 1;
  int hi = 10000;
  int mid;

  int a1, b1, a2, b2;

  /* bisect_left from Python's bisect module */
  while (lo < hi) {
    mid = (lo + hi) / 2; /* NOTE: Not valid if lo + hi > MAXINT */
    if (farey_check(mid, &a1, &b1, &a2, &b2) < 1) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  /* Re-run to get the ratios for the failure case. */
  farey_check(mid, &a1, &b1, &a2, &b2);
  printf("N: %d\n", lo);
  printf("%f = %d/%d and %d/%d\n", ((float) a1)/b1, a1, b1, a2, b2);
}

This takes about 0.5 seconds to report:

N: 4142
0.511106 = 2094/4097 and 2117/4142

which is about 20x faster than the original Python program, and it uses constant memory.

It also took several times longer to write. ;)

Tversky similarity

A secondary reason that chemfp uses binary64 is to store Tversky similarity search results in the same data structure as Tanimoto similarity search results.

Tversky similarity uses parameters alpha and beta to give different weights to shared bits than the unshared bits. If alpha = beta = 1.0 then it's identical to the Tanimoto similarity.

Suppose N=1024 bits and alpha = 0.15 and beta = 0.85. The possible Tversky scores are still in a Farey sequence, but in this case they are in Farey N*20 because alpha=3/20 and beta=17/20. And not all of the terms in that Farey sequence correspond to an accessible Tversky score.

I think it's easier to store the results in a binary64 than figure out if binary32 has sufficient space to store the results a given similarity method, or worry about possible consequences if it does not.

Download chemfp and try it out using:

python -m pip install chemfp -i https://chemp.com/packages/

No license key needed to use it with the pre-built ChEMBL fingerprints.

Industrial and no-cost academic licensing available. Contact me for a license key to evaluate chemfp with your data sets and ask about pricing.

A molfile precursor?

Fri, 09 Oct 2020 12:00:00 GMT

I think I found a precursor to the MDL molfile in a 1973 publication by Gund, Wipke, and Langridge. Here it is:

Background: MDL and MACCS

The SDFile format is one of a suite of related formats which came out of MDL (Molecular Design Limited) starting in the late 1970s. (An SDFile is a molfile followed by a data block followed by the SDFile delimiter.) Quoting Wikipedia:

MDL was the first company to provide interactive graphical registry and full and substructural retrieval. The company's initial products were first-of-their-kind systems for storing and retrieving molecules as graphical structures and for managing databases of chemical reactions and related data. These systems revolutionized the way scientists accessed and managed chemical information in the 1980s.
From its initial pioneering of computer handling of graphical chemical structures with MACCS (Molecular ACCess System) in 1979, MDL continued at the forefront of the field now known as cheminformatics

(Yes, that's the same MACCS as the 166-bit MACCS keys, still widely used for fingerprint similarity. They were developed in the MACCS system as a substructure screen. So many people had a MACCS system, with the screens already computed, that it was a natural data source for the early similarity search implementations.)

Gund, Wipke, and Langridge (1974)

MDL was co-founded by W. Todd Wipke in 1978, who had been working on computer-assisted chemical synthesis design for over a decade.

I came across a 1974 publication, co-authored by Wipke, in proceedings from a 1973 conference in Ljubljana/Zagreb, which appears to contain an early version of what was to become a molfile. Here's the relevant text from the second page of the paper (the page number on the bottom of the page is 5/34):

DATA STRUCTURE. For this preliminary study, an extremely simple format was adopted for the molecular data files, as shown in figure 1. The pattern data files are exactly the same, except the bond order may be zero. An advantage of having the same format is that patterns may be displayed and manipulated as if they were molecules.

and here's Fig. 1, Format for molecular structure files, at the top of the next page (5/35):

You can see the basic structure is similar to a Molfile, though a Molfile has two additional lines for misc. info., and the counts line and the atom and bond lines all have a few more fields.

The full citation is: Gund, P., Wipke, W. T., and Langridge, R., Computer Searching of a Molecular Structure File for Pharmacophoric Patterns, Computers in Chemical Research and Education, vol 3, pp. 33-38 (1974).

The publication contains papers from a conference in in 1973 but Google Scholar says it was published in 1974 so I'm going with that. Also, the above quote is from the second page of the paper and the figure is at the top of the third page. The page numbers on the bottom of the pages are "5/34" and "5/35", respectively.

It does not appear to be based on a previously widely used connection table format. The 1971 textbook Computer handling of chemical structure information by Lynch, Harrison, Town, and Ash does not show any format with coordinate information, though it does mention that some exist.

Do you know of an earlier molfile-like format? Let me know!

Cache and reuse popcount-sorted ChEMBL fingerprints

Thu, 08 Oct 2020 12:00:00 GMT

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

Simple FPS fingerprint similarity search: variations on a theme
Simple k-NN FPS Tanimoto and cosine similarity search
Simple in-memory ChEMBL similarity search
Simple BitBound ChEMBL similarity search
Using RDKit BulkTanimotoSimilarity
Faster in-memory ChEMBL search by using more C
Faster BitBound ChEMBL search by using more C
Even faster in-memory search with intersection popcount
Cache and reuse popcount-sorted ChEMBL fingerprints

The program I wrote in the first essay of this series of ChEMBL fingerprint Tanimoto search algorithms took 2 seconds per query. I wrote a series of new programs which had a higher startup cost in exchange for a faster per-query searches. The final program, in yesterday's essay took 10 seconds to load but could process 35-90 queries per second, depending on the threshold.

In this short essay I'll use Python's pickle module to cache and re-use the loaded fingerprint data to reduce the load time to 0.70.

If you want to jump right to the code:

Download and uncompress chembl_27.fps.gz.
Download and run popc.py if you haven't already (it's unchanged from yesterday).
Download chembl_bitbound_search_in_C_v3.py.
Run python chembl_bitbound_search_in_C_v3.py --times to do the default caffeine search and see the internal timings. This will cache the fingerprint data to ./chembl_27.fps.pkl
Run python chembl_bitbound_search_in_C_v3.py --times to do the default caffeine search and see the internal timings using the pickle file. The load time should be much faster and the search time about the same.

Amortize the load time

The initial load takes about 10 seconds because the program processes each fingerprint to place it into the right bin, based on its popcount, then merges all fingerprints in the same popcount bin into a single byte string.

This organization is done every time you run the search program.

However, if you plan to run the program many times then it's worthwhile to see if the organized data can be stored (in this case, to a file) in some way which is faster to load when needed. If so, then overhead of organizing the data and storing it can be done once. In computer science terms, the overhead cost is amortized across all of the times it's needed, assuming the new load time is low enough

Python pickle

Python has a few ways to turn a simple data structure into a string or file, and convert it back into the the original data structure. The most common (and fastest, in my testing) is the pickle module. I'll start by walking through an example which pickles a 3-element tuple into a byte string then converts the bytestring back into an equivalent 3-element tuple:

>>> import pickle
>>> data = (1, b"abc", 9.8)
>>> pkl_data = pickle.dumps(data)
>>> pkl_data
b'\x80\x04\x95\x14\x00\x00\x00\x00\x00\x00\x00K\x01C\x03abc\x94G@#\x99\x99\x99\x99\x99\x9a\x87\x94.'
>>> pickle.loads(pkl_data)
(1, b'abc', 9.8)

There are also functions to read to and write from file opened in binary mode:

>>> with open("datafile.pkl", "wb") as outfile:
...    pickle.dump(data, outfile)
...
>>> with open("datafile.pkl", "rb") as infile:
...    print(pickle.load(infile))
...
(1, b'abc', 9.8)

NOTE: Pay attention to this warning from the documentation:

Warning: The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

Luckily for us, that's not a concern for this essay.

Implement a fingerprint pickler

The chembl_bitbound_search_in_C_v2.py program uses about 16 lines (with comments and a blank line) to create the list target_popcount_bins and figure out max_bin_size. Here's the code we're starting off with, indented one extra level so it makes since in the new program:

        # Create 2049 bins, each with an empty list, ordered by popcount.
        target_popcount_bins = [[] for i in range(2049)]
        for id_fp_pair in read_fingerprints(open(args.targets)):
            # Place each record into the right bin.
            popcount = byte_popcount_256(id_fp_pair[1])
            target_popcount_bins[popcount].append(id_fp_pair)
    
        # Split the ids and fingerprints into parallel lists then merge the fingerprints
        max_bin_size = 0
        for popcount, id_fp_pairs in enumerate(target_popcount_bins):
            if id_fp_pairs:
                target_ids, target_fingerprints = list(zip(*id_fp_pairs))
                target_popcount_bins[popcount] = (target_ids, b"".join(target_fingerprints))
                max_bin_size = max(max_bin_size, len(target_ids))
            else:
                target_popcount_bins[popcount] = [[], b""]

I'll save the two-element tuple (max_bin_size, target_popcount_bins) into a pickle file, with the command-line argument --no-cache to disable caching. Assuming cache_filename contains the name of the file to store the cached data, and and args.no_cache is True if --no-cache is set, then saving is pretty simple:

        if not args.no_cache:
            with open(cache_filename, "wb") as pkl_file:
                pickle.dump((max_bin_size, target_popcount_bins), pkl_file)

and reading it is a matter of: if the cache file is there, use it, otherwise use the original code to read the fingerprint file then save to the cache file. Which looks like:

    cache_filename = "chembl_27.fps.pkl"
    if os.path.exists(cache_filename) and not args.no_cache:
        with open(cache_filename, "rb") as pkl_file:
            max_bin_size, target_popcount_bins = pickle.load(pkl_file)
    else:
        # Create 2049 bins, each with an empty list, ordered by popcount.
        target_popcount_bins = [[] for i in range(2049)]
        for id_fp_pair in read_fingerprints(open(args.targets)):
            …

Other than a few minor changes, like importing the os and pickle modules, and adding the new command-line option, like this:

parser.add_argument("--no-cache", action="store_true",
                        help="don't read or write the processed fingerprints to 'chembl_27.fps.pkl'")

... we're done!

Timing the new loader

I'll do the default search for caffeine using the new chembl_bitbound_search_in_C_v3.py:

% python chembl_bitbound_search_in_C_v3.py --times
No query specified, using caffeine.
caffeine	CHEMBL113	1.000
caffeine	CHEMBL1232048	0.710
load: 13.21 search: 0.01 #queries: 1 #/sec: 109.08

You can see the load time went up by another couple of seconds. That's the extra time needed to save the 500 MB cache file:

% ls -lh chembl_27.fps.pkl
-rw-r--r--  1 dalke  staff   503M Oct  8 20:36 chembl_27.fps.pkl

Now that the cache file is created, I'll try the search again:

% python chembl_bitbound_search_in_C_v3.py --times
No query specified, using caffeine.
caffeine	CHEMBL113	1.000
caffeine	CHEMBL1232048	0.710
load: 0.78 search: 0.01 #queries: 1 #/sec: 162.16

Well under one second - more than an order of magnitude faster! In fact, it's faster than scanning the FPS file.

Which means if you can take a one-time cost of preparing and storing the 500 MB pickle file then it will be faster to search than the original FPS file, even for a single query.

Lastly, a check with the Wikipedia SMILES:

% python chembl_bitbound_search_in_C_v3.py --queries wikipedia_1000.smi --threshold 0.35 --times > /dev/null
load: 0.70 search: 30.39 #queries: 994 #/sec: 32.71

Yes, the load time is faster and the search time is still about the same. As expected.

Chemfp's FPB format

Chemfp natively supports the FPS and FPB formats. I gave an overview of the FPS format in the first essay of the series.

Conceptually the FPB format is similar to this pickle format. It contains the fingerprints sorted by popcount to make good use of the BitBound algorithm, and it stores the ids. It also stores a hash table to make it easy to look up a fingerprint given its id, as well as the FPS header metadata, so the actual file is slightly larger than the pickle file you saw earlier:

% time fpcat chembl_27.fps -o chembl_27.fpb
6.949u 1.403s 0:08.91 93.6%	0+0k 0+0io 676pf+0w
% ls -lh chembl_27.fpb
-rw-r--r--  1 dalke  staff   534M Oct  8 21:57 chembl_27.fpb

Most importantly, it's arranged so the data can be loaded with almost no additional processing. If the FPB file is uncompressed then the contents are memory-mapped, meaning the chemfp implementation uses the data as-is. This combines well with the BitBound algorithm because for some searches, like if the threshold is high, the search doesn't even need to read all of the data from disk.

In the following caffeine search, the load time is under 10 milliseconds, simply because there aren't enough digits in the output (I reordered the output manually so the times, which are written to stderr, were at the end after the search results written to stdout.)

% simsearch --query 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C' --threshold 0.7 chembl_27.fpb --times
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=all threshold=0.7
#software=chemfp/3.4.1
#targets=chembl_27.fpb
2	Query1	CHEMBL113	1.00000	CHEMBL1232048	0.70968
open 0.00 read 0.00 search 0.04 output 0.00 total 0.04

With the full 1000 Wikipedia SMILES you can see the load time is a bit higher, but still quite small:

% simsearch --queries wikipedia_1000.smi --threshold 0.35 chembl_27.fpb --times > /dev/null
open 0.03 read 0.26 search 15.55 output 1.19 total 17.02

(When you have that many hits, simply formatting the scores to print them takes significant time - 7% of the total time in this case!)

chemfp advertising

chemfp is available with academic and commercial licenses. Academics can even get a no-cost license key to use all of the features in the pre-compiled package for Linux-based OSes. See chemfp's download page for installation instructions, then go to the license page to request an evaluation license key for your license of choice.

Even faster in-memory search with intersection popcount

Wed, 07 Oct 2020 12:00:00 GMT

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

Simple FPS fingerprint similarity search: variations on a theme
Simple k-NN FPS Tanimoto and cosine similarity search
Simple in-memory ChEMBL similarity search
Simple BitBound ChEMBL similarity search
Using RDKit BulkTanimotoSimilarity
Faster in-memory ChEMBL search by using more C
Faster BitBound ChEMBL search by using more C
Even faster in-memory search with intersection popcount
Cache and reuse popcount-sorted ChEMBL fingerprints

This is part of a series of essays I started to write a week ago where I use a few different approaches to implement cheminformatics fingerprint similarity search.

My initial file scan implementation took 2 seconds per query. I then switched to a two stage in-memory search with a load step and a search step. Loading the data (an initialization cost) took 4.6 seconds and per-query search took about 0.5 seconds, making it more effective if there are three or more queries.

Two days ago I sped up in-memory search by moving more of the search into C. This had even higher initialization overhead, at about 7 seconds, but could process 14 queries per second (70 ms/query). Yesterday I re-implemented last week's BitBound algorithm to use more C code. With the threshold of 0.7 I've been using for benchmarking, the new load time was nearly 9 seconds but search performance increased to almost 60 queries per second.

In this essay I'll improve the search performance by reducing the number of popcount calculations needed during the Tanimoto calculation.

If you want to jump right to the code:

Download and uncompress chembl_27.fps.gz.
Download the updated popc.py.
Run python popc.py to (re)generate the _popc module.
Download chembl_bitbound_search_in_C_v2.py.
Run python chembl_bitbound_search_in_C_v2.py --times to do the default caffeine search and see the internal timings.
Run python chembl_bitbound_search_in_C_v2.py --help to see the command-line help, then try your own queries, including with different thresholds.

threshold=0.35

BitBound search gives a linear speedup to search. The higher the threshold, the faster the speedup. My benchmarking used a threshold of 0.7 solely so the caffeine search of ChEMBL returns two matches, and not for any deeper reason.

In practice, different fingerprint types have different thresholds where the Tanimoto similarity becomes chemically significant. For the ECFP-like fingerprints (including RDKit's Morgan fingerprint) this can be as low as 0.35 or so. I'll compare my full in-memory search program with my BitBound implementation to see how they compare at a more chemically relevant score:

% python chembl_in_memory_search_in_C.py  --times --queries wikipedia_1000.fps --threshold 0.35 | sort > old_sorted.out
load: 8.65 search: 69.32 #queries: 994 #/sec: 14.34
% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.fps --threshold 0.35 | sort > new_sorted.out
load: 9.33 search: 43.21 #queries: 994 #/sec: 23.01
% cmp old_sorted.out new_sorted.out
% fgrep -c '*' old_sorted.out
88
% wc -l old_sorted.out
  559225 old_sorted.out
% python -c  "print((559225-88)/994)"
562.5120724346076

(The 88 * lines correspond to queries with no hits.)

With a threshold of 0.35 there are on average 562 hits per query, and the BitBound algorithm is only 60% faster, not the 4x faster we saw with a threshold of 0.7.

Reducing the number of popcounts

Most of the search time is spent in C, in this part of threshold_tanimoto_search() defined in popc.c:

    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        score = byte_tanimoto_256(query_fp, target_fps + target_idx*256);
        if (score >= threshold) {

and most of that time is spent in this part of byte_tanimoto_256():

    for (int i=0; i<num_words; i++) {
        intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
        union_popcount += __builtin_popcountll(fp1_64[i] | fp2_64[i]);
    }

I can speed that up by reducing the number of popcount calculations.

And I can do that with one key observation. If we somehow know that popcount(fp1) = A and popcount(fp2) = B and if we compute C = popcount(fp1 & fp2), then Tanimoto(fp1, fp2) = C / (A + B - C). The - C in the denominator (A + B - C) prevents the common bits from being double-counted.

Finally, my BitBound implementation already computes the target fingerprint popcounts (to put them into the right bins, based on the popcount) and the query fingerprint (in order to get the BitBound limits), so I already have A and B. I just need some way to compute C.

Intersection popcount

I added a new function definition in popc.py, byte_intersect_256(). It computes the intersection popcount between two 2048-bit/256-byte fingerprints. It's a simplification of byte_tanimoto_256() so I think I can simply show it without explanation:

static int
byte_intersect_256(const unsigned char *fp1, const unsigned char *fp2) {
    int num_words = 2048 / 64;
    int intersect_popcount = 0;

    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp1_64 = (uint64_t *) fp1, *fp2_64 = (uint64_t *) fp2;

    for (int i=0; i<num_words; i++) {
        intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
    }
    return intersect_popcount;
}

Similarity search using the intersection popcount

I'll write a new C function, threshold_bin_tanimoto_search(), defined in the updated popc.py, to do the search. If we know the query_popcount and target_popcount then the main search code is:

double total_popcount = (double) (query_popcount + target_popcount);
  ...
/* Do the search */ 
for (target_idx = 0; target_idx<num_targets; target_idx++) {
    intersect_popcount = byte_intersect_256(query_fp, target_fps + target_idx*256);
    score = intersect_popcount / (total_popcount - intersect_popcount);
    if (score >= threshold) {
        hit_indices[num_hits] = target_idx;
        hit_scores[num_hits] = score;
        num_hits++;
    }
}

The double total_popcount is to not only pre-compute the sum (A+B) once, but to change it to a double so the A+B-C step results in a double, so the division is an int / double, resulting in the desired double. If I left the sum as an integer then C says that an integer divided by an integer is an integer.

However, take a second look at that division. If both query_popcount and target_popcount are 0 then score = 0/0. I'll define Tanimoto(0, 0) to be 0, but I still need to handle that case myself instead of letting IEEE 754 math return a NaN or worse.

I don't need to handle that case in the main loop because I can check if the two popcounts are 0. While I'm at it, if either popcounts is 0 then the Tanimoto must also be 0. So depending on the threshold, either all fingerprints will match, or none of them will match.

That test looks like:

/* Handle edge cases if one or both of the fingerprints has no bits set. */
if ((threshold > 0.0) && ((query_popcount == 0) || (target_popcount == 0))) {
    /* Nothing can match: Tanimoto(0, x) = 0 < threshold */
    return 0;
} else if ((threshold <= 0.0) && (total_popcount == 0.0)) {
    /* Everything will match: Tanimoto(0, 0) = 0 >= threshold */
    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        hit_indices[target_idx] = target_idx;
        hit_scores[target_idx] = 0.0;
    }
    return num_targets;
}
/* Do the search */
    ...

Changes at the Python level

Finally, I need to update the Python code to handle the new C function. Here's the relevant code from yesterday's chembl_bitbound_search_in_C.py:

for target_ids, target_fingerprints in target_popcount_bins[min_popcount:max_popcount+1]:
    # Do the Tanimoto search. 
    num_hits = threshold_tanimoto_search(
        query_fp, threshold,
        len(target_ids), target_fingerprints,
        hit_indices, hit_scores)

The target popcount is implicit in the Python iterator but I need an explicit integer. I decided to iterate over indices, which changed the code slightly.

for target_popcount in range(min_popcount, max_popcount+1):
    target_ids, target_fingerprints = target_popcount_bins[target_popcount]
    # Do the Tanimoto search. 
    num_hits = threshold_bin_tanimoto_search(
        query_fp, query_popcount, threshold,
        len(target_ids), target_fingerprints, target_popcount,
        hit_indices, hit_scores)

(A perhaps more Pythonic solution is to store the index as part of the target_popcount_bins, so instead of a the 2-element (list of fingerprints, fingerprint block) it's the 3-element (popcount, list of fingerprints, fingerprint block).)

And that's it. The final program is chembl_bitbound_search_in_C_v2.py. You'll need to download and run the updated popc.py to generate a new _popc module with the new C functions.

Verification and timings

First, a quick check with caffeine as the query:

% python chembl_bitbound_search_in_C.py --times
No query specified, using caffeine.
caffeine	CHEMBL113	1.000
caffeine	CHEMBL1232048	0.710
load: 8.81 search: 0.01 #queries: 1 #/sec: 84.53
% python chembl_bitbound_search_in_C_v2.py --times
No query specified, using caffeine.
caffeine	CHEMBL113	1.000
caffeine	CHEMBL1232048	0.710
load: 8.89 search: 0.01 #queries: 1 #/sec: 119.54

Looks good so far, and with some speedup. Now with 1,000 Wikipedia SMILES:

% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi --threshold 0.7 | sort > old_sorted.out
load: 11.33 search: 17.93 #queries: 994 #/sec: 55.44
% python chembl_bitbound_search_in_C_v2.py --times --queries wikipedia_1000.smi --threshold 0.7 | sort > new_sorted.out
load: 10.80 search: 11.36 #queries: 994 #/sec: 87.52
% cmp old_sorted.out new_sorted.out

Identical results (whew) and a 40% query speedup.

What about at a threshold of 0.35?

% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi --threshold 0.35 > /dev/null
load: 11.87 search: 49.19 #queries: 994 #/sec: 20.21
% python chembl_bitbound_search_in_C_v2.py --times --queries wikipedia_1000.smi --threshold 0.35 > /dev/null
load: 10.56 search: 28.26 #queries: 994 #/sec: 35.17

Also about 40% faster.

intersection popcount speeds up full in-memory search too

Incidentally, if you go into chembl_bitbound_search_in_C_v2.py and change the line:

if threshold > 0.0:

if 0 and threshold > 0.0:

then that disabled the BitBound limits and the implementation reverts to a full in-memory search, but with performance benefit of computing fewer popcounts during search time. I did that, and the timings for a threshold of 0.35 are:

load: 9.30 search: 39.45 #queries: 994 #/sec: 25.20

The full in-memory search using byte_tanimoto_256() took 49.19 seconds, so switching to byte_intersect_256() gives a 20% faster performance.

What's left?

If you've made it this far, and understood what I've done, then congratulations!

One of my goals in writing this series of similarity search implementations was to get you to the point where you could understand how similarity search implementations work, try it out for yourself, and see that it's really not that hard, and even rather fun.

We know there are ways to speed up the code still further because chemfp's simsearch is about twice as fast:

% simsearch --times --queries wikipedia_1000.smi --threshold 0.35 chembl_27.fps > /dev/null
open 5.83 read 0.25 search 15.39 output 0.67 total 22.15

However, the big, easy gains are gone. (Unless you have access to a machine with an AVX-512 popcount instruction!) What's left is an improved search method to reduce the number of divisions needed, an AVX-2 popcount implementation, memory prefetching, and a bunch of standard optimization techniques. I mentioned them all in my paper The chemfp project.

If you're the type of person who really enjoys this sort of problem - have at it! But if you want just want fast cheminformatics similarity search, why not try out chemfp?

Plus, source code licensing is available, if you really want to dig into the details of how chemfp works.

Faster BitBound ChEMBL search by using more C

Tue, 06 Oct 2020 12:00:00 GMT

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

This is part of a series of essays I started to write a week ago where I use a few different approaches to implement cheminformatics fingerprint similarity search.

Last Thursday I developed a program which used the BitBound algorithm to do an in-memory search of the uncompressed chembl_27.fps.gz fingerprint file from ChEMBL, containing RDKit Morgan fingerprints.

Yesterday I reorganized the in-memory data structure and developed a new Python/C extension function to speed up linear search by a factor of 7.

In today's essay, I'll merge the two so the BitBound algorithm uses the new extension function. In tomorrow's essay I'll speed it up further by reducing the number of popcount calculations.

If you want to jump right to the code:

Download and uncompress chembl_27.fps.gz.
If you don't already have it, download and run popc.py to generate the _popc library. This is unchanged from yesterday's essay.
Download chembl_bitbound_search_in_C.py.
Run python chembl_bitbound_search_in_C.py --times to do the default caffeine search and see the internal timings.
Run python chembl_bitbound_search_in_C.py --help to see the command-line help, then try your own queries, including with different thresholds.

Merge fingerprints in each bin into a single bytestring

In the BitBound essay from last week I sorted the fingerprints by popcount, resulting in 2049 bins. Each bin contains a list of (id, fingerprint) pairs. Here's the original code to create those bins:

# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]
for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1]) 
    target_popcount_bins[popcount].append(id_fp_pair)

In yesterday's essay, I reorganized the data to make it easier and faster to access from C. More specifically, I merged the Python list of fingerprint bytestrings into a single bytestring. This required putting the identifiers into one list and the fingerprints into its own list, which is then merged.

I'll do the same for the (id, fingerprint) pairs in the target popcount bins. One thing to watch out for is empty bins, since the zip(*values) technique returns a single empty list, rather than the two empty lists I need. But that's a simple check.

The new conversion code also needs to keep track of the largest number of fingerprints in any of the bins. This is to prevent possible array overflows in the C extension. (In yesterday's program that needed to be large enough to handle all of the target fingerprints, so binning ends up reducing the amount of space needed.)

The following code uses the same loading step from last week, then adds code (in bold) to transform the target_popcount_bins so each bin has a list of ids and the merged fingerprint bytes.

# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]
for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1])
    target_popcount_bins[popcount].append(id_fp_pair)

# Split the ids and fingerprints into parallel lists then merge the fingerprints
max_bin_size = 0
for popcount, id_fp_pairs in enumerate(target_popcount_bins):
    if id_fp_pairs:
        target_ids, target_fingerprints = list(zip(*id_fp_pairs))
        target_popcount_bins[popcount] = (target_ids, b"".join(target_fingerprints))
        max_bin_size = max(max_bin_size, len(target_ids))
    else:
        target_popcount_bins[popcount] = [[], b""]

Searching the bins

Last week's BitBound implementation figured out the number of bits in the query fingerprint, determined the min/max BitBound popcounts, then searched each fingerprint of each bin for a sufficiently similar fingerprint. Here's the key part of implementation, which I'll use as a starting point:

has_match = False 
# Determine the BitBound popcount limits
query_popcount = byte_popcount_256(query_fp)
if threshold > 0.0:
    min_popcount = math.ceil(query_popcount * fraction_threshold)
    max_popcount = math.floor(query_popcount / fraction_threshold)
    if max_popcount > 2048:
        max_popcount = 2048
            
# For each popcount in bound, check each of the targets.
for targets in target_popcount_bins[min_popcount:max_popcount+1]:
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")
        
# Print something in case there were no matches.
if not has_match:
    print(f"{query_id}\t*\t*")

In the new algorithm, the BitBounds are the same so that part of the search hasn't changed:

has_match = False 
# Determine the BitBound popcount limits
query_popcount = byte_popcount_256(query_fp)
if threshold > 0.0:
    min_popcount = math.ceil(query_popcount * fraction_threshold)
    max_popcount = math.floor(query_popcount / fraction_threshold)
    if max_popcount > 2048:
        max_popcount = 2048

The search code is somewhat differnet. Each bit has different contents, and I call threshold_tanimoto_search() for the performance critical code. That returns the number of sufficiently similar fingerprints, so I use the same code as yesterday to print the match information:

# For each popcount in bound, check each of the targets.
for target_ids, target_fingerprints in target_popcount_bins[min_popcount:max_popcount+1]:
    # Do the Tanimoto search.
    num_hits = threshold_tanimoto_search(
        query_fp, threshold,
        len(target_ids), target_fingerprints,
        hit_indices, hit_scores)
    if num_hits:
        # Print the hits to stdout, one per line.
        has_match = True
        for i in range(num_hits):
            hit_idx = hit_indices[i]
            score = hit_scores[i]
            target_id = target_ids[hit_idx]
            print(f"{query_id}\t{target_id}\t{score:.3f}")

Lastly, if none of the bins have any matches, print a no match output line (just like the original code):

# Print something in case there were no matches.
if not has_match:
    print(f"{query_id}\t*\t*")

Validation and Timings

I'll do my now-usual timing of nearly 1,000 structures from the Wikipedia Chemical Structure Explorer, and compare if the linear and BitBound in-memory searches give the same answers. I'll run the two programs like this:

% python chembl_in_memory_search_in_C.py --times --queries wikipedia_1000.smi | sort > old_sorted.out
load: 7.22 search: 65.06 #queries: 994 #/sec: 15.28
% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi | sort > new_sorted.out
load: 8.93 search: 16.58 #queries: 994 #/sec: 59.95
% cmp old_sorted.out new_sorted.out

You can see both programs gave the same output (once sorted, because the BitBound search changes the search order), and the new program is nearly 4x faster. In fact, the performance is roughly 40% that of chemfp's simsearch, so it's doing pretty well!

chemfp advertisement

There are still ways to make the BitBound implementation faster - I've implemented them in chemfp, which is available right now for commercial and no-cost academic licensing.

Now, a 2x speedup may not sound like much, but bear in mind that part of the reason this simple search code is so fast is because it's specialized for the 2048-bit RDKit Morgan2 fingerprints from ChEMBL. If you want to support other fingerprint lengths, you might start with a more general purpose Tanimoto calculation - only to find the general purpose one is also slower. If you want to support other fingerprint parameters, you'll need to write the code to handle those. K-nearest search using BitBound is not so simple, and neither is Tversky search. If you want to support Open Babel, or OEChem, then you'll need the code for that. If you want to improve fingerprint load time, then you'll need to implement that code. And if you want APIs to work with fingerprints and collections of fingerprints, you'll need to write that code too.

Or, you can try out chemfp. To install chemfp under the Chemfp Base License Agreement, use the following:

python -m pip install chemfp -i https://chemp.com/packages/

then send me an email to request an evaluation license key.

Faster in-memory ChEMBL search by using more C

Mon, 05 Oct 2020 12:00:00 GMT

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

This is part of a series of essays I started writing a week ago where I use a few different approaches to implement cheminformatics fingerprint similarity search.

In my previous essay, from last Friday, I developed a program to do a similarity search of the uncompressed chembl_27.fps.gz from ChEMBL containing RDKit Morgan fingerprints. That version used RDKit's BulkTanimotoSimilarity for the Tanimoto calculation. Profiling showed that 80% of the time was spent in the two lines of Python searching for scores at or above a given threshold. I concluded that I had to push more work into C.

That's what I'll do in today's essay, starting with the chembl_in_memory_search.py program from the initial in-memory ChEMBL search I developed earlier last week.

If you want to jump right to the code:

Download and uncompress chembl_27.fps.gz.
Download the updated popc.py.
Run python popc.py to (re)generate the _popc module.
Download chembl_in_memory_search_in_C.py.
Run python chembl_in_memory_search_in_C.py --times to do the default caffeine search and see the internal timings.
Run python chembl_in_memory_search_in_C.py --help to see the command-line help, then try your own queries, including with different thresholds.

Merge target fingerprints into a single byte string

Data representation and performance go hand-in-hand. The original in-memory search code (chembl_in_memory_search.py) stored each fingerprint as a Python byte string, called a function written in a C extension to compute the Tanimoto similarity between the query and target fingerprints, and got the score back. Last Friday's used RDKit's BulkTanimotoSimilarity() to process the Python list at the C++ level, but still required processing a Python list.

The approach I'll use today is to merge all of the target fingerprints into a single byte string. Each 2048-bit fingerprint is 256 bytes long, which makes it very easy to get a fingerprint by index. Storing the fingerprints in one string also saves memory, because each Python byte string has about 33 bytes of overhead, which is used to store information that Python needs internally:

>>> import sys
>>> sys.getsizeof(b"")
33
>>> sys.getsizeof(b"a")
34

That's an overhead of about 12% for each 256-byte fingerprint. If this is instead stored in a single byte string then we save over 60 MB across the 1.9 million fingerprints in ChEMBL.

Load the target fingerprints into a byte string, and get the ids

The existing fingerprint reader iterates over the (id, fingerprint byte string) pairs. I'll separate that into a list of ids and a parallel list of byte strings using the same zip(*values) technique I described on Friday, then merge the list of byte strings into a single byte string (the new part is in bold):

# Load targets as (id, fingerprint) pairs into a list of
# target ids and a block of target fingerprints
t1 = time.time()
target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))
num_targets = len(target_fingerprints)
target_fingerprints = b"".join(target_fingerprints)    # Merge into a single byte string
t2 = time.time()
load_time = t2-t1

Put results in a hit list

In the original chembl_in_memory_search.py program I printed the fingerprints as soon as I found one with a Tanimoto score at or above the threshold. I could let the C code print the results to stdout, but I'm going to keep that part of the code in Python. That means I need some way to pass the hit information from C back to Python.

In this series of ChEMBL fingerprint search implementations, I'm trying to keep things simple. The Tanimoto calculation, for example, is hard-coded for 2048-bit fingerprints, and the implementation isn't concerned about hardware issues like word alignment.

I'll keep things simple with the hit list as well. If there are N target fingerprints then hit list cannot be more than N. I can allocate space for N hits, have the C function fill in the values it finds, and return the number of hits found.

At the C level, the hitlist will be two parallel arrays:

int *hit_indices;   /* the indices of a matching target fingerprint */
double *hit_scores; /* the corresponding score (must be >= the search threshold) */

I need some way to allocate these arrays in a way that C code can access them. I've been using the cffi package so the program popc.py can convert C functions into a Python module named _popc. The cffi-generated package includes functions to help generate those arrays from Python:

from _popc import ffi
num_targets = ... the number of fingerprints...
hit_indices = ffi.new(f"int[]", num_targets)
hit_scores = ffi.new(f"double[]", num_targets)

threshold_tanimoto_search()

I then modified popc.py to include the function threshold_tanimoto_search(). It re-uses the byte_tanimoto_256() function described last week. I think the C implementation is a nearly direct mapping of the Python implementation, except that I store the hits in the two arrays rather than print them out. I also think it's understandable enough that I won't provide extra commentary.

/* Find all fingerprints in a sequential block of 2048-bit fingerprints */
/* with a Tanimoto similarity of at least 'threshold' to the query fingerprint. */
/* Return the number of hits. Hit results are in *hit_indices and *hit_scores. */
/* Assumes there is enough space in hit_indices and hit_scores! */
int
threshold_tanimoto_search(unsigned char *query_fp, double threshold,
                          int num_targets, unsigned char *target_fps,
                          int *hit_indices, double *hit_scores) {
    int target_idx, num_hits = 0;
    double score;
    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        score = byte_tanimoto_256(query_fp, target_fps + target_idx*256);
        if (score >= threshold) {
            hit_indices[num_hits] = target_idx;
            hit_scores[num_hits] = score;
            num_hits++;
        }
    }
    return num_hits;
}

(The final popc.py includes the configuration needed to have cffi include the extension code so the new C function can be called from Python. See the program for details.)

I've updated popc.py, which means I need to re-run it in order to generate the updated _popc extension.

% python popc.py
generating ./_popc.c
the current directory is '/Users/dalke/demo'
running build_ext
building '_popc' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g
 -fwrapv -O3 -Wall -I/Users/dalke/venvs/py38-2020-8/include
 -I/Users/dalke/local-3.8/include/python3.8 -c _popc.c -o ./_popc.o -mpopcnt
gcc -bundle -undefined dynamic_lookup ./_popc.o -o ./_popc.cpython-38-darwin.so

Updated ChEMBL similarity search

The function call at the Python level looks like this:

num_hits = threshold_tanimoto_search(
               query_fp, threshold,
               num_targets, target_fingerprints,
               hit_indices, hit_scores)

The parameters are:

query_fp - the query fingerprint as a 256-byte bytestring;
threshold - the minimum similarity threshold as a Python float betweeen 0.0 and 1.0;
num_targets - the number of target fingerprints;
target_fingerprints - the num_targets target fingerprints, each 256-bytes long, concatenated into a single byte string;
hit_indices - an int* array which will contain the indices of the fingerprints which are sufficiently similar to the query fingerprint;
hit_score - a double* array containing the corresponding score;

and the function returns an integer, which is the number of hits found.

That pushes the entire search into C, so what's needed is a driver in Python to do a search for each query and report the hits:

    # Do a threshold search for each query.
    t1 = time.time()
    num_queries = 0
    for query_id, query_fp in query_reader:
        num_queries += 1
        
        # Do the Tanimoto search.
        num_hits = threshold_tanimoto_search(
            query_fp, threshold,
            num_targets, target_fingerprints,
            hit_indices, hit_scores)
        if num_hits:
            # Print the hits to stdout, one per line.
            for i in range(num_hits):
                hit_idx = hit_indices[i]
                score = hit_scores[i]
                target_id = target_ids[hit_idx]
                print(f"{query_id}\t{target_id}\t{score:.3f}")
        else:
            # Print something in case there were no matches.
            print(f"{query_id}\t*\t*")

Does it work?

The final version is available as chembl_in_memory_search_in_C.py.

The C-based similarity search should give identical results to the original Python-based similarity search. Does it? As queries I'll use the first 1,000 structures from the SMILES data I downloaded from the Wikipedia Chemical Structure Explorer. (An earlier essay describes how I converted that data into a SMILES file.) The following omits some error and warning messages.)

% python chembl_in_memory_search.py --queries wikipedia_1000.smi --times > old_1000.out
load: 5.79 search: 530.20 #queries: 994 #/sec: 1.87
% python chembl_in_memory_search_in_C.py--queries wikipedia_1000.smi --times > new_1000.out
load: 7.19 search: 70.40 #queries: 994 #/sec: 14.12
% cmp old_1000.out new_1000.out

This means the two output files are byte-for-byte identical.

Now, compare the two timings. The C implementation is over 7x faster! (Take the numbers with a grain of salt; I ran the program on my laptop, with a video playing in the background when I ran the Python-based implementation.)

Not bad for under 20 lines of C code.

chemfp advertising

As a reminder, one of my goals in this series is to show that it's easy to implement fingerprint similarity search, but hard to implement high-performance similarity search. I hope these details will help convince people to purchase a chemfp license.

To give something concrete, the above C implementation is much faster than the Python version, but chemfp's simsearch is another 10x or so faster still, using the same input files:

% simsearch --queries wikipedia_1000.smi chembl_27.fps --threshold 0.7 --times > chemfp_1000.out
open 5.86 read 0.26 search 6.87 output 0.02 total 13.01

The output isn't in the same format so can't easily be compared. A (long) one-liner to transform the output is:

grep -v '^#' chemfp_1000.out | \
  awk 'BEGIN{FS="\t"} \
  $1==0 {printf("%s\t*\t*\n", $2)} \
  $1>0 {for(i=3; i<NF; i+=2) {printf("%s\t%s\t%.3f\n", $2, $i, $(i+1))}}' | \
  sort > chembl_1000_sorted.out

A comparison shows 4 mismatches where the last digit was off by 1, all because of the extra rounding step in the one-liner needed to switch from simsearch's 5-digit output scores to the 3 decimal places I've been using in the simple programs.

More information about chemfp can be found from the chemfp home page.

Using RDKit BulkTanimotoSimilarity

Fri, 02 Oct 2020 12:00:00 GMT

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

In the first essay of this series I developed a program to do a Tanimoto similarity search of chembl_27.fps. I evaluated three different data types: an RDKit ExplicitBitVect, a GMP mpz integer, and a byte string. I found that the byte string was fastest, at about 5 seconds per search. All three programs worked by parsing and testing each FPS record in sequence. A scan search like this requires essentially constant memory overhead no matter the size of the FPS file.

In the third essay of this series I developed an in-memory Tanimoto search program with two stages. The first stage read all of the fingerprints into memory, and the second did the linear Tanimoto calculation and tests of all of the fingerprints. This program took 4-5 seconds to load the fingerprints, and about 0.5 seconds to search. (Yesterday's essay added the BitBound algorithm, which prunes some of the search space and is increasingly effective as the similarity threshold increases.)

In this essay I'll revisit the choice of data type and try an in-memory search using RDKit's ExplicitBitVect() and BulkTanimotoSimilarity(). I find that the load time is about 32 seconds and the search time is about 0.6 seconds per query.

If you really want high-performance cheminformatics fingerprint similarity search, try chemfp.

Loading ExplicitBitVect fingerprints

On Wednesday I ended up with chembl_in_memory_search.py. It is not hard to change the code so it uses an ExplicitBitVect instead of a byte string for the fingerprints. The result is at chembl_bulk_tanimoto.py.

The two differences are:

get_query_fp() returns the ExplicitBitVect() directly from GetMorganFingerprintAsBitVect() rather than convert it to a byte string.
read_fingerprints() uses DataStructs.CreateFromFPSText() to parse the hex-encoded fingerprints into an ExplicitBitVect(), instead of using bytes.fromhex() to create a byte string.

Load the data sets

I'm going to use BulkTanimotoSimilarity() to compute the similarity scores for each fingerprint. That takes an RDKit fingerprint as the query and a list (or tuple) of RDKit fingerprints as the targets. However, the reader returns iterator of (id, fingerprint) pairs. I'll need to convert that into two parallel lists; one with ids and one with fingerprints.

The obvious way to do that is something like:

target_ids = []
target_fingerprints = []
for target_id, target_fingerprints in fingerprint_reader:
    target_ids.append(target_id)
target_fingerprints.append(target_fingerprint)

However, there's an easy but non-obvious way to do that, using zip(*values).

zip(*values)

The zip() built-in function lets you iterate over zero or more lists in parallel. For example, if you have a list of identifiers and an associated list of SMILES then you might print them out together like this:

>>> id_list = ["CHEBI:18153", "CHEBI:18407", "CHEBI:27518", "CHEBI:33602"]
>>> smiles_list = ["C=C", "C#N", "C#C", "BBB"]
>>> for id, smiles in zip(id_list, smiles_list):
...     print(f"{id} is {smiles}")
...
CHEBI:18153 is C=C
CHEBI:18407 is C#N
CHEBI:27518 is C#C
CHEBI:33602 is BBB

The construct f(*args) is special syntax to unpack an argument list. It uses args[0] as the first argument, args[1] as the second, and so on. For example, suppose data contains a list of elements and I want to print the list out, with one element per line:

>>> data = [('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB')]
>>> print(*data, sep="\n")
('CHEBI:18153', 'C=C')
('CHEBI:18407', 'C#N')
('CHEBI:27518', 'C#C')
('CHEBI:33602', 'BBB')

The print(*data, sep="\n") call is identical to print(data[0], data[1], data[2], data[3], sep="\n"), which prints each element, with a newline between each one (and the default end="\n" printing the final newline.)

The zip() and the f(*args) work together to give the following:

>>> data = [('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB')]
>>> id_list, smiles_list = list(zip(*data))
>>> id_list
('CHEBI:18153', 'CHEBI:18407', 'CHEBI:27518', 'CHEBI:33602')
>>> smiles_list
('C=C', 'C#N', 'C#C', 'BBB')

The step-by-step transformations are:

list(zip(*data)) is the same as:
list(zip(data[0], data[1], data[2], data[3])) is the same as:
list(zip(('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB'))) is the same as:
[('CHEBI:18153', 'CHEBI:18407', 'CHEBI:27518', 'CHEBI:33602'), ('C=C', 'C#N', 'C#C', 'BBB')] (because zip yields all of first elements, then all of the second elements, and there are only two elements in each parameter).

As a result, id_list contains all of the first terms in the data elements, and smiles_list contains all of the second terms in the data elements.

For some people it may help to think of zip(*args) as working together to generate the transpose of args.

Actually loading the data sets

All that discussion was there to show how the old code, which read the ids and fingerprints into a list of (id, fingerprint) pairs using the following:

targets = list(read_fingerprints(open(args.targets)))

was transformed into reading a list of ids and a list of fingerprints by using the following:

target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))

Easy to write. Not obvious to understand the first few times you come across that idiom!

BulkTanimotoSimilarity()

RDKit's BulkTanimotoSimilarity() takes a query fingerprint and a list of target fingerprints, and returns a list of scores, one for each target fingerprint. It's straight-forward to adapt the original in-memory search program to use this bulk function:

# Compute the score with each of the targets.
scores = DataStructs.BulkTanimotoSimilarity(query_fp, target_fingerprints)
# Find the records with a high enough similarity.
for i, score in enumerate(scores):
    if score >= threshold:
        # Need to get the corresponding id
        target_id = target_ids[i]
        has_match = True
        print(f"{query_id}\t{target_id}\t{score:.3f}")

Since this function computes all of the Tanimoto scores, you could easily modify this code to sort it or use a priority queue and find the nearest k neighbors.

Also, RDKit implements many Bulk* functions (BulkAllBitSimilarity, BulkAsymmetricSimilarity, BulkBraunBlanquetSimilarity, BulkCosineSimilarity, BulkDiceSimilarity, BulkKulczynskiSimilarity, BulkMcConnaugheySimilarity, BulkOnBitSimilarity, BulkRogotGoldbergSimilarity, BulkRusselSimilarity, BulkSokalSimilarity, BulkTanimotoSimilarity, BulkTverskySimilarity) with the same function call API, so it's easy to modify the code to use one of these alterate comparison methods.

Timings

The result is chembl_bulk_tanimoto.py. I kept all of the timing code I developed for Wednesday's chembl_in_memory_search.py, which I'll enable so I can compare the two:

% python chembl_in_memory_search.py --queries wikipedia_100.smi --times > old.out
load: 5.27 search: 52.63 #queries: 100 #/sec: 1.90
% python chembl_bulk_tanimoto.py --queries wikipedia_100.smi --times > new.out
load: 32.92 search: 60.22 #queries: 100 #/sec: 1.66
% cmp old.out new.out

It's always a relief to see that a cross-comparison gives the same results!

I'm not surprised the RDKit search time is a bit slower than the raw byte string search code beause the RDKit implementation does more work, for examples, the BulkTanimotoSimilarity() function accepts ExplicitBitVect() and SparseBitVect() fingerprint types (the latter is for sparse fingerprints) so there's some overhead figuring out the right way to do the computation, and it double-checks that the fingerprints are of the same length. While the programs I wrote will only work for 2048-bit fingerprints and may cause Python itself to crash if you don't use it correctly.

Are there obvious ways to speed things up?

The same BitBound algorithm in yesterday's essay can be applied to improve the BulkTanimotoSimilarity() performance. However, most of the time is not spent in computing fingerprints! I used kernprof with the first 100 Wikipedia SMILES to profile the program and found almost 70% of the time is spent iterating through the scores:

         .... lines omitted ...
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
         .... lines omitted ...
   115         1   32567959.0 32567959.0      9.7      target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))
         .... lines omitted ...
   129                                                   # Compute the score with each of the targets.
   130       100   40625764.0 406257.6     12.1          scores = DataStructs.BulkTanimotoSimilarity(query_fp, target_fingerprints)
   131                                                   # Find the records with a high enough similarity.
   132 194141100  133618300.0      0.7     39.9          for i, score in enumerate(scores):
   133 194141000  127750939.0      0.7     38.2              if score >= threshold:
         .... lines omitted ...

The only way to make this significantly faster is push more work into C.

If you are interested in improving the performance of the parts of RDKit I covered here, I suggest you look into speeding up DataStructs.CreateFromFPSText(), which seems to take rather longer than I think it should.

Simple BitBound ChEMBL similarity search

Thu, 01 Oct 2020 12:00:00 GMT

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

In yesterday's essay I changed the scan-based Tanimoto search to an in-memory search and showed that after a start-up cost of about 5 seconds I was able to do about 2 searches per second of the 1.9 million ChEMBL fingerprints.

I ended by pointing out how chemfp was over 100x faster.

How is chemfp so much faster? One clear reason is the search code is all implemented in C. A comparison test like "score >= threshold" can be done in one CPU cycle, but the equivalent in Python takes likely thousands of CPU cycles. Another is Python's function call overhead - I suspect it takes longer to set up the call to byte_tanimoto_256() and convert the result to Python than it does to compute the actual Tanimoto.

The chemfp paper covers all of the techniques I used, and reviews the other ones I know about. In this essay I will implement the BitBound algorithm, which is a simple technique that can often reduce the number of fingerprints to test.

If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:

Download chembl_27.fps.gz.
Download the updated popc.py.
Run python popc.py to generate the _popc module.
Download chembl_bitbound_search.py.
Run python chembl_bitbound_search.py --times to do the default caffeine search and see the internal timings.
Run python chembl_bitbound_search.py --help to see the command-line help, then try your own queries, including with different thresholds.

The new program averages about 0.3 seconds per query with the default threshold of 0.7, and with an initial load time of about 8 seconds.

Update: A week later I implemented faster versions, first by reorganizing the data so I can do more work in C instead of Python, and second by replacing full byte_tanimoto_256() calculation with a pre-computed query and target fingerprints and a byte_intersect_256() calculation.

BitBound algorithm

Swamidass and Baldi pointed out that if a query fingerprint A has a bits set, and the goal is to find target fingerprints B_i which have a Tanimoto score at least T similar to the query, then we can set bounds on the number of bits b in the target fingerprints. More specifically: a T ≤ b ≤ a / T.

(They generalize this calculation to Tversky similarity, and also describe an search method to improve a nearest neighbor search.)

What this means is, if the number of bits in the target fingerprints can be pre-computed then those fingerprints which are out-of-bounds for a given query don't need to be considered.

There are few ways to implement this. One is to add the pre-computed popcount to each record. However, this still means testing every record. Another is to sort the records into different bins, one for each popcount. Then the search need only test the bins which might have a similarity match. I'll implement this seecond approach.

Compute the fingerprint popcount

I don't have a way to compute the popcount of each of the target fingerprints. I decided to use byte strings for them and used cffi to write popc.py, a program to generates the _popc module with the Tanimoto similarity and cosine similarity functions, hard-coded for 2048-bit fingerprints. I'll extend that module so it implements byte_popcount_256(), to compute the popcount of a 256-byte byte string.

The implementation is a straight-forward simplification of the similarity functions, so I'll present it here without further description (see popc.py for the full implementation):

static int
byte_popcount_256(const unsigned char *fp) {
    int num_words = 2048 / 64;
    int popcnt = 0;

    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp_64 = (uint64_t *) fp;

    for (int i=0; i<num_words; i++) {
        popcnt += __builtin_popcountll(fp_64[i]);
    }
    return popcnt;
}

Pre-compute the target fingerprints and put into bins

Just like in yesterday's essay, I'll first read the fingerprints into memory and then search them. However, this time I'll organize the fingerprints by popcount. There are 2049 possible values (from 0 bits set to all 2048 bits set), so I'll create a list of 2049 empty lists. For each fingerprint record, I'll compute the fingerprint popcount and append it to the appropriate list. Here's what that code looks like:

# Load targets into a popcount bins, ordered by the fingerprint popcount.
# Each bin contains a list of (id, fingerprint) pairs.
t1 = time.time()

# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]

for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1])
    target_popcount_bins[popcount].append(id_fp_pair)

t2 = time.time()
load_time = t2-t1

Determine the bounds

It seems easy to compute the bit bounds with something like:

query_popcount = byte_popcount_256(query_fp)
min_popcount = query_popcount * threshold
max_popcount = query_popcount / threshold

and then use the integer popcount values between min_popcount and max_popcount. The lowest popcount value is the integer greater than or equal to min_popcount and the highest popcount value is the integer no greater than max_popcount. You might think this would work:

# This is wrong!
import math
query_popcount = byte_popcount_256(query_fp)
min_popcount = math.ceil(query_popcount * threshold)
max_popcount = math.floor(query_popcount / threshold)
for popcount in range(min_popcount, max_popcount+1):
   ...

In practice, this is tricky because the calculations will use IEEE 754 doubles. Some operations which should ideally result in an integer value ... don't. Here is an example where the min_popcount would be wrong:

>>> threshold = 0.55
>>> query_popcount = 1580
>>> threshold * query_popcount
869.0000000000001
>>> import math
>>> math.ceil(threshold * query_popcount)
870

and here's an example where the max_popcount would be wrong:

>>> threshold = 0.55
>>> query_popcount = 396
>>> query_popcount / threshold
719.9999999999999
>>> math.floor(query_popcount / threshold)
719

Solution #1: change by epsilon

One solution, which will work for normal threshold values, is to add or subtract a small value to ratio before doing the floor or ceiling call.

>>> import math
>>> EPSILON = 1e-10
>>> threshold = 0.55
>>> threshold * 1580
869.0000000000001
>>> math.ceil(threshold * 1580 - EPSILON)
869
>>> 396 / threshold
719.9999999999999
>>> math.floor(396 / threshold)
719
>>> math.floor(396 / threshold + EPSILON)
720

This works because 1) it's chemically unreasonable to specify a threshold with more than 3 significant digits, and 2) the differences between fingerprint scores cannot be more than 1/(2048²).

Solution #2: use rationals

I think a better solution is to do the calculation using rationals, so the operations can be done with integers. (This is especially important if you want to handle the Tversky BitBounds; in chemfp I gave up trying to get IEEE 754 math to work correctly and ended up multiplying alpha and beta values by 10,000 to work with integers.)

Since I'm using Python, I can use Python's built-in fractions module, which implements a rational data type:

>>> import math, fractions
>>> threshold = fractions.Fraction("0.55")
>>> math.ceil(threshold * 1580)
869
>>> math.floor(396 / threshold)
720

(I actually used the fractions module to find the examples where the simple application of IEEE 7554 doubles cause problems.)

Only a few minor changes are needed to use a fraction for the threshold, starting with the argparse --threshold parameter:

parser.add_argument("--threshold", "-t", type=fractions.Fraction, default=fractions.Fraction("0.7"),
                        help="minimum threshold (default: 0.7)")

For performance reasons, I keep track of the fraction threshold (only used to find the popcount bonds) and the floating point value (used when checking the Tanimoto score):

fraction_threshold = args.threshold
threshold = float(fraction_threshold)

The popcount bounds calculation, when using a fraction threshold, is the straight-forward, expected one:

min_popcount = 0
max_popcount = 2048
for query_id, query_fp in query_reader:
    ...
    query_popcount = byte_popcount_256(query_fp)
    if threshold > 0.0:
        min_popcount = math.ceil(query_popcount * fraction_threshold)
        max_popcount = math.floor(query_popcount / fraction_threshold)
        if max_popcount > 2048:
            max_popcount = 2048

The check for threshold > 0.0 is to prevent divide-by-zero errors. Of course, if the threshold is 0.0 then everything will match, and there will be a lot of overhead just printing the results.

Search the appropriate bins

Now that I have the bounds I can get the bins that are in range, and for each bin check each of the target fingerprints:

for targets in target_popcount_bins[min_popcount:max_popcount+1]:
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")

Does it work?

The final code is in chembl_bitbound_search.py. But does it work?

I used chembl_in_memory_search.py from yesterday's essay to process the first 1,000 entries from the SMILES downloaded from the Wikipedia Chemical Structure Explorer, then processed the same entries with chembl_bitbound_search.py, then compared the two. The two programs have a different search order so the outputs cannot be compared directly. Instead, I compared the sorted outputs to each other (and excluded the warning and error messages in the following):

% python chembl_in_memory_search.py --times --queries wikipedia_1000.smi | sort > old.txt
load: 5.27 search: 491.73 #queries: 994 #/sec: 2.02
% python chembl_bitbound_search.py --times --queries wikipedia_1000.smi | sort > new.txt
load: 8.10 search: 297.42 #queries: 994 #/sec: 3.34
% cmp old.txt new.txt

They matched, and it was almost 40% faster. So that's a good first test. I'll also run with a higher threshold to see if the overall performance changes:

% python chembl_bitbound_search.py --times --queries wikipedia_1000.smi --threshold 0.9 & /dev/null
load: 5.53 search: 81.74 #queries: 994 #/sec: 12.16

Nice speedup!

A more sensitive test would be to construct fingerprints with bit patterns that trigger edge cases in the code, and to add instrumentation to ensure the BitBound calculation is not accidentally too wide. That's too much for this essay.

Can improve the Tanimoto calculation performance

If the target fingerprints are sorted by popcount then we know that all fingerprint in a given bin have the same count. We also know the number of bits in the query fingerprint. This can be used to improve the search performance by reducing the amount of work to compute the Tanimoto.

The heart of the current code, in byte_tanimoto_256(), is:

for (int i=0; i<num_words; i++) {
    intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
    union_popcount += __builtin_popcountll(fp1_64[i] | fp2_64[i]);
}
return ((double) intersect_popcount) / union_popcount;

If the query and target popcounts are known then this reduces to:

for (int i=0; i<num_words; i++) {
    intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
}
return ((double) intersect_popcount) /
        (query_popcount + target_popcount - intersect_popcount);

I'll leave that as an exercise for the student.

Can skip the Tanimoto calculation performance

Another possible speedup is to avoid the division in the Tanimoto calculation altogether. If you work out the math you'll find that the target fingerprint will pass the test if the intersection popcount C is:

C ≥ threshold * (query_popcount + target_popcount) / (1 + threshold)

This calculation is another place where it's easier to calculate using rationals than with IEEE 754 doubles.

chemfp advertisement

I've been trying to make the point that while it's easy to implement a similarity search program, there are a lot of complicated details to make a fast similarity search program.

You could go ahead and spend the days or weeks to implement these features yourself, or you could install and test chemfp chemfp (under the Base License Agreement) by doing:

python -m pip install chemfp -i https://chemfp.com/packages/

The Base License Agreement does not allow you to create FPB files or do an in-memory search with more than 50,000 fingerprints. If you are interested in those features, take a look at the other licensing options then contact me to ask for an evaluation license key.

Simple in-memory ChEMBL similarity search

Wed, 30 Sep 2020 12:00:00 GMT

This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.

In the previous two essays I showed how to search chembl_27.fps to find records with similar fingerprints to a query fingerprint, then how to implement a nearest-neighbor search and replace Tanimoto similarity with cosine similarity. The final program took about 5 seconds.

In this essay I'll show how to increase the search performance by shifting more of the work to a load step. This sort of precomputation can be useful if the load step is done once, with the extra overhead shared over multiple searches.

If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:

Download chembl_27.fps.gz.
Download popc.py.
Run python popc.py to generate the _popc module.
Download chembl_in_memory_search.py.
Run python chembl_in_memory_search.py --times to do the default caffeine search and see the internal timings.
Run python chembl_in_memory_search.py --help to see the command-line help, then try your own queries, including from a SMILES or FPS file.

The new program averages about 0.5 seconds per query, with an initial 4-5 second load time.

In tomorrow's essay, I'll describe how to implement the BitBound algorithm, which speeds up most searches by reducing the number of fingerprints to test. I'll revisit this in-memory search algorithm next week when I implement more of the search code in C.

Measuring FPS parsing overhead

I profiled the program and found that most of the time was spent in parsing the FPS file, not in computing the Tanimoto. To profile, I wrapped the entire program into a function (the profiler I'm using works on functions), used the @profile decorator, which tells kernprof which function(s) to profile, profiled the result with:

% kernprof -l ssimsearch_bytes.py

then viewed the profiler results with

% python -m line_profiler ssimsearch_bytes.py.lprof
Timer unit: 1e-06 s

Total time: 13.0334 s
File: ssimsearch_bytes.py
Function: main at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def main():
     3         1       6859.0   6859.0      0.1      from _popc.lib import byte_tanimoto_256
     4
     5         1     224418.0 224418.0      1.7      from rdkit import Chem
     6         1          6.0      6.0      0.0      from rdkit import DataStructs
     7         1       5629.0   5629.0      0.0      from rdkit.Chem import rdMolDescriptors
     8
     9         1          1.0      1.0      0.0      smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    10         1          1.0      1.0      0.0      threshold = 0.7
    11
    12         1      11504.0  11504.0      0.1      mol = Chem.MolFromSmiles(smiles)
    13         1          2.0      2.0      0.0      if mol is None:
    14                                                   raise SystemExit(f"Cannot parse SMILES {smiles!r}")
    15
    16         2         98.0     49.0      0.0      rd_query_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    17         1          1.0      1.0      0.0          mol, radius=2, nBits=2048, useChirality=0,
    18         1          0.0      0.0      0.0          useBondTypes=1, useFeatures=0)
    19         1         15.0     15.0      0.0      query_fp = DataStructs.BitVectToBinaryText(rd_query_fp)
    20
    21
    22         1         97.0     97.0      0.0      infile = open("chembl_27.fps")
    23         7          8.0      1.1      0.0      for i in range(6):
    24         6         32.0      5.3      0.0          line = infile.readline()
    25         1          3.0      3.0      0.0      assert line.startswith("#date")
    26
    27   1941411    3356477.0      1.7     25.8      for line in infile:
    28   1941410    3207597.0      1.7     24.6          target_hex_fp, target_id = line.split()
    29   1941410    2605370.0      1.3     20.0          target_fp = bytes.fromhex(target_hex_fp)
    30
    31   1941410    2071065.0      1.1     15.9          score = byte_tanimoto_256(query_fp, target_fp)
    32   1941410    1544099.0      0.8     11.8          if score >= threshold:
    33         2         98.0     49.0      0.0              print(f"{target_id}\t{score}")

Lines 27-29 are all part of parsing the FPS file, and they take 70% of the search time. The actual Tanimoto calculation (line 31) and score test (line 32) take together about 25% of the time.

Multiple queries

What if you want to search for multiple queries? One implementation might do the full search of the first query, then of the second, and so on. This would take about 5*N seconds.

The above profile suggests that if you could load the fingerprint data into a more easily processed form, where the pre-processing was done once, then the overall performance would be faster. This may be useful in a web service where extra work can be done during startup in order to improve the per-request performance.

I changed the main loop (lines 27-33 in the above) so there are two stages. Stage 1 loads all of the fingerprints into memory, and stage 2 searches those fingerprints, and added timing measurements for each stage:

    import time
    t1 = time.time()
    targets = []
    for line in infile:
        target_hex_fp, target_id = line.split()
        target_fp = bytes.fromhex(target_hex_fp)
        targets.append( (target_id, target_fp) )

    t2 = time.time()
    for target_id, target_fp in targets:
        score = byte_tanimoto(query_fp, target_fp)
        if score >= threshold:
            print(f"{target_id}\t{score}")

This reports:

Load: 4.64 Search: 0.79

What multiple queries?

Previously I supported a hard-coded SMILES string, or a user-defined SMILES string on the command-line. If there are going to be multiple queries, what are they, and where do they come from?

I decided I was going to support a SMILES file or FPS file as input, based on if the given query filename ends with .smi or not. I already have a way to read_fingerprints() from an FPS file, and I have a way to compute the RDKit Morgan fingerprint given a SMILES, so all I need is a way to read a SMILES file that yields the same (id, fingerprint) pairs as read_fingerprints().

SMILES file

There are two main types of SMILES files: Daylight-like and CSV-like. A Daylight SMILES file contain one record per line. Each line contains a SMILES followed by a whitespace character followed by a molecule title/id, up to the rest of the line.

A CSV-like SMILES file contains a SMILES followed by a space or tab seperator character, followed by an molecule title/id, followed optionally by successive seperator characters and additional fields. In addition, the first line may contain column headers.

I'm going to support Daylight-like SMILES file.

SMILES file parser

It's pretty easy to read either SMILES file variant, though more complex if you want full configurability. In short, for every line in the file, split on the first whitespace. The first field is the SMILES, the second is the identifier. Convert the SMILES to a fingerprint, or skip it if the SMILES could not be processed. That give the needed (id, fingerprint) pairs:

# Read a SMILES file, convert the SMILES to a fingerprint,
# and yield (id, fingerprint) pairs for each parsable record.
def read_structure_fingerprints_from_smiles_file(infile):
    for lineno, line in enumerate(infile, 1):
        # Expecting the format: <SMILES> + whitespace + <title> + "\n"
        smiles, id = line[:-1].split(None, 1)
        
        try:
            # Get the corresponding RDKit Morgan fingerprint
            fp = get_query_fp(smiles)
        except ValueError as err:
            # Skip records which can't be parsed
            sys.stderr.write(f"ERROR: {err}: Skipping line {lineno}\n")
            continue
        
        yield id, fp

Reporting search results

Since there may be multiple queries, each with 0 or more hits, I need some way to figure out which query matches which targets. I decided to use a simple three-column tab-separated format with the query id in the first column the target id in the second column, and the score in the third, like the following:

Amygdalin	CHEMBL2303964	0.796
Amygdalin	CHEMBL2303958	0.700
Boron nitride	*	*
Bicarbonate	CHEMBL1353	0.875
Bicarbonate	CHEMBL2106975	0.875
Benzoic acid	CHEMBL541	1.000

A problem with this format is there's no obvious way to say that a query has no hits. Perhaps those should be omitted from the output? What I decided was to have one output line for that case, with * for the target id and score. You can see that in the Boron nitride line, above.

A search implementation

Assuming targets is a list of (target id, target fingerprint) pairs, and query_reader is an iterator of (query id, query fingerprint) pairs (either from read_fingerprints() or read_structure_fingerprints_from_smiles_file()) then the following shows the structure of the main search loop and reporting.

for query_id, query_fp in query_reader:
    has_match = False
    
    # Check each of the targets
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")
    
    # Print something in case there were no matches.
    if not has_match:
        print(f"{query_id}\t*\t*")

(The final code includes a count of the number of queries, which is used for reporting.)

Command-line options with argparse

There are several Python packages for parsing command-line arguments. I use argparse because it's part of Python's standard library. Many are fond of click as a third-party package.

I decided I wanted to allow a single query SMILES input or input from either a SMILES file or an FPS file. I also wanted to be able to change the threshold, specify the location of the chembl_27.fps file (in case it isn't in the current directory), and support a --times option to report the load and search times.

Here's what the --help from the program looks like:

usage: chembl_in_memory_search.py [-h] [--threshold THRESHOLD]
          [--query SMILES | --queries FILENAME] [--times] [FILENAME]
in-memory ChEMBL fingerprint search

positional arguments:
  FILENAME              location of chembl_27.fps

optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD, -t THRESHOLD
                        minimum threshold (default: 0.7)
  --query SMILES        query smiles
  --queries FILENAME    input query file (SMILES or FPS)
  --times               print timing information to stderr

This sort of configuration is pretty easy to do with argparse. The hardest thing is that the single-query SMILES search and the file-based search are exclusive, meaning they can't both be specified at the same time.

parser = argparse.ArgumentParser(
    description = "in-memory ChEMBL fingerprint search")

parser.add_argument("--threshold", "-t", type=float, default=0.7,
                        help="minimum threshold (default: 0.7)")

g = parser.add_mutually_exclusive_group()
g.add_argument("--query", metavar="SMILES",
                   help="query smiles")
g.add_argument("--queries", metavar="FILENAME",
                   help="input query file (SMILES or FPS)")

parser.add_argument("--times", action="store_true",
                        help="print timing information to stderr")

parser.add_argument("targets", metavar="FILENAME", default="chembl_27.fps", nargs="?",
                        help="location of chembl_27.fps")

An argparse.ArgumentParser() instance (like parser in the above) is usually used something like:

args = parser.parse_args()
print("Threshold:", args.threshold)

That is, parse_args() parses the command line and returns an argparse.Namespace() instance with the parsed command-line parameters accessible as attributes.

Setting up the query reader

The last tricky part to explain is how to handle both --query SMILES and --queries FILENAME. I decided to have a function which returns a query reader, that is, something which can be iterated over to get the successive pairs of (query id, query fingerprint). I have all the parts - I just need to hook them together correctly. And, if neither option is specified, I'll have the program search for caffeine, with a warning that that's what it will do.

# Helper function to figure out how the queries are specified
def open_query_reader(args):
    if args.query is None:
        if args.queries is None:
            # Default search
            sys.stderr.write("No query specified, using caffeine.\n")
            query_fp = get_query_fp("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
            return [("caffeine", query_fp)]
        
        elif args.queries.endswith(".smi"):
            # If --queries ends with ".smi"
            return read_structure_fingerprints_from_smiles_file(open(args.queries))
        else:
            # Otherwise, assume it's an FPS file
            return read_fingerprints(open(args.queries))

    else:
        # User specified a --query SMILES
        try:
            query_fp = get_query_fp(args.query)
        except ValueError as err:
            raise SystemExit("Cannot parse --query SMILES {args.query!r}")
        return [("Query", query_fp)]

chembl_in_memory_search.py

I've walked through all of the the important parts, though there are a few missing connector pieces. Rather than show it here, you can download chembl_in_memory_search.py for yourself. You'll need to download and run popc.py in order to generate the _popc Python module. (See Monday's essay for more details.)

Search Wikipedia structures

The fun Wikipedia Chemical Structure Explorer is an interactive tool to explore the chemical structures scrapped from Wikipedia. The structures can be downloaded as a SMILES file. It currently contains 17,846 structures, but it's NOT a SMILES file because the id and SMILES fields are swapped. I'll extract the first 10 to use for my timings and use awk to swap the two fields (the -F\t configures it to use the tab character as the field seperator):

% head -10 smiles.txt | awk '-F\t' '{print $2, $1}' > wikipedia_10.smi

Many if not most of the Wikipedia structures are in ChEMBL, along with similar structures, so I'll use a high similarity threshold in order to limit the number of lines in the output:

% python chembl_in_memory_search.py --queries wikipedia_10.smi --threshold 0.9
Ammonia	CHEMBL1160819	1.000
Aspirin	CHEMBL25	1.000
Acetylene	CHEMBL116336	1.000
Adenosine triphosphate	CHEMBL14249	1.000
Adenosine triphosphate	CHEMBL339385	0.942
Adenosine triphosphate	CHEMBL437508	0.942
Adenosine triphosphate	CHEMBL490984	1.000
Adenosine triphosphate	CHEMBL591905	0.942
Adenosine triphosphate	CHEMBL407938	0.942
Adenosine triphosphate	CHEMBL14830	0.981
Adenosine triphosphate	CHEMBL1331383	0.942
Adenosine triphosphate	CHEMBL1626485	0.942
Adenosine triphosphate	CHEMBL1610462	0.925
Adenosine triphosphate	CHEMBL1162201	0.942
Adenosine triphosphate	CHEMBL4298329	0.942
Ampicillin	CHEMBL174	1.000
Ampicillin	CHEMBL453388	0.978
Ampicillin	CHEMBL1078033	0.978
Ampicillin	CHEMBL1473908	1.000
Chemistry of ascorbic acid	CHEMBL196	1.000
Chemistry of ascorbic acid	CHEMBL486293	1.000
Chemistry of ascorbic acid	CHEMBL1161421	1.000
Chemistry of ascorbic acid	CHEMBL196	1.000
Chemistry of ascorbic acid	CHEMBL486293	1.000
Chemistry of ascorbic acid	CHEMBL1161421	1.000
Amphetamine	CHEMBL405	1.000
Amphetamine	CHEMBL19393	1.000
Amphetamine	CHEMBL612	1.000
Amphetamine	CHEMBL287386	0.950
Amphetamine	CHEMBL554211	0.950
Amphetamine	CHEMBL1788294	0.950
Aspartame	CHEMBL171679	1.000
Aspartame	CHEMBL312245	1.000
Aspartame	CHEMBL2110238	1.000
Amoxicillin	CHEMBL1082	1.000
Amoxicillin	CHEMBL1367635	1.000
Amoxicillin	CHEMBL1433952	1.000

If I add the --times option, then I see I'm doing about 2 searches per second:

load: 4.57 search: 5.01 #queries: 10 #/sec: 2.00

The equivalent chemfp search

Chemfp is a high-performance cheminformatics similarity search package I sell. Its command-line is similar to the chembl_in_memory_search.py program I developed above. I'll try it out:

% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fps > /dev/null
open 0.00 read 0.00 search 8.55 output 0.00 total 8.56

This means it takes about 9 seconds to search the file, which is about the same time as the simple Python code I wrote!

--scan vs. --memory searches

That was unexpected. The problem is that chemfp has two search modes. The --scan mode searches the file chunk by chunk (like Monday' essay), while --memory loads the data set into memory before searching.

Years ago, the tradeoff was about 20 queries. Now I see it's much smaller! If I ask it to do an in-memory search then it takes only 5.7 seconds total:

% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fps --memory > /dev/null
open 5.60 read 0.00 search 0.07 output 0.00 total 5.67

This means it took 5.6 seconds to read the fingerprints and organize them for fast search, and the search itself took only 0.07 seconds.

Looks like it's time to re-tune that parameter!

chemfp's FPB format

One reason I didn't notice this until now is that I tend to work with chemfp's FPB format, which is optimized for fast loading. It takes all of the FPS readers over 4 seconds to parse the file to put the data into a format which is more appropriate for faster searches.

The FPB format organizes the fingerprint data so it can loaded quickly. In fact, if the file is uncompressed it can be memory-mapped and used directly - for some searches chemfp doesn't even need to load the full data set!

First, I'll convert the FPS file to FPB format:

% fpcat chembl_27.fps -o chembl_27.fpb

I'll then do the same search with 10 Wikipedia queries using a warm disk cache:

% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fpb > /dev/null
open 0.00 read 0.00 search 0.22 output 0.00 total 0.23

That's about 22 milliseconds per query, with very little load time needed. This is fast enough to include a ChEMBL similarity search in a scripting environment.

chemfp with 1,000 Wikipedia queries

I then tried with nearly 1,000 Wikipedia queries (RDKit could not parse a couple of structures; I've omitted RDKit's warning and error messages) using chemfp and the FPS and FPB formats:

% simsearch --queries wikipedia_1000.smi --threshold 0.9 --times chembl_27.fps > /dev/null
open 5.92 read 0.26 search 2.58 output 0.01 total 8.77
% simsearch --queries wikipedia_1000.smi --threshold 0.9 --times chembl_27.fpb > /dev/null
open 0.02 read 0.25 search 2.77 output 0.02 total 3.07

With 1,000 queries the read time is now noticeable; that includes the time for RDKit to parse the SMILES string and generate the Morgan fingerprint. In case you are curious, the performance is over 100x faster than the simple in-memory program I developed for this essay:

% python chembl_in_memory_search.py --queries wikipedia_1000.smi --threshold 0.9 --times > /dev/null
load: 4.69 search: 499.65 #queries: 994 #/sec: 1.99

chemfp advertisement

You can install and test chemfp chemfp (under the Base License Agreement) by doing:

python -m pip install chemfp -i https://chemfp.com/packages/

Andrew Dalke's writings

Are 32-bit floats enough for your Tanimoto scores?

Real-world Tanimotos scores and binary32

When is 32-bit IEEE 754 floats not enough?

Don't worry!

Farey sequence

Find minimum N using a Farey sequence

Tversky similarity

Advertisement

A molfile precursor?

Background: MDL and MACCS

Gund, Wipke, and Langridge (1974)

Cache and reuse popcount-sorted ChEMBL fingerprints

Amortize the load time

Python pickle

Implement a fingerprint pickler

Timing the new loader

Chemfp's FPB format

chemfp advertising

Even faster in-memory search with intersection popcount

threshold=0.35

Reducing the number of popcounts

Intersection popcount

Similarity search using the intersection popcount

Changes at the Python level

Verification and timings

intersection popcount speeds up full in-memory search too

What's left?

Faster BitBound ChEMBL search by using more C

Merge fingerprints in each bin into a single bytestring

Searching the bins

Validation and Timings

chemfp advertisement

Faster in-memory ChEMBL search by using more C

Merge target fingerprints into a single byte string

Load the target fingerprints into a byte string, and get the ids

Put results in a hit list

threshold_tanimoto_search()

Updated ChEMBL similarity search

Does it work?

chemfp advertising

Using RDKit BulkTanimotoSimilarity

Loading ExplicitBitVect fingerprints

Load the data sets

zip(*values)

Actually loading the data sets

BulkTanimotoSimilarity()

Timings

Are there obvious ways to speed things up?

Simple BitBound ChEMBL similarity search

BitBound algorithm

Compute the fingerprint popcount

Pre-compute the target fingerprints and put into bins

Determine the bounds

Solution #1: change by epsilon

Solution #2: use rationals

Search the appropriate bins

Does it work?

Can improve the Tanimoto calculation performance

Can skip the Tanimoto calculation performance

chemfp advertisement

Simple in-memory ChEMBL similarity search

Measuring FPS parsing overhead

Multiple queries

What multiple queries?

SMILES file

SMILES file parser

Reporting search results

A search implementation

Command-line options with argparse

Setting up the query reader

chembl_in_memory_search.py

Search Wikipedia structures

The equivalent chemfp search

--scan vs. --memory searches

chemfp's FPB format

chemfp with 1,000 Wikipedia queries

chemfp advertisement