Are 32-bit floats enough for your Tanimoto scores?
Chemfp uses 64-bit IEEE 754 floating point (a.k.a. double or binary64 or float64) to represent cheminformatics fingerprint similarity scores. Some fingerprint search systems instead use 32-bit IEEE 754 floating point (a.k.a. single or binary32 or float32), usually to reduce the amount of memory needed, and sometimes also for the higher performance of 32-bit floating point operations.
I chose to represent everything in binary64 for a simple reason: chemfp is designed to work with Python. Python's native float data type is binary64, and I find that mixing 32-bit and 64-bit floats results in subtle conversion problems that are frustrating to figure out. (Consider that the binary32 representation closest to 0.6 has a value slightly larger than 0.6, while the corresponding binary64 representation's value is slightly smaller than 0.6.)
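This is easy to see with Python's struct module, which can round-trip a value through the 4-byte binary32 representation:

```python
import struct
from decimal import Decimal

# Round-trip 0.6 through a 32-bit float ("f" is IEEE 754 binary32).
f32 = struct.unpack("f", struct.pack("f", 0.6))[0]

# The closest binary32 value is slightly above 0.6 ...
print(Decimal(f32))   # 0.60000002384185791015625
# ... while the closest binary64 value is slightly below it.
print(Decimal(0.6))   # 0.59999999999999997779553950749686919152736663818359375
```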
Other than possible conversion problems, is there any reason to NOT use a binary32 value?
In this essay I'll show that binary32 isn't able to distinguish some Tanimoto scores for fingerprints 4142 bits long or longer, though in practice this is essentially never a concern.
Real-world Tanimoto scores and binary32
As a brief background, most fingerprints are binary bitstrings of length around 1,000 bits. Let ‖fp‖ be the number of on-bits in a fingerprint bitstring. The most common way to compare two fingerprints is the Jaccard similarity, which in cheminformatics is referred to as the Tanimoto. It is computed as ‖fp1&fp2‖/‖fp1|fp2‖, and ranges between 0.0 and 1.0. In chemfp, 0/0 = 0.0.
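As a small illustration of the definition (the integers here are made-up toy bitstrings, not real fingerprints):

```python
def tanimoto(fp1: int, fp2: int) -> float:
    # ‖fp1&fp2‖ / ‖fp1|fp2‖, with chemfp's convention that 0/0 = 0.0.
    union_count = bin(fp1 | fp2).count("1")
    if union_count == 0:
        return 0.0
    return bin(fp1 & fp2).count("1") / union_count

fp1 = 0b0111_1000  # 4 on-bits
fp2 = 0b0011_1110  # 5 on-bits
print(tanimoto(fp1, fp2))  # 3 common on-bits / 6 union on-bits = 0.5
```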
A simple examination shows that 64 bits is overkill. I can't recall ever reading a paper using over 2^14 = 16,384 bits. This means the numerator and denominator of every Tanimoto score each require at most 14 bits, so should easily fit in 32 bits.
It therefore seems like a binary32 value should work. However, some of those 32 bits are used to store values that aren't needed for Tanimoto scores. The sign bit isn't needed because Tanimoto scores can never be negative. A further 8 bits are reserved for 254 possible exponents, but only 7 or so of those exponents are needed.
In the worst case, we only have 24 bits of significand precision, which gives 12 bits each for the numerator and denominator, corresponding to a maximum fingerprint length of 4096 bits.
But surely we can do a bit better than 4096 because the values under 0.1 use a different exponent. Perhaps we can scale up to 4096 / 0.9 = 4551 bits? Furthermore, some of those Tanimoto ratios are not in reduced form; that is, 1/2 = 2/4 = 4/8 = 0.5. Perhaps that gives some more breathing room?
When are 32-bit IEEE 754 floats not enough?
What is the smallest fingerprint length with two distinct possible scores that cannot be distinguished using binary32? The brute-force solution is to enumerate all ratios and report when there are duplicates, as in the following:
# Find the smallest fingerprint length where two different Tanimoto
# scores have the same 32-bit float representation.
import struct

# Convert the value to the 4-byte/32-bit representation.
pack_f32 = struct.Struct("f").pack

# Mapping from packed_score (float-as-bytes) to (value, numerator, denominator)
seen_packed_scores = {}

# Try each ratio
for denominator in range(1, 5000):
    for numerator in range(1, denominator):
        # This uses Python's native 64-bit float
        score = numerator / denominator
        # Get the 4-byte representation of the score
        packed_score = pack_f32(score)
        # Check if that packed score was seen before
        prev = seen_packed_scores.get(packed_score, None)
        if prev is None:
            # Not seen, so record it in case I see it again.
            seen_packed_scores[packed_score] = (score, numerator, denominator)
            continue
        # The packed score can be the same if the scores are the same.
        # (4/8 and 5/10 give the same scores)
        prev_score, prev_numerator, prev_denominator = prev
        if score != prev_score:
            print(f"{score} = {numerator}/{denominator} and {prev_numerator}/{prev_denominator}")
            raise SystemExit(1)

print("No duplicates found.")
This takes about 10 seconds before it reports:
0.5111057460164172 = 2117/4142 and 2094/4097
That means that binary32 cannot distinguish between the Tanimoto scores 2117/4142 and 2094/4097, so cannot be used to represent all possible Tanimoto scores for fingerprints that are 4142 bits long or longer.
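A quick check with Python's struct module confirms that those two distinct ratios pack to the same binary32 bytes:

```python
import struct

pack_f32 = struct.Struct("f").pack

score1 = 2117 / 4142
score2 = 2094 / 4097

# Distinct as Python's native 64-bit floats ...
print(score1 == score2)  # False
# ... but identical once packed as 32-bit floats.
print(pack_f32(score1) == pack_f32(score2))  # True
```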
Don't worry!
On the other hand, do you really need to represent all possible Tanimoto scores? Most people don't work with fingerprints beyond about 4096 bits. Most fingerprints are sparse, with an on-bit density of less than 10%. If the fingerprint type is 8192 bits long and has a maximum bit density of 20% then the Tanimoto denominator is at most 3277 bits, which is within the expected limit.
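The arithmetic behind that 3277-bit bound is simple enough to sketch (max_union_popcount is a made-up helper name, not part of chemfp):

```python
import math

def max_union_popcount(num_bits, max_density):
    # Worst-case Tanimoto denominator: two fingerprints both at
    # maximum density that share no on-bits at all, so the union
    # popcount is twice the per-fingerprint maximum.
    return math.ceil(2 * num_bits * max_density)

print(max_union_popcount(8192, 0.20))  # 3277, under the ~4096 limit
```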
I honestly don't think that real-world work will run into this issue.
To be sure, it is possible to get Tanimoto scores in this range. The densest fingerprints I know are the RDKit fingerprints. The following program uses chemfp's interface to RDKit to find the ChEMBL records with a successively increasing number of bits set, using 16384-bit fingerprints:
import chemfp
from chemfp import bitops

fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=16384")

max_num_bits = 0
for id, fp in fptype.read_molecule_fingerprints("chembl_27.sdf.gz", errors="ignore"):
    num_bits = bitops.byte_popcount(fp)
    if num_bits > max_num_bits:
        print(id, num_bits)
        max_num_bits = num_bits
The output, until I stopped it manually, is:
CHEMBL153534 819
CHEMBL440060 1244
CHEMBL440245 1290
CHEMBL440249 2656
CHEMBL504077 3546
CHEMBL501923 3600
CHEMBL501238 3648
CHEMBL441925 3783
CHEMBL439119 5981
CHEMBL438241 5999
CHEMBL1207820 6007
CHEMBL411986 6036
CHEMBL263314 6069
CHEMBL268814 6124
CHEMBL404974 6141
CHEMBL405910 6196
CHEMBL410076 7382
^C
A spot-check with CHEMBL411986 and CHEMBL1207820, using the SQLite3 file from the ChEMBL distribution to get the corresponding SMILES, is:
import sqlite3
import chemfp
from chemfp import bitops

db = sqlite3.connect("chembl_27.db")

(smiles1, id1), (smiles2, id2) = db.execute("""
SELECT canonical_smiles, chembl_id
  FROM compound_structures, chembl_id_lookup
 WHERE chembl_id_lookup.chembl_id IN ("CHEMBL411986", "CHEMBL1207820")
   AND chembl_id_lookup.entity_id = compound_structures.molregno""")

fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=16384")
fp1 = fptype.parse_molecule_fingerprint(smiles1, "smistring")
fp2 = fptype.parse_molecule_fingerprint(smiles2, "smistring")

print(f"{id1} has {bitops.byte_popcount(fp1)} bits set.")
print(f"{id2} has {bitops.byte_popcount(fp2)} bits set.")
print(f"intersection has {bitops.byte_intersect_popcount(fp1, fp2)} bits set.")
print(f"union has {bitops.byte_popcount(bitops.byte_union(fp1, fp2))} bits set.")
print(f"Tanimoto({id1}, {id2}) = {bitops.byte_tanimoto(fp1, fp2)}.")
The above program reports:
CHEMBL1207820 has 6007 bits set.
CHEMBL411986 has 6036 bits set.
intersection has 5744 bits set.
union has 6299 bits set.
Tanimoto(CHEMBL1207820, CHEMBL411986) = 0.9118907763137006.
This confirms there may be an issue with some fingerprints. Even with 8192-bit fingerprints the Tanimoto is still 4767/5115, which is in the range that causes problems.
However, even supposing I could find a real-world pair of fingerprints where the actual Tanimoto scores were different but the binary32 scores were the same, the real question is: does that make a difference?
I think not. After all, fingerprints do not give an accurate enough representation of a molecule to justify four digits of precision.
So, don't worry! (Or use a tool like chemfp which doesn't use binary32 floats. ;) ).
Farey sequence
The Python program was easy to write, but a bit slow, and it used a lot of memory to store all of the distinct values. That's why I called it the brute-force solution.
To be sure, it only took a few seconds, which was more than fast enough to answer the question posed. But what if performance was more critical?
The usual solutions are to switch to PyPy (a faster implementation of the Python language), or to switch to a language like C which usually has less performance overhead.
However, C has no built-in dictionary type, making a direct translation difficult. I could use C++ or some other more sophisticated language. Instead, I'll switch algorithms.
Twenty-some-odd years ago, John MacCuish pointed out to me that the possible Tanimoto scores for a given length N were the same as the terms in the Farey sequence for N. This was part of the paper he co-authored describing issues with ties in proximity and clustering compounds.
The Farey sequence for N is the ordered sequence of distinct values between 0 and 1 where each value is a rational with the denominator at most N. Farey gave a simple algorithm for how to generate the reduced rationals that make up that sequence. The following is a shortened version of the Python example in the Wikipedia entry, along with its output for N=4:
def farey_sequence(n):
    (a, b, c, d) = (0, 1, 1, n)
    yield (0, 1)
    while c <= n:
        k = (n + b) // d
        (a, b, c, d) = (c, d, k * c - a, k * d - b)
        yield a, b

>>> list(farey_sequence(4))
[(0, 1), (1, 4), (1, 3), (1, 2), (2, 3), (3, 4), (1, 1)]
That's pretty easy to write in C. The following, farey_check.c, takes a single command-line value N and reports the first pair of ratios which cannot be distinguished using a C float (which is a binary32):
/* Call this "farey_check.c" */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int a, b, c, d;
    int next_a, next_b, next_c, next_d;
    int k;
    long n;
    float prev_score, score;
    char *endptr;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s N\n", argv[0]);
        exit(1);
    }
    n = strtol(argv[1], &endptr, 10);
    if (*endptr != '\0' || n < 1 || n > 10000) {
        fprintf(stderr, "N must be an integer between 1 and 10,000\n");
        exit(1);
    }

    /* Modified from https://en.wikipedia.org/wiki/Farey_sequence#Next_term */
    prev_score = 0.0;
    a=0, b=1, c=1, d=n;
    while (c <= n) {
        /* Get the next value in the Farey sequence */
        k = (n + b) / d;
        next_a = c;
        next_b = d;
        next_c = k * c - a;
        next_d = k * d - b;

        /* Compute the ratio */
        score = ((float) next_a) / next_b;

        /* Check if a 32-bit float can distinguish this value */
        /* from the previous. If not, report the value and its ratios. */
        if (score == prev_score) {
            printf("%f = %d/%d and %d/%d\n", score, a, b, next_a, next_b);
            break;
        }
        prev_score = score;
        a = next_a;
        b = next_b;
        c = next_c;
        d = next_d;
    }
}
I'll compile and run it:
% cc -O2 farey_check.c -o farey_check
% ./farey_check 5000
0.500100 = 2499/4997 and 2498/4995
% time ./farey_check 5000
0.500100 = 2499/4997 and 2498/4995
0.056u 0.002s 0:00.06 83.3% 0+0k 0+0io 0pf+0w
It took about 60 milliseconds to find a failure case.
However, that ratio uses larger denominators than the one I found earlier, which was:
0.5111057460164172 = 2117/4142 and 2094/4097
The problem is that the farey_check program finds a failure case for the given value of N, rather than the minimum value of N which has a failure case.
Find minimum N using a Farey sequence
The obvious approach is to test every value of N; that took about 80 seconds to run. While C is significantly faster than Python, the Farey sequence for N+1 includes all of the terms in the Farey sequence for N, so testing every N redoes nearly all of the work, which really slows things down.
Instead, I'll use a binary search.
BEWARE! Binary search is notoriously tricky to implement correctly. What I usually do is either use Python's bisect module directly (which contains several binary search variants), or look to its implementation for guidance. Even then, be aware that one of the steps, noted below, may overflow a C integer.
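To show the bisect_left pattern in a form that runs quickly, here's a Python sketch of the same search against binary16 (half-precision, struct format "e") instead of binary32, since its failure point is small enough to find in well under a second. It relies on the same monotonicity as the real search: a collision in the Farey sequence for N persists for all larger N. The function names here are mine, not from any library:

```python
import struct

pack_f16 = struct.Struct("e").pack  # IEEE 754 binary16 (half precision)

def farey_sequence(n):
    # Same next-term algorithm as above, repeated here so this
    # sketch is self-contained.
    a, b, c, d = 0, 1, 1, n
    yield (0, 1)
    while c <= n:
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b
        yield (a, b)

def first_collision(n):
    # Return the first adjacent pair of Farey(n) terms that binary16
    # cannot distinguish, or None if every term is distinct.
    prev_packed = prev_term = None
    for term in farey_sequence(n):
        packed = pack_f16(term[0] / term[1])
        if packed == prev_packed:
            return prev_term, term
        prev_packed, prev_term = packed, term
    return None

def find_min_failing_n(lo, hi):
    # bisect_left pattern from Python's bisect module: find the
    # smallest n in [lo, hi) where first_collision(n) reports a failure.
    while lo < hi:
        mid = (lo + hi) // 2
        if first_collision(mid) is None:
            lo = mid + 1
        else:
            hi = mid
    return lo

n = find_min_failing_n(1, 1000)
(a1, b1), (a2, b2) = first_collision(n)
print(f"binary16 first fails at N={n}: {a1}/{b1} and {a2}/{b2}")
```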
Here's the full search code, which does a binary search between 1 and 10,000 to find the minimum value of N with a failure case, and to report that failure case:
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if two values have the same 32-bit float representation,
   else return 0. */
/* Modified from https://en.wikipedia.org/wiki/Farey_sequence#Next_term */
int farey_check(int n, int *a1, int *b1, int *a2, int *b2) {
    int a, b, c, d;
    int next_a, next_b, next_c, next_d;
    int k;
    float score, prev_score;

    prev_score = 0.0;
    a=0, b=1, c=1, d=n;
    while (c <= n) {
        /* Get the next value in the Farey sequence */
        k = (n + b) / d;
        next_a = c;
        next_b = d;
        next_c = k * c - a;
        next_d = k * d - b;

        /* Compute the ratio */
        score = ((float) next_a) / next_b;

        /* Check if a 32-bit float can distinguish this value from */
        /* the previous. If not, record the ratios and return 1. */
        if (score == prev_score) {
            *a1 = a;
            *b1 = b;
            *a2 = next_a;
            *b2 = next_b;
            return 1;
        }
        prev_score = score;
        a = next_a;
        b = next_b;
        c = next_c;
        d = next_d;
    }
    /* No failure case found. */
    return 0;
}

int main(int argc, char *argv[]) {
    int lo = 1;
    int hi = 10000;
    int mid;
    int a1, b1, a2, b2;

    /* bisect_left from Python's bisect module */
    while (lo < hi) {
        mid = (lo + hi) / 2;  /* NOTE: Not valid if lo + hi > MAXINT */
        if (farey_check(mid, &a1, &b1, &a2, &b2) < 1) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    /* Re-run to get the ratios for the failure case.
       (Use lo, not mid: after the loop exits, mid may be lo-1.) */
    farey_check(lo, &a1, &b1, &a2, &b2);
    printf("N: %d\n", lo);
    printf("%f = %d/%d and %d/%d\n", ((float) a1)/b1, a1, b1, a2, b2);
}
This takes about 0.5 seconds to report:
N: 4142
0.511106 = 2094/4097 and 2117/4142
which is about 20x faster than the original Python program, and it uses constant memory.
It also took several times longer to write. ;)
Tversky similarity
A secondary reason that chemfp uses binary64 is to store Tversky similarity search results in the same data structure as Tanimoto similarity search results.
Tversky similarity uses parameters alpha and beta to give different weights to the shared bits than the unshared bits. If alpha = beta = 1.0 then it's identical to the Tanimoto similarity.
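That reduction is easy to check numerically with the standard popcount form of the Tversky index (the function below is my own sketch of the textbook formula, not chemfp's implementation):

```python
def tversky(A, B, C, alpha, beta):
    # A = ‖fp1‖, B = ‖fp2‖, C = ‖fp1&fp2‖.
    denom = alpha * (A - C) + beta * (B - C) + C
    return 0.0 if denom == 0 else C / denom

# With alpha = beta = 1.0 the denominator is A + B - C, i.e. ‖fp1|fp2‖,
# so the score reduces to the Tanimoto. Using the ChEMBL popcounts
# from the earlier spot-check:
print(tversky(6036, 6007, 5744, 1.0, 1.0))  # 5744/6299, the Tanimoto
```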
Suppose N=1024 bits, alpha = 0.15, and beta = 0.85. The possible Tversky scores are still in a Farey sequence, but in this case they are in the Farey sequence for N*20, because alpha=3/20 and beta=17/20. And not all of the terms in that Farey sequence correspond to an accessible Tversky score.
I think it's easier to store the results in a binary64 than to figure out if binary32 has sufficient space to store the results of a given similarity method, or to worry about the possible consequences if it does not.
Advertisement
Download chemfp and try it out using:
python -m pip install chemfp -i https://chemfp.com/packages/
No license key needed to use it with the pre-built ChEMBL fingerprints.
Industrial and no-cost academic licensing available. Contact me for a license key to evaluate chemfp with your data sets and ask about pricing.
A molfile precursor?
I think I found a precursor to the MDL molfile in a 1974 publication by Gund, Wipke, and Langridge. Here it is:
Background: MDL and MACCS
The SDFile format is one of a suite of related formats which came out of MDL (Molecular Design Limited) starting in the late 1970s. (An SDFile is a molfile followed by a data block followed by the SDFile delimiter.) Quoting Wikipedia:
MDL was the first company to provide interactive graphical registry and full and substructural retrieval. The company's initial products were first-of-their-kind systems for storing and retrieving molecules as graphical structures and for managing databases of chemical reactions and related data. These systems revolutionized the way scientists accessed and managed chemical information in the 1980s.
From its initial pioneering of computer handling of graphical chemical structures with MACCS (Molecular ACCess System) in 1979, MDL continued at the forefront of the field now known as cheminformatics
(Yes, that's the same MACCS as the 166-bit MACCS keys, still widely used for fingerprint similarity. They were developed in the MACCS system as a substructure screen. So many people had a MACCS system, with the screens already computed, that it was a natural data source for the early similarity search implementations.)
Gund, Wipke, and Langridge (1974)
MDL was co-founded by W. Todd Wipke in 1978, who had been working on computer-assisted chemical synthesis design for over a decade.
I came across a 1974 publication, co-authored by Wipke, in proceedings from a 1973 conference in Ljubljana/Zagreb, which appears to contain an early version of what was to become a molfile. Here's the relevant text from the second page of the paper (the page number on the bottom of the page is 5/34):
DATA STRUCTURE. For this preliminary study, an extremely simple format was adopted for the molecular data files, as shown in figure 1. The pattern data files are exactly the same, except the bond order may be zero. An advantage of having the same format is that patterns may be displayed and manipulated as if they were molecules.
and here's Fig. 1, Format for molecular structure files, at the top of the next page (5/35):
You can see the basic structure is similar to a molfile, though a molfile has two additional lines for misc. info., and the counts line and the atom and bond lines all have a few more fields.
The full citation is:

Gund, P., Wipke, W. T., and Langridge, R., Computer Searching of a Molecular Structure File for Pharmacophoric Patterns, Computers in Chemical Research and Education, vol. 3, pp. 33-38 (1974).
The publication contains papers from a conference held in 1973, but Google Scholar says it was published in 1974, so I'm going with that.
It does not appear to be based on a previously widely used connection table format. The 1971 textbook Computer handling of chemical structure information by Lynch, Harrison, Town, and Ash does not show any format with coordinate information, though it does mention that some exist.
Do you know of an earlier molfile-like format? Let me know!
Cache and reuse popcount-sorted ChEMBL fingerprints
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
The program I wrote in the first essay of this series of ChEMBL fingerprint Tanimoto search algorithms took 2 seconds per query. I wrote a series of new programs which had a higher startup cost in exchange for faster per-query searches. The final program, in yesterday's essay, took 10 seconds to load but could process 35-90 queries per second, depending on the threshold.
In this short essay I'll use Python's pickle module to cache and re-use the loaded fingerprint data, reducing the load time to 0.7 seconds.
If you want to jump right to the code:
- Download and uncompress chembl_27.fps.gz.
- Download and run popc.py if you haven't already (it's unchanged from yesterday).
- Download chembl_bitbound_search_in_C_v3.py.
- Run python chembl_bitbound_search_in_C_v3.py --times to do the default caffeine search and see the internal timings. This will cache the fingerprint data to ./chembl_27.fps.pkl.
- Run python chembl_bitbound_search_in_C_v3.py --times again to do the default caffeine search and see the internal timings using the pickle file. The load time should be much faster and the search time about the same.
Amortize the load time
The initial load takes about 10 seconds because the program processes each fingerprint to place it into the right bin, based on its popcount, then merges all fingerprints in the same popcount bin into a single byte string.
This organization is done every time you run the search program.
However, if you plan to run the program many times then it's worthwhile to see if the organized data can be stored (in this case, to a file) in some way which is faster to load when needed. If so, the overhead of organizing the data and storing it is paid only once. In computer science terms, the overhead cost is amortized across all of the times it's needed, assuming the new load time is low enough.
Python pickle
Python has a few ways to turn a simple data structure into a string or file, and convert it back into the original data structure. The most common (and fastest, in my testing) is the pickle module. I'll start by walking through an example which pickles a 3-element tuple into a byte string then converts the bytestring back into an equivalent 3-element tuple:
>>> import pickle
>>> data = (1, b"abc", 9.8)
>>> pkl_data = pickle.dumps(data)
>>> pkl_data
b'\x80\x04\x95\x14\x00\x00\x00\x00\x00\x00\x00K\x01C\x03abc\x94G@#\x99\x99\x99\x99\x99\x9a\x87\x94.'
>>> pickle.loads(pkl_data)
(1, b'abc', 9.8)
There are also functions to write to and read from a file opened in binary mode:
>>> with open("datafile.pkl", "wb") as outfile:
...     pickle.dump(data, outfile)
...
>>> with open("datafile.pkl", "rb") as infile:
...     print(pickle.load(infile))
...
(1, b'abc', 9.8)
NOTE: Pay attention to this warning from the documentation:
Warning: The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
Luckily for us, that's not a concern for this essay.
Implement a fingerprint pickler
The chembl_bitbound_search_in_C_v2.py program uses about 16 lines (with comments and a blank line) to create the list target_popcount_bins and figure out max_bin_size. Here's the code we're starting off with, indented one extra level so it makes sense in the new program:
    # Create 2049 bins, each with an empty list, ordered by popcount.
    target_popcount_bins = [[] for i in range(2049)]
    for id_fp_pair in read_fingerprints(open(args.targets)):
        # Place each record into the right bin.
        popcount = byte_popcount_256(id_fp_pair[1])
        target_popcount_bins[popcount].append(id_fp_pair)

    # Split the ids and fingerprints into parallel lists then merge the fingerprints
    max_bin_size = 0
    for popcount, id_fp_pairs in enumerate(target_popcount_bins):
        if id_fp_pairs:
            target_ids, target_fingerprints = list(zip(*id_fp_pairs))
            target_popcount_bins[popcount] = (target_ids, b"".join(target_fingerprints))
            max_bin_size = max(max_bin_size, len(target_ids))
        else:
            target_popcount_bins[popcount] = [[], b""]
I'll save the two-element tuple (max_bin_size, target_popcount_bins) into a pickle file, with the command-line argument --no-cache to disable caching. Assuming cache_filename contains the name of the file to store the cached data, and args.no_cache is True if --no-cache is set, then saving is pretty simple:
if not args.no_cache:
    with open(cache_filename, "wb") as pkl_file:
        pickle.dump((max_bin_size, target_popcount_bins), pkl_file)
and reading it is a matter of: if the cache file is there, use it, otherwise use the original code to read the fingerprint file then save to the cache file. Which looks like:
cache_filename = "chembl_27.fps.pkl"
if os.path.exists(cache_filename) and not args.no_cache:
    with open(cache_filename, "rb") as pkl_file:
        max_bin_size, target_popcount_bins = pickle.load(pkl_file)
else:
    # Create 2049 bins, each with an empty list, ordered by popcount.
    target_popcount_bins = [[] for i in range(2049)]
    for id_fp_pair in read_fingerprints(open(args.targets)):
        …
Other than a few minor changes, like importing the os and pickle modules and adding the new command-line option:
parser.add_argument("--no-cache", action="store_true",
    help="don't read or write the processed fingerprints to 'chembl_27.fps.pkl'")
... we're done!
Timing the new loader
I'll do the default search for caffeine using the new chembl_bitbound_search_in_C_v3.py
:
% python chembl_bitbound_search_in_C_v3.py --times
No query specified, using caffeine.
caffeine  CHEMBL113  1.000
caffeine  CHEMBL1232048  0.710
load: 13.21 search: 0.01 #queries: 1 #/sec: 109.08
You can see the load time went up by another couple of seconds. That's the extra time needed to save the 500 MB cache file:
% ls -lh chembl_27.fps.pkl
-rw-r--r--  1 dalke  staff   503M Oct  8 20:36 chembl_27.fps.pkl
Now that the cache file is created, I'll try the search again:
% python chembl_bitbound_search_in_C_v3.py --times
No query specified, using caffeine.
caffeine  CHEMBL113  1.000
caffeine  CHEMBL1232048  0.710
load: 0.78 search: 0.01 #queries: 1 #/sec: 162.16
Well under one second - more than an order of magnitude faster! In fact, it's faster than scanning the FPS file.
Which means if you can take a one-time cost of preparing and storing the 500 MB pickle file then it will be faster to search than the original FPS file, even for a single query.
Lastly, a check with the Wikipedia SMILES:
% python chembl_bitbound_search_in_C_v3.py --queries wikipedia_1000.smi --threshold 0.35 --times > /dev/null
load: 0.70 search: 30.39 #queries: 994 #/sec: 32.71
Yes, the load time is faster and the search time is still about the same. As expected.
Chemfp's FPB format
Chemfp natively supports the FPS and FPB formats. I gave an overview of the FPS format in the first essay of the series.
Conceptually the FPB format is similar to this pickle format. It contains the fingerprints sorted by popcount to make good use of the BitBound algorithm, and it stores the ids. It also stores a hash table to make it easy to look up a fingerprint given its id, as well as the FPS header metadata, so the actual file is slightly larger than the pickle file you saw earlier:
% time fpcat chembl_27.fps -o chembl_27.fpb
6.949u 1.403s 0:08.91 93.6% 0+0k 0+0io 676pf+0w
% ls -lh chembl_27.fpb
-rw-r--r--  1 dalke  staff   534M Oct  8 21:57 chembl_27.fpb
Most importantly, it's arranged so the data can be loaded with almost no additional processing. If the FPB file is uncompressed then the contents are memory-mapped, meaning the chemfp implementation uses the data as-is. This combines well with the BitBound algorithm because for some searches, like if the threshold is high, the search doesn't even need to read all of the data from disk.
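As a rough sketch of why memory-mapping makes loading nearly free (this is a generic illustration using Python's mmap module, not chemfp's actual FPB reader; the file name and layout are made up):

```python
import mmap
import os

# Create a toy file of four 256-byte "fingerprints" (2048 bits each).
with open("toy_fps.bin", "wb") as f:
    for i in range(4):
        f.write(bytes([i]) * 256)

# Memory-map it: no bytes are copied up front; the OS pages data in
# only when a slice is actually touched.
with open("toy_fps.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    fp2 = mm[2 * 256 : 3 * 256]  # jump straight to fingerprint #2
    print(fp2[:4])
    mm.close()

os.remove("toy_fps.bin")
```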
In the following caffeine search, the load time is under 10 milliseconds; it shows as 0.00 simply because there aren't enough digits in the output. (I reordered the output manually so the times, which are written to stderr, come at the end, after the search results written to stdout.)
% simsearch --query 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C' --threshold 0.7 chembl_27.fpb --times
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=all threshold=0.7
#software=chemfp/3.4.1
#targets=chembl_27.fpb
2  Query1  CHEMBL113  1.00000  CHEMBL1232048  0.70968
open 0.00 read 0.00 search 0.04 output 0.00 total 0.04
With the full 1000 Wikipedia SMILES you can see the load time is a bit higher, but still quite small:
% simsearch --queries wikipedia_1000.smi --threshold 0.35 chembl_27.fpb --times > /dev/null
open 0.03 read 0.26 search 15.55 output 1.19 total 17.02
(When you have that many hits, simply formatting the scores to print them takes significant time - 7% of the total time in this case!)
chemfp advertising
chemfp is available with academic and commercial licenses. Academics can even get a no-cost license key to use all of the features in the pre-compiled package for Linux-based OSes. See chemfp's download page for installation instructions, then go to the license page to request an evaluation license key for your license of choice.
Even faster in-memory search with intersection popcount
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
This is part of a series of essays I started to write a week ago where I use a few different approaches to implement cheminformatics fingerprint similarity search.
My initial file scan implementation took 2 seconds per query. I then switched to a two stage in-memory search with a load step and a search step. Loading the data (an initialization cost) took 4.6 seconds and per-query search took about 0.5 seconds, making it more effective if there are three or more queries.
Two days ago I sped up in-memory search by moving more of the search into C. This had even higher initialization overhead, at about 7 seconds, but could process 14 queries per second (70 ms/query). Yesterday I re-implemented last week's BitBound algorithm to use more C code. With the threshold of 0.7 I've been using for benchmarking, the new load time was nearly 9 seconds but search performance increased to almost 60 queries per second.
In this essay I'll improve the search performance by reducing the number of popcount calculations needed during the Tanimoto calculation.
If you want to jump right to the code:
- Download and uncompress chembl_27.fps.gz.
- Download the updated popc.py.
- Run python popc.py to (re)generate the _popc module.
- Download chembl_bitbound_search_in_C_v2.py.
- Run python chembl_bitbound_search_in_C_v2.py --times to do the default caffeine search and see the internal timings.
- Run python chembl_bitbound_search_in_C_v2.py --help to see the command-line help, then try your own queries, including with different thresholds.
threshold=0.35
BitBound search gives a linear speedup to search. The higher the threshold, the greater the speedup. My benchmarking used a threshold of 0.7 solely so the caffeine search of ChEMBL returns two matches, and not for any deeper reason.
In practice, different fingerprint types have different thresholds where the Tanimoto similarity becomes chemically significant. For the ECFP-like fingerprints (including RDKit's Morgan fingerprint) this can be as low as 0.35 or so. I'll compare my full in-memory search program against my BitBound implementation at this more chemically relevant threshold:
% python chembl_in_memory_search_in_C.py --times --queries wikipedia_1000.fps --threshold 0.35 | sort > old_sorted.out
load: 8.65 search: 69.32 #queries: 994 #/sec: 14.34
% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.fps --threshold 0.35 | sort > new_sorted.out
load: 9.33 search: 43.21 #queries: 994 #/sec: 23.01
% cmp old_sorted.out new_sorted.out
% fgrep -c '*' old_sorted.out
88
% wc -l old_sorted.out
559225 old_sorted.out
% python -c "print((559225-88)/994)"
562.5120724346076
(The 88 * lines correspond to queries with no hits.)
With a threshold of 0.35 there are on average 562 hits per query, and the BitBound algorithm is only 60% faster, not the 4x faster we saw with a threshold of 0.7.
Reducing the number of popcounts
Most of the search time is spent in C, in this part of threshold_tanimoto_search() defined in popc.c:
    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        score = byte_tanimoto_256(query_fp, target_fps + target_idx*256);
        if (score >= threshold) {
and most of that time is spent in this part of byte_tanimoto_256():
    for (int i=0; i<num_words; i++) {
        intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
        union_popcount += __builtin_popcountll(fp1_64[i] | fp2_64[i]);
    }
I can speed that up by reducing the number of popcount calculations.
And I can do that with one key observation. If we somehow know that popcount(fp1) = A and popcount(fp2) = B, and if we compute C = popcount(fp1 & fp2), then Tanimoto(fp1, fp2) = C / (A + B - C). The - C in the denominator prevents the common bits from being double-counted.
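That identity is easy to check in pure Python on small integers used as bitstrings. The popcount() and tanimoto() helpers below are throwaway sketches for this check, not part of the essay's code:

```python
# Check that Tanimoto(fp1, fp2) = C / (A + B - C) on small integer bitstrings.

def popcount(x):
    return bin(x).count("1")

def tanimoto(fp1, fp2):
    union = popcount(fp1 | fp2)
    return popcount(fp1 & fp2) / union if union else 0.0

fp1 = 0b11010110   # A = 5 bits set
fp2 = 0b01011100   # B = 4 bits set
A, B = popcount(fp1), popcount(fp2)
C = popcount(fp1 & fp2)   # C = 3 bits in common
assert tanimoto(fp1, fp2) == C / (A + B - C) == 0.5
```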
Finally, my BitBound implementation already computes the target fingerprint popcounts (to put them into the right bins) and the query fingerprint popcount (to get the BitBound limits), so I already have A and B. I just need some way to compute C.
Intersection popcount
I added a new function definition in popc.py, byte_intersect_256(). It computes the intersection popcount between two 2048-bit/256-byte fingerprints. It's a simplification of byte_tanimoto_256() so I think I can simply show it without explanation:
static int
byte_intersect_256(const unsigned char *fp1, const unsigned char *fp2) {
    int num_words = 2048 / 64;
    int intersect_popcount = 0;
    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp1_64 = (uint64_t *) fp1, *fp2_64 = (uint64_t *) fp2;

    for (int i=0; i<num_words; i++) {
        intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
    }
    return intersect_popcount;
}
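For cross-checking on the Python side, the same calculation can be sketched in pure Python. The byte_intersect() helper below is hypothetical, not part of the _popc module:

```python
# Pure-Python sketch of an intersection popcount between two equal-length
# fingerprint byte strings. Slow, but handy for verifying the C version.
def byte_intersect(fp1: bytes, fp2: bytes) -> int:
    return sum(bin(b1 & b2).count("1") for b1, b2 in zip(fp1, fp2))

# 0xff & 0x0f has 4 bits set, 0x0f & 0x0f has 4 bits set
assert byte_intersect(b"\xff\x0f", b"\x0f\x0f") == 8
```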
Similarity search using the intersection popcount
I'll write a new C function, threshold_bin_tanimoto_search(), defined in the updated popc.py, to do the search. If we know the query_popcount and target_popcount then the main search code is:
double total_popcount = (double) (query_popcount + target_popcount);
  ...
/* Do the search */
for (target_idx = 0; target_idx<num_targets; target_idx++) {
    intersect_popcount = byte_intersect_256(query_fp, target_fps + target_idx*256);
    score = intersect_popcount / (total_popcount - intersect_popcount);
    if (score >= threshold) {
        hit_indices[num_hits] = target_idx;
        hit_scores[num_hits] = score;
        num_hits++;
    }
}
Declaring total_popcount as a double not only pre-computes the sum (A+B) once, it also makes the A+B-C step produce a double, so the division is an int / double, resulting in the desired double. If I left the sum as an integer then C's rules say that an integer divided by an integer is an integer, which would truncate every score to 0 or 1.
However, take a second look at that division. If both query_popcount and target_popcount are 0 then score = 0/0. I'll define Tanimoto(0, 0) to be 0, but I still need to handle that case myself instead of letting IEEE 754 math return a NaN or worse.
I don't need to handle that case in the main loop because I can check for zero popcounts up front. While I'm at it, if either popcount is 0 then the Tanimoto must also be 0, so depending on the threshold either every fingerprint in the bin will match, or none of them will.
That test looks like:
/* Handle edge cases if one or both of the fingerprints has no bits set. */
if ((threshold > 0.0) && ((query_popcount == 0) || (target_popcount == 0))) {
    /* Nothing can match: Tanimoto(0, x) = 0 < threshold */
    return 0;
} else if ((threshold <= 0.0) && (total_popcount == 0.0)) {
    /* Everything will match: Tanimoto(0, 0) = 0 >= threshold */
    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        hit_indices[target_idx] = target_idx;
        hit_scores[target_idx] = 0.0;
    }
    return num_targets;
}

/* Do the search */
...
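The same edge-case logic can be sketched in Python. The bin_matches() helper below is hypothetical, for illustration only; it returns None when the real search loop should run:

```python
# Zero-popcount edge cases: Tanimoto(0, x) = 0, and by the chemfp
# convention Tanimoto(0, 0) = 0.
def bin_matches(threshold, query_popcount, target_popcount, num_targets):
    if threshold > 0.0 and (query_popcount == 0 or target_popcount == 0):
        return 0              # nothing in this bin can match
    if threshold <= 0.0 and query_popcount + target_popcount == 0:
        return num_targets    # every fingerprint matches with score 0.0
    return None               # fall through to the real search loop

assert bin_matches(0.7, 0, 35, 100) == 0     # empty query, positive threshold
assert bin_matches(0.0, 0, 0, 100) == 100    # both empty, threshold 0.0
assert bin_matches(0.7, 30, 35, 100) is None # normal case: do the search
```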
Changes at the Python level
Finally, I need to update the Python code to handle the new C function. Here's the relevant code from yesterday's chembl_bitbound_search_in_C.py:
for target_ids, target_fingerprints in target_popcount_bins[min_popcount:max_popcount+1]:
    # Do the Tanimoto search.
    num_hits = threshold_tanimoto_search(
        query_fp, threshold, len(target_ids), target_fingerprints,
        hit_indices, hit_scores)
The target popcount is implicit in the Python iterator but I need an explicit integer. I decided to iterate over indices, which changed the code slightly.
for target_popcount in range(min_popcount, max_popcount+1):
    target_ids, target_fingerprints = target_popcount_bins[target_popcount]
    # Do the Tanimoto search.
    num_hits = threshold_bin_tanimoto_search(
        query_fp, query_popcount, threshold, len(target_ids), target_fingerprints,
        target_popcount, hit_indices, hit_scores)
(A perhaps more Pythonic solution is to store the popcount as part of each bin in target_popcount_bins, so instead of the 2-element (list of ids, fingerprint block) each bin is the 3-element (popcount, list of ids, fingerprint block).)
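A toy sketch of that alternative layout (with empty bins rather than real ChEMBL ids and 256-byte fingerprints):

```python
# Each bin carries its own popcount, so iterating over a popcount slice
# no longer needs a separate range() loop to recover the index.
target_popcount_bins = [(popcount, [], b"") for popcount in range(2049)]

min_popcount, max_popcount = 5, 10
seen = []
for popcount, target_ids, target_fingerprints in target_popcount_bins[min_popcount:max_popcount+1]:
    seen.append(popcount)  # the popcount comes along for free

assert seen == [5, 6, 7, 8, 9, 10]
```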
And that's it. The final program is chembl_bitbound_search_in_C_v2.py. You'll need to download and run the updated popc.py to generate a new _popc module with the new C functions.
Verification and timings
First, a quick check with caffeine as the query:
% python chembl_bitbound_search_in_C.py --times
No query specified, using caffeine.
caffeine	CHEMBL113	1.000
caffeine	CHEMBL1232048	0.710
load: 8.81 search: 0.01 #queries: 1 #/sec: 84.53
% python chembl_bitbound_search_in_C_v2.py --times
No query specified, using caffeine.
caffeine	CHEMBL113	1.000
caffeine	CHEMBL1232048	0.710
load: 8.89 search: 0.01 #queries: 1 #/sec: 119.54
Looks good so far, and with some speedup. Now with 1,000 Wikipedia SMILES:
% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi --threshold 0.7 | sort > old_sorted.out
load: 11.33 search: 17.93 #queries: 994 #/sec: 55.44
% python chembl_bitbound_search_in_C_v2.py --times --queries wikipedia_1000.smi --threshold 0.7 | sort > new_sorted.out
load: 10.80 search: 11.36 #queries: 994 #/sec: 87.52
% cmp old_sorted.out new_sorted.out

Identical results (whew) and a 40% query speedup.
What about at a threshold of 0.35?
% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi --threshold 0.35 > /dev/null
load: 11.87 search: 49.19 #queries: 994 #/sec: 20.21
% python chembl_bitbound_search_in_C_v2.py --times --queries wikipedia_1000.smi --threshold 0.35 > /dev/null
load: 10.56 search: 28.26 #queries: 994 #/sec: 35.17
Also about 40% faster.
Intersection popcount speeds up full in-memory search too
Incidentally, if you go into chembl_bitbound_search_in_C_v2.py and change the line:
if threshold > 0.0:
to
if 0 and threshold > 0.0:
then that disables the BitBound limits and the implementation reverts to a full in-memory search, but with the performance benefit of computing fewer popcounts during search time. I did that, and the timings for a threshold of 0.35 are:
load: 9.30 search: 39.45 #queries: 994 #/sec: 25.20
The full in-memory search using byte_tanimoto_256() took 49.19 seconds, so switching to byte_intersect_256() makes it about 20% faster.
What's left?
If you've made it this far, and understood what I've done, then congratulations!
One of my goals in writing this series of similarity search implementations was to get you to the point where you could understand how similarity search implementations work, try it out for yourself, and see that it's really not that hard, and even rather fun.
We know there are ways to speed up the code still further because chemfp's simsearch is about twice as fast:
% simsearch --times --queries wikipedia_1000.smi --threshold 0.35 chembl_27.fps > /dev/null
open 5.83 read 0.25 search 15.39 output 0.67 total 22.15
However, the big, easy gains are gone. (Unless you have access to a machine with an AVX-512 popcount instruction!) What's left is an improved search method to reduce the number of divisions needed, an AVX2 popcount implementation, memory prefetching, and a bunch of standard optimization techniques. I mentioned them all in my paper The chemfp project.
If you're the type of person who really enjoys this sort of problem - have at it! But if you just want fast cheminformatics similarity search, why not try out chemfp?
Plus, source code licensing is available, if you really want to dig into the details of how chemfp works.
Faster BitBound ChEMBL search by using more C #
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
This is part of a series of essays I started to write a week ago where I use a few different approaches to implement cheminformatics fingerprint similarity search.
Last Thursday I developed a program which used the BitBound algorithm to do an in-memory search of the uncompressed chembl_27.fps.gz fingerprint file from ChEMBL, containing RDKit Morgan fingerprints.
Yesterday I reorganized the in-memory data structure and developed a new Python/C extension function to speed up linear search by a factor of 7.
In today's essay, I'll merge the two so the BitBound algorithm uses the new extension function. In tomorrow's essay I'll speed it up further by reducing the number of popcount calculations.
If you want to jump right to the code:
- Download and uncompress chembl_27.fps.gz.
- If you don't already have it, download and run popc.py to generate the _popc library. This is unchanged from yesterday's essay.
- Download chembl_bitbound_search_in_C.py.
- Run python chembl_bitbound_search_in_C.py --times to do the default caffeine search and see the internal timings.
- Run python chembl_bitbound_search_in_C.py --help to see the command-line help, then try your own queries, including with different thresholds.
Merge fingerprints in each bin into a single bytestring
In the BitBound essay from last week I sorted the fingerprints by popcount, resulting in 2049 bins. Each bin contains a list of (id, fingerprint) pairs. Here's the original code to create those bins:
# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]
for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1])
    target_popcount_bins[popcount].append(id_fp_pair)
In yesterday's essay, I reorganized the data to make it easier and faster to access from C. More specifically, I merged the Python list of fingerprint bytestrings into a single bytestring. This required splitting the identifiers into one list and the fingerprints into another, then merging the fingerprint list.
I'll do the same for the (id, fingerprint) pairs in the target popcount bins. One thing to watch out for is empty bins, since the zip(*values) technique returns a single empty list, rather than the two empty lists I need. But that's a simple check.
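Here's what that gotcha looks like in isolation, with made-up ids and 1-byte fingerprints for brevity:

```python
# zip(*pairs) on an empty list of pairs yields nothing, not two empty
# sequences, so unpacking its result fails; the loading code must check.
pairs = [("CHEMBL1", b"\x01"), ("CHEMBL2", b"\x03")]
ids, fps = zip(*pairs)
assert ids == ("CHEMBL1", "CHEMBL2")

try:
    ids, fps = list(zip(*[]))  # raises ValueError: nothing to unpack
except ValueError:
    ids, fps = [], b""         # the explicit empty-bin fallback
assert (ids, fps) == ([], b"")
```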
The new conversion code also needs to keep track of the largest number of fingerprints in any of the bins. This is to prevent possible array overflows in the C extension. (In yesterday's program that needed to be large enough to handle all of the target fingerprints, so binning ends up reducing the amount of space needed.)
The following code uses the same loading step from last week, then adds code (in bold) to transform the target_popcount_bins so each bin has a list of ids and the merged fingerprint bytes.
# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]
for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1])
    target_popcount_bins[popcount].append(id_fp_pair)

# Split the ids and fingerprints into parallel lists then merge the fingerprints
max_bin_size = 0
for popcount, id_fp_pairs in enumerate(target_popcount_bins):
    if id_fp_pairs:
        target_ids, target_fingerprints = list(zip(*id_fp_pairs))
        target_popcount_bins[popcount] = (target_ids, b"".join(target_fingerprints))
        max_bin_size = max(max_bin_size, len(target_ids))
    else:
        target_popcount_bins[popcount] = [[], b""]
Searching the bins
Last week's BitBound implementation figured out the number of bits in the query fingerprint, determined the min/max BitBound popcounts, then searched each fingerprint of each bin for a sufficiently similar fingerprint. Here's the key part of the implementation, which I'll use as a starting point:
has_match = False
# Determine the BitBound popcount limits
query_popcount = byte_popcount_256(query_fp)
if threshold > 0.0:
    min_popcount = math.ceil(query_popcount * fraction_threshold)
    max_popcount = math.floor(query_popcount / fraction_threshold)
    if max_popcount > 2048:
        max_popcount = 2048

# For each popcount in bound, check each of the targets.
for targets in target_popcount_bins[min_popcount:max_popcount+1]:
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")

# Print something in case there were no matches.
if not has_match:
    print(f"{query_id}\t*\t*")
In the new algorithm, the BitBounds are the same so that part of the search hasn't changed:
has_match = False
# Determine the BitBound popcount limits
query_popcount = byte_popcount_256(query_fp)
if threshold > 0.0:
    min_popcount = math.ceil(query_popcount * fraction_threshold)
    max_popcount = math.floor(query_popcount / fraction_threshold)
    if max_popcount > 2048:
        max_popcount = 2048
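As a sanity check of those limits, here's the same arithmetic on made-up numbers (a query with 48 bits set and a 0.35 threshold):

```python
# A match must have popcount in [ceil(q*t), floor(q/t)], capped at 2048,
# for query popcount q and threshold t.
import math

query_popcount, fraction_threshold = 48, 0.35   # illustrative values
min_popcount = math.ceil(query_popcount * fraction_threshold)
max_popcount = min(math.floor(query_popcount / fraction_threshold), 2048)
assert (min_popcount, max_popcount) == (17, 137)
```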
The search code is somewhat different. Each bin has different contents, and I call threshold_tanimoto_search() for the performance-critical code. It returns the number of sufficiently similar fingerprints, so I use the same code as yesterday to print the match information:
# For each popcount in bound, check each of the targets.
for target_ids, target_fingerprints in target_popcount_bins[min_popcount:max_popcount+1]:
    # Do the Tanimoto search.
    num_hits = threshold_tanimoto_search(
        query_fp, threshold, len(target_ids), target_fingerprints,
        hit_indices, hit_scores)
    if num_hits:
        # Print the hits to stdout, one per line.
        has_match = True
        for i in range(num_hits):
            hit_idx = hit_indices[i]
            score = hit_scores[i]
            target_id = target_ids[hit_idx]
            print(f"{query_id}\t{target_id}\t{score:.3f}")
Lastly, if none of the bins have any matches, print a no match output line (just like the original code):
# Print something in case there were no matches.
if not has_match:
    print(f"{query_id}\t*\t*")
Validation and Timings
I'll do my now-usual timing of nearly 1,000 structures from the Wikipedia Chemical Structure Explorer, and compare if the linear and BitBound in-memory searches give the same answers. I'll run the two programs like this:
% python chembl_in_memory_search_in_C.py --times --queries wikipedia_1000.smi | sort > old_sorted.out
load: 7.22 search: 65.06 #queries: 994 #/sec: 15.28
% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi | sort > new_sorted.out
load: 8.93 search: 16.58 #queries: 994 #/sec: 59.95
% cmp old_sorted.out new_sorted.out
You can see both programs gave the same output (once sorted, because the BitBound search changes the search order), and the new program is nearly 4x faster. In fact, the performance is roughly 40% that of chemfp's simsearch, so it's doing pretty well!
chemfp advertisement
There are still ways to make the BitBound implementation faster - I've implemented them in chemfp, which is available right now for commercial and no-cost academic licensing.
Now, a 2x speedup may not sound like much, but bear in mind that part of the reason this simple search code is so fast is because it's specialized for the 2048-bit RDKit Morgan fingerprints from ChEMBL. If you want to support other fingerprint lengths, you might start with a more general purpose Tanimoto calculation - only to find the general purpose one is also slower. If you want to support other fingerprint parameters, you'll need to write the code to handle those. K-nearest search using BitBound is not so simple, and neither is Tversky search. If you want to support Open Babel, or OEChem, then you'll need the code for that. If you want to improve fingerprint load time, then you'll need to implement that code. And if you want APIs to work with fingerprints and collections of fingerprints, you'll need to write that code too.
Or, you can try out chemfp. To install chemfp under the Chemfp Base License Agreement, use the following:
python -m pip install chemfp -i https://chemfp.com/packages/
then send me an email to request an evaluation license key.
Faster in-memory ChEMBL search by using more C #
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
This is part of a series of essays I started writing a week ago where I use a few different approaches to implement cheminformatics fingerprint similarity search.
In my previous essay, from last Friday, I developed a program to do a similarity search of the uncompressed chembl_27.fps.gz from ChEMBL containing RDKit Morgan fingerprints. That version used RDKit's BulkTanimotoSimilarity for the Tanimoto calculation. Profiling showed that 80% of the time was spent in the two lines of Python searching for scores at or above a given threshold. I concluded that I had to push more work into C.
That's what I'll do in today's essay, starting with the chembl_in_memory_search.py program from the initial in-memory ChEMBL search I developed earlier last week.
If you want to jump right to the code:
- Download and uncompress chembl_27.fps.gz.
- Download the updated popc.py.
- Run python popc.py to (re)generate the _popc module.
- Download chembl_in_memory_search_in_C.py.
- Run python chembl_in_memory_search_in_C.py --times to do the default caffeine search and see the internal timings.
- Run python chembl_in_memory_search_in_C.py --help to see the command-line help, then try your own queries, including with different thresholds.
Merge target fingerprints into a single byte string
Data representation and performance go hand-in-hand. The original in-memory search code (chembl_in_memory_search.py) stored each fingerprint as a Python byte string, called a function written in a C extension to compute the Tanimoto similarity between the query and target fingerprints, and got the score back. Last Friday's version used RDKit's BulkTanimotoSimilarity() to compute the scores at the C++ level, but still required processing a Python list.
The approach I'll use today is to merge all of the target fingerprints into a single byte string. Each 2048-bit fingerprint is 256 bytes long, which makes it very easy to get a fingerprint by index. Storing the fingerprints in one string also saves memory, because each Python byte string has about 33 bytes of overhead, which is used to store information that Python needs internally:
>>> import sys
>>> sys.getsizeof(b"")
33
>>> sys.getsizeof(b"a")
34
That's an overhead of about 12% for each 256-byte fingerprint. If this is instead stored in a single byte string then we save over 60 MB across the 1.9 million fingerprints in ChEMBL.
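That arithmetic can be checked directly. This sketch assumes CPython's usual per-object bookkeeping, and uses the essay's round figure of 1.9 million fingerprints:

```python
# Back-of-the-envelope check of the memory saving from merging the
# per-fingerprint byte strings into one block.
import sys

per_fp_overhead = sys.getsizeof(b"x" * 256) - 256   # ~33 bytes of CPython bookkeeping
num_fingerprints = 1_900_000                        # ChEMBL 27, per the essay
saving_mb = per_fp_overhead * num_fingerprints / 1e6
assert saving_mb > 60                               # matches the "over 60 MB" claim
```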
Load the target fingerprints into a byte string, and get the ids
The existing fingerprint reader iterates over the (id, fingerprint byte string) pairs. I'll separate that into a list of ids and a parallel list of byte strings using the same zip(*values) technique I described on Friday, then merge the list of byte strings into a single byte string (the new part is in bold):
# Load targets as (id, fingerprint) pairs into a list of
# target ids and a block of target fingerprints
t1 = time.time()
target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))
num_targets = len(target_fingerprints)
target_fingerprints = b"".join(target_fingerprints)  # Merge into a single byte string
t2 = time.time()
load_time = t2-t1
Put results in a hit list
In the original chembl_in_memory_search.py program I printed the fingerprints as soon as I found one with a Tanimoto score at or above the threshold. I could let the C code print the results to stdout, but I'm going to keep that part of the code in Python. That means I need some way to pass the hit information from C back to Python.
In this series of ChEMBL fingerprint search implementations, I'm trying to keep things simple. The Tanimoto calculation, for example, is hard-coded for 2048-bit fingerprints, and the implementation isn't concerned about hardware issues like word alignment.
I'll keep things simple with the hit list as well. If there are N target fingerprints then the hit list cannot contain more than N hits. So I can allocate space for N hits, have the C function fill in the values it finds, and return the number of hits found.
At the C level, the hitlist will be two parallel arrays:
int *hit_indices;   /* the indices of a matching target fingerprint */
double *hit_scores; /* the corresponding score (must be >= the search threshold) */
I need some way to allocate these arrays in a way that C code can access them. I've been using the cffi package so the program popc.py can convert C functions into a Python module named _popc. The cffi-generated package includes functions to help generate those arrays from Python:
from _popc import ffi
num_targets = ... the number of fingerprints ...
hit_indices = ffi.new("int[]", num_targets)
hit_scores = ffi.new("double[]", num_targets)
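If you want to prototype this without cffi, the stdlib array module gives a similar pair of preallocated parallel arrays. This is a sketch with made-up hits, not the essay's actual code:

```python
# Preallocated parallel "int[]" and "double[]" hit lists, stdlib-only.
from array import array

num_targets = 8
hit_indices = array("i", [0]) * num_targets   # int array of length num_targets
hit_scores = array("d", [0.0]) * num_targets  # double array of length num_targets

# A search loop fills a prefix of each array and returns the count:
num_hits = 0
for idx, score in [(2, 0.75), (5, 0.9)]:      # pretend these passed the threshold
    hit_indices[num_hits] = idx
    hit_scores[num_hits] = score
    num_hits += 1

assert list(hit_indices[:num_hits]) == [2, 5]
assert hit_scores[0] == 0.75
```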
threshold_tanimoto_search()
I then modified popc.py to include the function threshold_tanimoto_search(). It re-uses the byte_tanimoto_256() function described last week. I think the C implementation is a nearly direct mapping of the Python implementation, except that I store the hits in the two arrays rather than print them out. I also think it's understandable enough that I won't provide extra commentary.
/* Find all fingerprints in a sequential block of 2048-bit fingerprints */
/* with a Tanimoto similarity of at least 'threshold' to the query fingerprint. */
/* Return the number of hits. Hit results are in *hit_indices and *hit_scores. */
/* Assumes there is enough space in hit_indices and hit_scores! */
int threshold_tanimoto_search(unsigned char *query_fp, double threshold,
                              int num_targets, unsigned char *target_fps,
                              int *hit_indices, double *hit_scores) {
    int target_idx, num_hits = 0;
    double score;

    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        score = byte_tanimoto_256(query_fp, target_fps + target_idx*256);
        if (score >= threshold) {
            hit_indices[num_hits] = target_idx;
            hit_scores[num_hits] = score;
            num_hits++;
        }
    }
    return num_hits;
}
(The final popc.py includes the configuration needed to have cffi include the extension code so the new C function can be called from Python. See the program for details.)
I've updated popc.py, which means I need to re-run it in order to generate the updated _popc extension.
% python popc.py
generating ./_popc.c
the current directory is '/Users/dalke/demo'
running build_ext
building '_popc' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/dalke/venvs/py38-2020-8/include -I/Users/dalke/local-3.8/include/python3.8 -c _popc.c -o ./_popc.o -mpopcnt
gcc -bundle -undefined dynamic_lookup ./_popc.o -o ./_popc.cpython-38-darwin.so
Updated ChEMBL similarity search
The function call at the Python level looks like this:
num_hits = threshold_tanimoto_search(
    query_fp, threshold, num_targets, target_fingerprints,
    hit_indices, hit_scores)
The parameters are:
- query_fp - the query fingerprint as a 256-byte bytestring;
- threshold - the minimum similarity threshold as a Python float between 0.0 and 1.0;
- num_targets - the number of target fingerprints;
- target_fingerprints - the num_targets target fingerprints, each 256 bytes long, concatenated into a single byte string;
- hit_indices - an int* array which will contain the indices of the fingerprints which are sufficiently similar to the query fingerprint;
- hit_scores - a double* array containing the corresponding scores;
and the function returns an integer, which is the number of hits found.
That pushes the entire search into C, so what's needed is a driver in Python to do a search for each query and report the hits:
# Do a threshold search for each query.
t1 = time.time()
num_queries = 0
for query_id, query_fp in query_reader:
    num_queries += 1
    # Do the Tanimoto search.
    num_hits = threshold_tanimoto_search(
        query_fp, threshold, num_targets, target_fingerprints,
        hit_indices, hit_scores)
    if num_hits:
        # Print the hits to stdout, one per line.
        for i in range(num_hits):
            hit_idx = hit_indices[i]
            score = hit_scores[i]
            target_id = target_ids[hit_idx]
            print(f"{query_id}\t{target_id}\t{score:.3f}")
    else:
        # Print something in case there were no matches.
        print(f"{query_id}\t*\t*")
Does it work?
The final version is available as chembl_in_memory_search_in_C.py.
The C-based similarity search should give identical results to the original Python-based similarity search. Does it? As queries I'll use the first 1,000 structures from the SMILES data I downloaded from the Wikipedia Chemical Structure Explorer. (An earlier essay describes how I converted that data into a SMILES file.) The following omits some error and warning messages.
% python chembl_in_memory_search.py --queries wikipedia_1000.smi --times > old_1000.out
load: 5.79 search: 530.20 #queries: 994 #/sec: 1.87
% python chembl_in_memory_search_in_C.py --queries wikipedia_1000.smi --times > new_1000.out
load: 7.19 search: 70.40 #queries: 994 #/sec: 14.12
% cmp old_1000.out new_1000.out
Since cmp produces no output, the two output files are byte-for-byte identical.
Now, compare the two timings. The C implementation is over 7x faster! (Take the numbers with a grain of salt; I ran the program on my laptop, with a video playing in the background when I ran the Python-based implementation.)
Not bad for under 20 lines of C code.
chemfp advertising
As a reminder, one of my goals in this series is to show that it's easy to implement fingerprint similarity search, but hard to implement high-performance similarity search. I hope these details will help convince people to purchase a chemfp license.
To give something concrete, the above C implementation is much faster than the Python version, but chemfp's simsearch is another 10x or so faster still, using the same input files:
% simsearch --queries wikipedia_1000.smi chembl_27.fps --threshold 0.7 --times > chemfp_1000.out open 5.86 read 0.26 search 6.87 output 0.02 total 13.01
The output isn't in the same format, so the files can't be compared directly. A (long) one-liner to transform the output is:
grep -v '^#' chemfp_1000.out | \
  awk 'BEGIN{FS="\t"} \
       $1==0 {printf("%s\t*\t*\n", $2)} \
       $1>0 {for(i=3; i<NF; i+=2) {printf("%s\t%s\t%.3f\n", $2, $i, $(i+1))}}' | \
  sort > chembl_1000_sorted.out
A comparison shows 4 mismatches where the last digit was off by 1, all because of the extra rounding step in the one-liner needed to switch from simsearch's 5-digit output scores to the 3 decimal places I've been using in the simple programs.
More information about chemfp can be found from the chemfp home page.
Using RDKit BulkTanimotoSimilarity #
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
In the first essay of this series I developed a program to do a Tanimoto similarity search of chembl_27.fps. I evaluated three different data types: an RDKit ExplicitBitVect, a GMP mpz integer, and a byte string. I found that the byte string was fastest, at about 5 seconds per search. All three programs worked by parsing and testing each FPS record in sequence. A scan search like this requires essentially constant memory overhead no matter the size of the FPS file.
In the third essay of this series I developed an in-memory Tanimoto search program with two stages. The first stage read all of the fingerprints into memory, and the second did the linear Tanimoto calculation and tests of all of the fingerprints. This program took 4-5 seconds to load the fingerprints, and about 0.5 seconds to search. (Yesterday's essay added the BitBound algorithm, which prunes some of the search space and is increasingly effective as the similarity threshold increases.)
In this essay I'll revisit the choice of data type and try an in-memory search using RDKit's ExplicitBitVect() and BulkTanimotoSimilarity(). I find that the load time is about 32 seconds and the search time is about 0.6 seconds per query.
If you really want high-performance cheminformatics fingerprint similarity search, try chemfp.
Loading ExplicitBitVect fingerprints
On Wednesday I ended up with chembl_in_memory_search.py. It is not hard to change the code so it uses an ExplicitBitVect instead of a byte string for the fingerprints. The result is at chembl_bulk_tanimoto.py.
The two differences are:
- get_query_fp() returns the ExplicitBitVect() directly from GetMorganFingerprintAsBitVect() rather than convert it to a byte string.
- read_fingerprints() uses DataStructs.CreateFromFPSText() to parse the hex-encoded fingerprints into an ExplicitBitVect(), instead of using bytes.fromhex() to create a byte string.
Load the data sets
I'm going to use BulkTanimotoSimilarity() to compute the similarity scores for each fingerprint. That takes an RDKit fingerprint as the query and a list (or tuple) of RDKit fingerprints as the targets. However, the reader returns an iterator of (id, fingerprint) pairs. I'll need to convert that into two parallel lists: one with ids and one with fingerprints.
The obvious way to do that is something like:
target_ids = []
target_fingerprints = []
for target_id, target_fingerprint in fingerprint_reader:
    target_ids.append(target_id)
    target_fingerprints.append(target_fingerprint)
However, there's an easy but non-obvious way to do that, using zip(*values).
zip(*values)
The zip() built-in function lets you iterate over zero or more lists in parallel. For example, if you have a list of identifiers and an associated list of SMILES then you might print them out together like this:
>>> id_list = ["CHEBI:18153", "CHEBI:18407", "CHEBI:27518", "CHEBI:33602"]
>>> smiles_list = ["C=C", "C#N", "C#C", "BBB"]
>>> for id, smiles in zip(id_list, smiles_list):
...     print(f"{id} is {smiles}")
...
CHEBI:18153 is C=C
CHEBI:18407 is C#N
CHEBI:27518 is C#C
CHEBI:33602 is BBB
The construct f(*args) is special syntax to unpack an argument list. It uses args[0] as the first argument, args[1] as the second, and so on. For example, suppose data contains a list of elements and I want to print the list out, with one element per line:
>>> data = [('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB')]
>>> print(*data, sep="\n")
('CHEBI:18153', 'C=C')
('CHEBI:18407', 'C#N')
('CHEBI:27518', 'C#C')
('CHEBI:33602', 'BBB')
The print(*data, sep="\n") call is identical to print(data[0], data[1], data[2], data[3], sep="\n"), which prints each element, with a newline between each one (and the default end="\n" printing the final newline).
The zip() call and the f(*args) syntax work together to give the following:
>>> data = [('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB')]
>>> id_list, smiles_list = list(zip(*data))
>>> id_list
('CHEBI:18153', 'CHEBI:18407', 'CHEBI:27518', 'CHEBI:33602')
>>> smiles_list
('C=C', 'C#N', 'C#C', 'BBB')
The step-by-step transformations are:
- list(zip(*data)) is the same as:
- list(zip(data[0], data[1], data[2], data[3])) is the same as:
- list(zip(('CHEBI:18153', 'C=C'), ('CHEBI:18407', 'C#N'), ('CHEBI:27518', 'C#C'), ('CHEBI:33602', 'BBB'))) is the same as:
- [('CHEBI:18153', 'CHEBI:18407', 'CHEBI:27518', 'CHEBI:33602'), ('C=C', 'C#N', 'C#C', 'BBB')] (because zip yields all of the first elements, then all of the second elements, and there are only two elements in each parameter).
As a result, id_list contains all of the first terms in the data elements, and smiles_list contains all of the second terms in the data elements.
For some people it may help to think of zip(*args) as generating the transpose of args.
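To make the transpose view concrete, here is a small sketch (my example, not from the essay) with a two-row "matrix" of tuples:

```python
# zip(*rows) turns a list of rows into a list of columns -- a transpose.
matrix = [(1, 2, 3),
          (4, 5, 6)]
columns = list(zip(*matrix))
# Each column tuple collects one position from every row.
```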
Actually loading the data sets
All that discussion was there to show how the old code, which read the ids and fingerprints into a list of (id, fingerprint) pairs using the following:
targets = list(read_fingerprints(open(args.targets)))
was transformed into reading a list of ids and a list of fingerprints by using the following:
target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))
Easy to write. Not obvious to understand the first few times you come across that idiom!
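One caveat worth adding (my note, not in the original essay): the idiom fails if the reader yields no records, because zip() then produces nothing to unpack into the two names. A sketch:

```python
# Hypothetical helper showing the zip(*pairs) idiom and its empty-input failure.
def split_pairs(pairs):
    ids, fps = zip(*pairs)
    return list(ids), list(fps)

ids, fps = split_pairs([("id1", b"\x01"), ("id2", b"\x02")])

empty_failed = False
try:
    split_pairs([])   # zip() yields nothing, so the 2-name unpack raises
except ValueError:
    empty_failed = True
```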
BulkTanimotoSimilarity()
RDKit's BulkTanimotoSimilarity() takes a query fingerprint and a list of target fingerprints, and returns a list of scores, one for each target fingerprint. It's straightforward to adapt the original in-memory search program to use this bulk function:
# Compute the score with each of the targets.
scores = DataStructs.BulkTanimotoSimilarity(query_fp, target_fingerprints)
# Find the records with a high enough similarity.
for i, score in enumerate(scores):
    if score >= threshold:
        # Need to get the corresponding id
        target_id = target_ids[i]
        has_match = True
        print(f"{query_id}\t{target_id}\t{score:.3f}")
Since this function computes all of the Tanimoto scores, you could easily modify this code to sort it or use a priority queue and find the nearest k neighbors.
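As a sketch of the k-nearest variant (my code, not from the essay), heapq.nlargest avoids sorting the full score list; `scores` and `target_ids` are assumed to be the parallel sequences described in the text:

```python
import heapq

def k_nearest(scores, target_ids, k):
    # Keep only the k highest-scoring (score, id) pairs, best first,
    # instead of sorting all ~1.9 million scores.
    return heapq.nlargest(k, zip(scores, target_ids))

nearest = k_nearest([0.25, 0.91, 0.50, 0.77], ["A", "B", "C", "D"], 2)
```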
Also, RDKit implements many Bulk*
functions
(BulkAllBitSimilarity, BulkAsymmetricSimilarity,
BulkBraunBlanquetSimilarity, BulkCosineSimilarity, BulkDiceSimilarity,
BulkKulczynskiSimilarity, BulkMcConnaugheySimilarity,
BulkOnBitSimilarity, BulkRogotGoldbergSimilarity,
BulkRusselSimilarity, BulkSokalSimilarity, BulkTanimotoSimilarity,
BulkTverskySimilarity) with the same function call API, so it's easy
to modify the code to use one of these alternate comparison methods.
Timings
The result is chembl_bulk_tanimoto.py. I kept all of the timing code I developed for Wednesday's chembl_in_memory_search.py, which I'll enable so I can compare the two:
% python chembl_in_memory_search.py --queries wikipedia_100.smi --times > old.out
load: 5.27 search: 52.63 #queries: 100 #/sec: 1.90
% python chembl_bulk_tanimoto.py --queries wikipedia_100.smi --times > new.out
load: 32.92 search: 60.22 #queries: 100 #/sec: 1.66
% cmp old.out new.out
It's always a relief to see that a cross-comparison gives the same results!
I'm not surprised that the RDKit search time is a bit slower than the raw byte string search code, because the RDKit implementation does more work. For example, the BulkTanimotoSimilarity() function accepts both ExplicitBitVect() and SparseBitVect() fingerprint types (the latter is for sparse fingerprints), so there's some overhead in figuring out the right way to do the computation, and it double-checks that the fingerprints have the same length. The programs I wrote, on the other hand, only work for 2048-bit fingerprints and may crash Python itself if used incorrectly.
Are there obvious ways to speed things up?
The same BitBound algorithm from yesterday's essay can be applied to improve the BulkTanimotoSimilarity() performance. However, most of the time is not spent computing the similarity scores! I used kernprof with the first 100 Wikipedia SMILES to profile the program and found that nearly 80% of the time is spent iterating through the scores and testing them against the threshold:
.... lines omitted ...
Line #      Hits         Time   Per Hit   % Time  Line Contents
==============================================================
.... lines omitted ...
   115          1  32567959.0 32567959.0     9.7  target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))
.... lines omitted ...
   129                                            # Compute the score with each of the targets.
   130        100  40625764.0   406257.6    12.1  scores = DataStructs.BulkTanimotoSimilarity(query_fp, target_fingerprints)
   131                                            # Find the records with a high enough similarity.
   132  194141100 133618300.0        0.7    39.9  for i, score in enumerate(scores):
   133  194141000 127750939.0        0.7    38.2      if score >= threshold:
.... lines omitted ...
The only way to make this significantly faster is to push more work into C.
If you are interested in improving the performance of the parts of RDKit I covered here, I suggest you look into speeding up DataStructs.CreateFromFPSText(), which seems to take rather longer than I think it should.
Simple BitBound ChEMBL similarity search #
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
In yesterday's essay I changed the scan-based Tanimoto search to an in-memory search and showed that after a start-up cost of about 5 seconds I was able to do about 2 searches per second of the 1.9 million ChEMBL fingerprints.
I ended by pointing out how chemfp was over 100x faster.
How is chemfp so much faster? One clear reason is the search code is all implemented in C. A comparison test like "score >= threshold" can be done in one CPU cycle, but the equivalent in Python likely takes thousands of CPU cycles. Another is Python's function call overhead - I suspect it takes longer to set up the call to byte_tanimoto_256() and convert the result to Python than it does to compute the actual Tanimoto.
The chemfp paper covers all of the techniques I used, and reviews the other ones I know about. In this essay I will implement the BitBound algorithm, which is a simple technique that can often reduce the number of fingerprints to test.
If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:
- Download chembl_27.fps.gz.
- Download the updated popc.py.
- Run python popc.py to generate the _popc module.
- Download chembl_bitbound_search.py.
- Run python chembl_bitbound_search.py --times to do the default caffeine search and see the internal timings.
- Run python chembl_bitbound_search.py --help to see the command-line help, then try your own queries, including with different thresholds.
The new program averages about 0.3 seconds per query with the default threshold of 0.7, and with an initial load time of about 8 seconds.
Update: A week later I implemented faster versions, first by reorganizing the data so I can do more work in C instead of Python, and second by replacing the full byte_tanimoto_256() calculation with pre-computed query and target popcounts and a byte_intersect_256() calculation.
BitBound algorithm
Swamidass and Baldi pointed out that if a query fingerprint A has a bits set, and the goal is to find target fingerprints Bi which have a Tanimoto score of at least T with the query, then we can set bounds on the number of bits b in the target fingerprints. More specifically: a×T ≤ b ≤ a/T.
(They generalize this calculation to the Tversky similarity, and also describe a search method to improve nearest neighbor searches.)
What this means is, if the number of bits in the target fingerprints can be pre-computed then those fingerprints which are out-of-bounds for a given query don't need to be considered.
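The bound can be checked by brute force. This sketch (my code, not from the essay) enumerates every pair of tiny 4-bit fingerprints with exact rational arithmetic and confirms that any pair at or above the threshold has an in-bounds target popcount:

```python
from fractions import Fraction
from itertools import product

T = Fraction(1, 2)  # a threshold of 0.5, exact
violations = []
for fp1, fp2 in product(range(16), repeat=2):  # all 4-bit fingerprints
    a = bin(fp1).count("1")                    # query popcount
    b = bin(fp2).count("1")                    # target popcount
    union = bin(fp1 | fp2).count("1")
    intersect = bin(fp1 & fp2).count("1")
    # Only pairs scoring at or above the threshold must obey a*T <= b <= a/T.
    if union and Fraction(intersect, union) >= T:
        if not (a * T <= b <= a / T):
            violations.append((fp1, fp2))
```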
There are a few ways to implement this. One is to add the pre-computed popcount to each record; however, this still means testing every record. Another is to sort the records into bins, one for each popcount. Then the search need only test the bins which might contain a similarity match. I'll implement this second approach.
Compute the fingerprint popcount
I don't yet have a way to compute the popcount of each of the target fingerprints. Earlier I decided to use byte strings for them, and used cffi to write popc.py, a program which generates the _popc module with the Tanimoto similarity and cosine similarity functions, hard-coded for 2048-bit fingerprints. I'll extend that module so it implements byte_popcount_256(), which computes the popcount of a 256-byte byte string.
The implementation is a straight-forward simplification of the similarity functions, so I'll present it here without further description (see popc.py for the full implementation):
static int
byte_popcount_256(const unsigned char *fp) {
    int num_words = 2048 / 64;
    int popcnt = 0;
    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp_64 = (uint64_t *) fp;
    for (int i=0; i<num_words; i++) {
        popcnt += __builtin_popcountll(fp_64[i]);
    }
    return popcnt;
}
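For cross-checking the C version, here is a pure-Python equivalent (my sketch; it is not part of popc.py). It treats the byte string as one big integer and counts the on-bits:

```python
def py_popcount(fp: bytes) -> int:
    # Count the on-bits in a fingerprint byte string.
    # (On Python 3.10+ this could be int.from_bytes(fp, "big").bit_count().)
    return bin(int.from_bytes(fp, "big")).count("1")
```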
Pre-compute the target fingerprints and put into bins
Just like in yesterday's essay, I'll first read the fingerprints into memory and then search them. However, this time I'll organize the fingerprints by popcount. There are 2049 possible values (from 0 bits set to all 2048 bits set), so I'll create a list of 2049 empty lists. For each fingerprint record, I'll compute the fingerprint popcount and append it to the appropriate list. Here's what that code looks like:
# Load targets into popcount bins, ordered by the fingerprint popcount.
# Each bin contains a list of (id, fingerprint) pairs.
t1 = time.time()
# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]
for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1])
    target_popcount_bins[popcount].append(id_fp_pair)
t2 = time.time()
load_time = t2-t1
Determine the bounds
It seems easy to compute the bit bounds with something like:
query_popcount = byte_popcount_256(query_fp)
min_popcount = query_popcount * threshold
max_popcount = query_popcount / threshold
and then use the integer popcount values between min_popcount and max_popcount. The lowest popcount value is the smallest integer greater than or equal to min_popcount, and the highest is the largest integer no greater than max_popcount. You might think this would work:
# This is wrong!
import math
query_popcount = byte_popcount_256(query_fp)
min_popcount = math.ceil(query_popcount * threshold)
max_popcount = math.floor(query_popcount / threshold)
for popcount in range(min_popcount, max_popcount+1):
    ...
In practice this is tricky because the calculations use IEEE 754 doubles. Some operations which should ideally result in an integer value ... don't. Here is an example where min_popcount would be wrong:
>>> threshold = 0.55
>>> query_popcount = 1580
>>> threshold * query_popcount
869.0000000000001
>>> import math
>>> math.ceil(threshold * query_popcount)
870
and here's an example where the max_popcount would be wrong:
>>> threshold = 0.55
>>> query_popcount = 396
>>> query_popcount / threshold
719.9999999999999
>>> math.floor(query_popcount / threshold)
719
Solution #1: change by epsilon
One solution, which will work for normal threshold values, is to add or subtract a small value to the ratio before doing the floor or ceiling call.
>>> import math
>>> EPSILON = 1e-10
>>> threshold = 0.55
>>> threshold * 1580
869.0000000000001
>>> math.ceil(threshold * 1580 - EPSILON)
869
>>> 396 / threshold
719.9999999999999
>>> math.floor(396 / threshold)
719
>>> math.floor(396 / threshold + EPSILON)
720
This works because 1) it's chemically unreasonable to specify a threshold with more than 3 significant digits, and 2) the difference between two distinct fingerprint scores cannot be less than 1/(2048×2048), which is far larger than EPSILON.
Solution #2: use rationals
I think a better solution is to do the calculation using rationals, so the operations can be done with integers. (This is especially important if you want to handle the Tversky BitBounds; in chemfp I gave up trying to get IEEE 754 math to work correctly and ended up multiplying alpha and beta values by 10,000 to work with integers.)
Since I'm using Python, I can use Python's built-in fractions module, which implements a rational data type:
>>> import math, fractions
>>> threshold = fractions.Fraction("0.55")
>>> math.ceil(threshold * 1580)
869
>>> math.floor(396 / threshold)
720
(I actually used the fractions module to find the examples where the simple application of IEEE 754 doubles causes problems.)
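One detail worth spelling out (my aside, not in the original): construct the Fraction from the threshold's string form. Constructing it from a float captures the already-rounded binary64 value rather than the decimal you meant:

```python
from fractions import Fraction

from_string = Fraction("0.55")  # exactly 11/20
from_float = Fraction(0.55)     # the binary64 double closest to 0.55, not 11/20
```

This is also why passing argparse a type of fractions.Fraction works well: argparse hands it the raw command-line string.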
Only a few minor changes are needed to use a fraction for the threshold, starting with the argparse --threshold parameter:
parser.add_argument("--threshold", "-t", type=fractions.Fraction,
                    default=fractions.Fraction("0.7"),
                    help="minimum threshold (default: 0.7)")

For performance reasons, I keep track of both the fraction threshold (only used to find the popcount bounds) and the floating point value (used when checking the Tanimoto score):
fraction_threshold = args.threshold
threshold = float(fraction_threshold)
The popcount bounds calculation, when using a fraction threshold, is the straightforward, expected one:
min_popcount = 0
max_popcount = 2048
for query_id, query_fp in query_reader:
    ...
    query_popcount = byte_popcount_256(query_fp)
    if threshold > 0.0:
        min_popcount = math.ceil(query_popcount * fraction_threshold)
        max_popcount = math.floor(query_popcount / fraction_threshold)
        if max_popcount > 2048:
            max_popcount = 2048
The check for threshold > 0.0 is to prevent divide-by-zero errors. Of course, if the threshold is 0.0 then everything will match, and there will be a lot of overhead just printing the results.
Search the appropriate bins
Now that I have the bounds I can get the bins that are in range, and for each bin check each of the target fingerprints:
for targets in target_popcount_bins[min_popcount:max_popcount+1]:
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")
Does it work?
The final code is in chembl_bitbound_search.py. But does it work?
I used chembl_in_memory_search.py from yesterday's essay to process the first 1,000 entries from the SMILES downloaded from the Wikipedia Chemical Structure Explorer, then processed the same entries with chembl_bitbound_search.py, then compared the two. The two programs have a different search order so the outputs cannot be compared directly. Instead, I compared the sorted outputs to each other (and excluded the warning and error messages in the following):
% python chembl_in_memory_search.py --times --queries wikipedia_1000.smi | sort > old.txt
load: 5.27 search: 491.73 #queries: 994 #/sec: 2.02
% python chembl_bitbound_search.py --times --queries wikipedia_1000.smi | sort > new.txt
load: 8.10 search: 297.42 #queries: 994 #/sec: 3.34
% cmp old.txt new.txt
They matched, and it was almost 40% faster. So that's a good first test. I'll also run with a higher threshold to see if the overall performance changes:
% python chembl_bitbound_search.py --times --queries wikipedia_1000.smi --threshold 0.9 > /dev/null
load: 5.53 search: 81.74 #queries: 994 #/sec: 12.16
Nice speedup!
A more sensitive test would be to construct fingerprints with bit patterns that trigger edge cases in the code, and to add instrumentation to ensure the BitBound calculation is not accidentally too wide. That's too much for this essay.
Can improve the Tanimoto calculation performance
If the target fingerprints are sorted by popcount then we know that all fingerprints in a given bin have the same popcount. We also know the number of bits in the query fingerprint. This can be used to improve the search performance by reducing the amount of work needed to compute the Tanimoto.
The heart of the current code, in byte_tanimoto_256(), is:
for (int i=0; i<num_words; i++) {
    intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
    union_popcount += __builtin_popcountll(fp1_64[i] | fp2_64[i]);
}
return ((double) intersect_popcount) / union_popcount;
If the query and target popcounts are known then this reduces to:
for (int i=0; i<num_words; i++) {
    intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
}
return ((double) intersect_popcount) /
           (query_popcount + target_popcount - intersect_popcount);
I'll leave that as an exercise for the student.
Can skip the division in the Tanimoto calculation
Another possible speedup is to avoid the division in the Tanimoto calculation altogether. If you work out the math you'll find that the target fingerprint will pass the test if the intersection popcount C satisfies:
C ≥ threshold * (query_popcount + target_popcount) / (1 + threshold)
This calculation is another place where it's easier to calculate using rationals than with IEEE 754 doubles.
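As a sketch of that division-free test using rationals (my code and naming, not from the essay): C is the intersection popcount, and a and b are the query and target popcounts. The comparison C/(a+b−C) ≥ T is rearranged to C(1+T) ≥ T(a+b), which needs no division at all:

```python
from fractions import Fraction

def passes_threshold(C, a, b, threshold="0.7"):
    # Build the Fraction from a string so the threshold is exact.
    T = Fraction(threshold)
    # Equivalent to C / (a + b - C) >= T, without the division.
    return C * (1 + T) >= T * (a + b)
```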
chemfp advertisement
I've been trying to make the point that while it's easy to implement a similarity search program, there are a lot of complicated details to make a fast similarity search program.
You could go ahead and spend the days or weeks to implement these features yourself, or you could install and test chemfp (under the Base License Agreement) by doing:
python -m pip install chemfp -i https://chemfp.com/packages/
The Base License Agreement does not allow you to create FPB files or do an in-memory search with more than 50,000 fingerprints. If you are interested in those features, take a look at the other licensing options then contact me to ask for an evaluation license key.
Simple in-memory ChEMBL similarity search #
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
In the previous two essays I showed how to search chembl_27.fps to find records with similar fingerprints to a query fingerprint, then how to implement a nearest-neighbor search and replace Tanimoto similarity with cosine similarity. The final program took about 5 seconds.
In this essay I'll show how to increase the search performance by shifting more of the work to a load step. This sort of precomputation can be useful if the load step is done once, with the extra overhead shared over multiple searches.
If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:
- Download chembl_27.fps.gz.
- Download popc.py.
- Run python popc.py to generate the _popc module.
- Download chembl_in_memory_search.py.
- Run python chembl_in_memory_search.py --times to do the default caffeine search and see the internal timings.
- Run python chembl_in_memory_search.py --help to see the command-line help, then try your own queries, including from a SMILES or FPS file.
The new program averages about 0.5 seconds per query, with an initial 4-5 second load time.
In tomorrow's essay, I'll describe how to implement the BitBound algorithm, which speeds up most searches by reducing the number of fingerprints to test. I'll revisit this in-memory search algorithm next week when I implement more of the search code in C.
Measuring FPS parsing overhead
I profiled the program and found that most of the time was spent in parsing the FPS file, not in computing the Tanimoto. To profile, I wrapped the entire program into a function (the profiler I'm using works on functions), added the @profile decorator, which tells kernprof which function(s) to profile, and profiled the result with:
% kernprof -l ssimsearch_bytes.py
then viewed the profiler results with
% python -m line_profiler ssimsearch_bytes.py.lprof
Timer unit: 1e-06 s

Total time: 13.0334 s
File: ssimsearch_bytes.py
Function: main at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def main():
     3         1       6859.0   6859.0      0.1      from _popc.lib import byte_tanimoto_256
     4
     5         1     224418.0 224418.0      1.7      from rdkit import Chem
     6         1          6.0      6.0      0.0      from rdkit import DataStructs
     7         1       5629.0   5629.0      0.0      from rdkit.Chem import rdMolDescriptors
     8
     9         1          1.0      1.0      0.0      smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    10         1          1.0      1.0      0.0      threshold = 0.7
    11
    12         1      11504.0  11504.0      0.1      mol = Chem.MolFromSmiles(smiles)
    13         1          2.0      2.0      0.0      if mol is None:
    14                                                   raise SystemExit(f"Cannot parse SMILES {smiles!r}")
    15
    16         2         98.0     49.0      0.0      rd_query_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    17         1          1.0      1.0      0.0          mol, radius=2, nBits=2048, useChirality=0,
    18         1          0.0      0.0      0.0          useBondTypes=1, useFeatures=0)
    19         1         15.0     15.0      0.0      query_fp = DataStructs.BitVectToBinaryText(rd_query_fp)
    20
    21
    22         1         97.0     97.0      0.0      infile = open("chembl_27.fps")
    23         7          8.0      1.1      0.0      for i in range(6):
    24         6         32.0      5.3      0.0          line = infile.readline()
    25         1          3.0      3.0      0.0      assert line.startswith("#date")
    26
    27   1941411    3356477.0      1.7     25.8      for line in infile:
    28   1941410    3207597.0      1.7     24.6          target_hex_fp, target_id = line.split()
    29   1941410    2605370.0      1.3     20.0          target_fp = bytes.fromhex(target_hex_fp)
    30
    31   1941410    2071065.0      1.1     15.9          score = byte_tanimoto_256(query_fp, target_fp)
    32   1941410    1544099.0      0.8     11.8          if score >= threshold:
    33         2         98.0     49.0      0.0              print(f"{target_id}\t{score}")
Lines 27-29 are all part of parsing the FPS file, and they take about 70% of the search time. The actual Tanimoto calculation (line 31) and score test (line 32) together take about 28% of the time.
Multiple queries
What if you want to search for multiple queries? One implementation might do the full search of the first query, then of the second, and so on. This would take about 5*N seconds.
The above profile suggests that if you could load the fingerprint data into a more easily processed form, where the pre-processing was done once, then the overall performance would be faster. This may be useful in a web service where extra work can be done during startup in order to improve the per-request performance.
I changed the main loop (lines 27-33 in the above) so there are two stages. Stage 1 loads all of the fingerprints into memory, and stage 2 searches those fingerprints. I added timing measurements for each stage:
import time

t1 = time.time()
targets = []
for line in infile:
    target_hex_fp, target_id = line.split()
    target_fp = bytes.fromhex(target_hex_fp)
    targets.append( (target_id, target_fp) )
t2 = time.time()

for target_id, target_fp in targets:
    score = byte_tanimoto_256(query_fp, target_fp)
    if score >= threshold:
        print(f"{target_id}\t{score}")
This reports:
Load: 4.64
Search: 0.79
What multiple queries?
Previously I supported a hard-coded SMILES string, or a user-defined SMILES string on the command-line. If there are going to be multiple queries, what are they, and where do they come from?
I decided I was going to support a SMILES file or FPS file as input, based on whether the given query filename ends with .smi or not. I already have a way to read_fingerprints() from an FPS file, and I have a way to compute the RDKit Morgan fingerprint given a SMILES, so all I need is a way to read a SMILES file that yields the same (id, fingerprint) pairs as read_fingerprints().
SMILES file
There are two main types of SMILES files: Daylight-like and CSV-like. A Daylight SMILES file contains one record per line. Each line contains a SMILES, followed by a whitespace character, followed by a molecule title/id which extends to the rest of the line.
A CSV-like SMILES file contains a SMILES followed by a space or tab separator character, followed by a molecule title/id, followed optionally by successive separator characters and additional fields. In addition, the first line may contain column headers.
I'm going to support Daylight-like SMILES files.
SMILES file parser
It's pretty easy to read either SMILES file variant, though it's more complex if you want full configurability. In short, for every line in the file, split on the first whitespace. The first field is the SMILES, the second is the identifier. Convert the SMILES to a fingerprint, or skip the record if the SMILES could not be processed. That gives the needed (id, fingerprint) pairs:
# Read a SMILES file, convert the SMILES to a fingerprint,
# and yield (id, fingerprint) pairs for each parsable record.
def read_structure_fingerprints_from_smiles_file(infile):
    for lineno, line in enumerate(infile, 1):
        # Expecting the format: <SMILES> + whitespace + <title> + "\n"
        smiles, id = line[:-1].split(None, 1)
        try:
            # Get the corresponding RDKit Morgan fingerprint
            fp = get_query_fp(smiles)
        except ValueError as err:
            # Skip records which can't be parsed
            sys.stderr.write(f"ERROR: {err}: Skipping line {lineno}\n")
            continue
        yield id, fp
Reporting search results
Since there may be multiple queries, each with 0 or more hits, I need some way to figure out which query matches which targets. I decided to use a simple three-column tab-separated format, with the query id in the first column, the target id in the second column, and the score in the third, like the following:
Amygdalin	CHEMBL2303964	0.796
Amygdalin	CHEMBL2303958	0.700
Boron nitride	*	*
Bicarbonate	CHEMBL1353	0.875
Bicarbonate	CHEMBL2106975	0.875
Benzoic acid	CHEMBL541	1.000
A problem with this format is there's no obvious way to say that a query has no hits. Perhaps those should be omitted from the output? What I decided was to have one output line for that case, with * for the target id and score. You can see that in the Boron nitride line, above.
A search implementation
Assuming targets
is a list of (target id, target fingerprint) pairs, and
query_reader
is an iterator of (query id, query fingerprint)
pairs (either from read_fingerprints() or
read_structure_fingerprints_from_smiles_file()) then the following
shows the structure of the main search loop and reporting.
for query_id, query_fp in query_reader:
    has_match = False
    # Check each of the targets
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")
    # Print something in case there were no matches.
    if not has_match:
        print(f"{query_id}\t*\t*")
(The final code includes a count of the number of queries, which is used for reporting.)
Command-line options with argparse
There are several Python packages for parsing command-line arguments. I use argparse because it's part of Python's standard library. Many are fond of click as a third-party package.
I decided I wanted to allow a single query SMILES input, or input from either a SMILES file or an FPS file. I also wanted to be able to change the threshold, specify the location of the chembl_27.fps file (in case it isn't in the current directory), and support a --times option to report the load and search times. Here's what the --help from the program looks like:
usage: chembl_in_memory_search.py [-h] [--threshold THRESHOLD]
                                  [--query SMILES | --queries FILENAME]
                                  [--times]
                                  [FILENAME]

in-memory ChEMBL fingerprint search

positional arguments:
  FILENAME              location of chembl_27.fps

optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD, -t THRESHOLD
                        minimum threshold (default: 0.7)
  --query SMILES        query smiles
  --queries FILENAME    input query file (SMILES or FPS)
  --times               print timing information to stderr
This sort of configuration is pretty easy to do with argparse. The hardest thing is that the single-query SMILES search and the file-based search are exclusive, meaning they can't both be specified at the same time.
parser = argparse.ArgumentParser(
    description = "in-memory ChEMBL fingerprint search")
parser.add_argument("--threshold", "-t", type=float, default=0.7,
                    help="minimum threshold (default: 0.7)")
g = parser.add_mutually_exclusive_group()
g.add_argument("--query", metavar="SMILES", help="query smiles")
g.add_argument("--queries", metavar="FILENAME",
               help="input query file (SMILES or FPS)")
parser.add_argument("--times", action="store_true",
                    help="print timing information to stderr")
parser.add_argument("targets", metavar="FILENAME", default="chembl_27.fps",
                    nargs="?", help="location of chembl_27.fps")
An argparse.ArgumentParser() instance (like parser in the above) is usually used something like:
args = parser.parse_args()
print("Threshold:", args.threshold)
That is, parse_args() parses the command line and returns an argparse.Namespace() instance with the parsed command-line parameters accessible as attributes.
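A minimal, self-contained sketch of that pattern (my example): parse_args() also accepts an explicit argument list, which is handy for testing without touching sys.argv:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--threshold", "-t", type=float, default=0.7)

args = parser.parse_args(["-t", "0.9"])  # as if the command line were: -t 0.9
default_args = parser.parse_args([])     # no arguments, so the default applies
```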
Setting up the query reader
The last tricky part to explain is how to handle both --query SMILES and --queries FILENAME. I decided to have a function which returns a query reader, that is, something which can be iterated over to get the successive pairs of (query id, query fingerprint). I have all the parts - I just need to hook them together correctly. And, if neither option is specified, I'll have the program search for caffeine, with a warning that that's what it will do.
# Helper function to figure out how the queries are specified
def open_query_reader(args):
    if args.query is None:
        if args.queries is None:
            # Default search
            sys.stderr.write("No query specified, using caffeine.\n")
            query_fp = get_query_fp("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
            return [("caffeine", query_fp)]
        elif args.queries.endswith(".smi"):
            # If --queries ends with ".smi"
            return read_structure_fingerprints_from_smiles_file(open(args.queries))
        else:
            # Otherwise, assume it's an FPS file
            return read_fingerprints(open(args.queries))
    else:
        # User specified a --query SMILES
        try:
            query_fp = get_query_fp(args.query)
        except ValueError as err:
            raise SystemExit(f"Cannot parse --query SMILES {args.query!r}")
        return [("Query", query_fp)]
chembl_in_memory_search.py
I've walked through all of the important parts, though there are a few missing connector pieces. Rather than show it here, you can download chembl_in_memory_search.py for yourself. You'll need to download and run popc.py in order to generate the _popc Python module. (See Monday's essay for more details.)
Search Wikipedia structures
The fun Wikipedia Chemical Structure Explorer is an interactive tool to explore the chemical structures scraped from Wikipedia. The structures can be downloaded as a SMILES file. It currently contains 17,846 structures, but it's NOT a valid SMILES file because the id and SMILES fields are swapped. I'll extract the first 10 to use for my timings and use awk to swap the two fields (the -F\t configures it to use the tab character as the field separator):
% head -10 smiles.txt | awk '-F\t' '{print $2, $1}' > wikipedia_10.smi
Many if not most of the Wikipedia structures are in ChEMBL, along with similar structures, so I'll use a high similarity threshold in order to limit the number of lines in the output:
% python chembl_in_memory_search.py --queries wikipedia_10.smi --threshold 0.9
Ammonia	CHEMBL1160819	1.000
Aspirin	CHEMBL25	1.000
Acetylene	CHEMBL116336	1.000
Adenosine triphosphate	CHEMBL14249	1.000
Adenosine triphosphate	CHEMBL339385	0.942
Adenosine triphosphate	CHEMBL437508	0.942
Adenosine triphosphate	CHEMBL490984	1.000
Adenosine triphosphate	CHEMBL591905	0.942
Adenosine triphosphate	CHEMBL407938	0.942
Adenosine triphosphate	CHEMBL14830	0.981
Adenosine triphosphate	CHEMBL1331383	0.942
Adenosine triphosphate	CHEMBL1626485	0.942
Adenosine triphosphate	CHEMBL1610462	0.925
Adenosine triphosphate	CHEMBL1162201	0.942
Adenosine triphosphate	CHEMBL4298329	0.942
Ampicillin	CHEMBL174	1.000
Ampicillin	CHEMBL453388	0.978
Ampicillin	CHEMBL1078033	0.978
Ampicillin	CHEMBL1473908	1.000
Chemistry of ascorbic acid	CHEMBL196	1.000
Chemistry of ascorbic acid	CHEMBL486293	1.000
Chemistry of ascorbic acid	CHEMBL1161421	1.000
Chemistry of ascorbic acid	CHEMBL196	1.000
Chemistry of ascorbic acid	CHEMBL486293	1.000
Chemistry of ascorbic acid	CHEMBL1161421	1.000
Amphetamine	CHEMBL405	1.000
Amphetamine	CHEMBL19393	1.000
Amphetamine	CHEMBL612	1.000
Amphetamine	CHEMBL287386	0.950
Amphetamine	CHEMBL554211	0.950
Amphetamine	CHEMBL1788294	0.950
Aspartame	CHEMBL171679	1.000
Aspartame	CHEMBL312245	1.000
Aspartame	CHEMBL2110238	1.000
Amoxicillin	CHEMBL1082	1.000
Amoxicillin	CHEMBL1367635	1.000
Amoxicillin	CHEMBL1433952	1.000
If I add the --times
option, then I see I'm doing
about 2 searches per second:
load: 4.57 search: 5.01 #queries: 10 #/sec: 2.00
The equivalent chemfp search
Chemfp is a high-performance cheminformatics similarity search package I sell. Its command-line interface is similar to that of the chembl_in_memory_search.py program I developed above. I'll try it out:
% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fps > /dev/null
open 0.00 read 0.00 search 8.55 output 0.00 total 8.56
This means it takes about 9 seconds to search the file, which is about the same time as the simple Python code I wrote!
--scan vs. --memory searches
That was unexpected. The problem is that chemfp has two search modes. The --scan mode searches the file chunk by chunk (like Monday's essay), while --memory loads the data set into memory before searching.
Years ago, the crossover point between the two modes was about 20 queries. Now I see it's much smaller! If I ask it to do an in-memory search then it takes only 5.7 seconds total:
% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fps --memory > /dev/null
open 5.60 read 0.00 search 0.07 output 0.00 total 5.67
This means it took 5.6 seconds to read the fingerprints and organize them for fast search, and the search itself took only 0.07 seconds.
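Those timings give a back-of-the-envelope way to estimate the new crossover point: in-memory search wins once the per-query savings pay back the one-time load cost. A rough sketch, ignoring startup overhead and using the times reported above:

```python
# Timings from the two runs above, each with 10 queries:
scan_per_query = 8.55 / 10   # seconds per query with --scan
load_time = 5.60             # one-time in-memory load cost ("open")
mem_per_query = 0.07 / 10    # seconds per query once loaded

# In-memory wins once:
#   n * scan_per_query > load_time + n * mem_per_query
crossover = load_time / (scan_per_query - mem_per_query)
print(f"in-memory search pays off after about {crossover:.1f} queries")
```

That works out to roughly 7 queries, which is why in-memory search already wins with my 10-query test set.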
Looks like it's time to re-tune that parameter!
chemfp's FPB format
One reason I didn't notice this until now is that I tend to work with chemfp's FPB format, which is optimized for fast loading. Every one of the FPS readers takes over 4 seconds to parse the file and organize the data into a format more appropriate for fast searches.
The FPB format organizes the fingerprint data so it can be loaded quickly. In fact, if the file is uncompressed it can be memory-mapped and used directly - for some searches chemfp doesn't even need to load the full data set!
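Memory-mapping is an operating system facility, not something chemfp-specific. Python's mmap module shows the idea with a toy file: the file's bytes are paged in only when touched, so "opening" even a large file is nearly instant:

```python
import mmap
import os
import tempfile

# Write a toy "fingerprint file" to disk.
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 1_000_000)  # pretend these are packed fingerprints

# Memory-map it: no data is read until a region is actually accessed.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[500_000:500_008]   # the OS pages in just this region
    mm.close()

print(len(chunk))  # 8
```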
First, I'll convert the FPS file to FPB format:
% fpcat chembl_27.fps -o chembl_27.fpb
I'll then do the same search with 10 Wikipedia queries using a warm disk cache:
% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fpb > /dev/null
open 0.00 read 0.00 search 0.22 output 0.00 total 0.23
That's about 22 milliseconds per query, with very little load time needed. This is fast enough to include a ChEMBL similarity search in a scripting environment.
chemfp with 1,000 Wikipedia queries
I then tried with nearly 1,000 Wikipedia queries (RDKit could not parse a couple of structures; I've omitted RDKit's warning and error messages) using chemfp and the FPS and FPB formats:
% simsearch --queries wikipedia_1000.smi --threshold 0.9 --times chembl_27.fps > /dev/null
open 5.92 read 0.26 search 2.58 output 0.01 total 8.77
% simsearch --queries wikipedia_1000.smi --threshold 0.9 --times chembl_27.fpb > /dev/null
open 0.02 read 0.25 search 2.77 output 0.02 total 3.07
With 1,000 queries the read time is now noticeable; that includes the time for RDKit to parse the SMILES strings and generate the Morgan fingerprints. In case you are curious, the performance is over 100x faster than the simple in-memory program I developed for this essay:
% python chembl_in_memory_search.py --queries wikipedia_1000.smi --threshold 0.9 --times > /dev/null
load: 4.69 search: 499.65 #queries: 994 #/sec: 1.99
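For those keeping score, comparing the search times from those two runs (499.65 seconds of pure-Python search vs. 2.77 seconds for chemfp with the FPB file) gives the actual speedup:

```python
# Search times for the 994 queries that RDKit could parse:
pure_python_search = 499.65   # seconds, simple in-memory Python program
chemfp_fpb_search = 2.77      # seconds, chemfp searching the FPB file

speedup = pure_python_search / chemfp_fpb_search
print(f"chemfp is about {speedup:.0f}x faster")  # about 180x faster
```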
chemfp advertisement
You can install and test chemfp (under the Base License Agreement) by doing:
python -m pip install chemfp -i https://chemfp.com/packages/
The Base License Agreement does not allow you to create FPB files or do an in-memory search with more than 50,000 fingerprints. If you are interested in those features, take a look at the other licensing options then contact me to ask for an evaluation license key.
Copyright © 2001-2020 Andrew Dalke Scientific AB