Cache and reuse popcount-sorted ChEMBL fingerprints
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
The program I wrote in the first essay of this series of ChEMBL fingerprint Tanimoto search algorithms took 2 seconds per query. I wrote a series of new programs which had a higher startup cost in exchange for faster per-query searches. The final program, in yesterday's essay, took 10 seconds to load but could process 35-90 queries per second, depending on the threshold.
In this short essay I'll use Python's pickle module to cache the popcount-sorted fingerprint data to a file, so later runs can skip the processing step and load in well under a second.
If you want to jump right to the code:
- Download and uncompress chembl_27.fps.gz.
- Download and run popc.py if you haven't already (it's unchanged from yesterday).
- Download chembl_bitbound_search_in_C_v3.py.
- Run python chembl_bitbound_search_in_C_v3.py --times to do the default caffeine search and see the internal timings. This will cache the fingerprint data to ./chembl_27.fps.pkl.
- Run python chembl_bitbound_search_in_C_v3.py --times again to repeat the default caffeine search using the pickle file. The load time should be much faster and the search time about the same.
Amortize the load time
The initial load takes about 10 seconds because the program processes each fingerprint to place it into the right bin, based on its popcount, then merges all fingerprints in the same popcount bin into a single byte string.
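As a reminder of what that organization looks like, here is a rough sketch of the binning step. Note that byte_popcount() and bin_by_popcount() are my stand-ins for the program's C-based helpers, not code from the essay's programs:

```python
# A rough sketch of the popcount-binning step. byte_popcount() here is a
# slow pure-Python stand-in for the C-based byte_popcount_256() in popc.py.
def byte_popcount(fp):
    # Count the 1 bits in a fingerprint byte string.
    return bin(int.from_bytes(fp, "big")).count("1")

def bin_by_popcount(id_fp_pairs, num_bits):
    # One bin per possible popcount, 0 through num_bits inclusive
    # (hence 2049 bins for 2048-bit fingerprints).
    bins = [[] for _ in range(num_bits + 1)]
    for id, fp in id_fp_pairs:
        bins[byte_popcount(fp)].append((id, fp))
    return bins

# Tiny 8-bit demonstration.
bins = bin_by_popcount([("A", b"\x00"), ("B", b"\x03"), ("C", b"\xff")], 8)
```

The real program then merges each bin's fingerprints into a single byte string, which is the part that makes the load slow.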
This organization is done every time you run the search program.
However, if you plan to run the program many times then it's worthwhile to see if the organized data can be stored (in this case, to a file) in some way which is faster to load when needed. If so, then the overhead of organizing and storing the data need only be paid once. In computer science terms, the overhead cost is amortized across all of the times the data is needed, assuming the new load time is low enough.
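To make that concrete, here's a back-of-the-envelope sketch using round numbers from the timings later in this essay (about 10 seconds to parse the FPS file each run, about 13 seconds for the one-time parse-and-save, about 0.8 seconds per cached load). By the second run the cache has already paid for itself:

```python
# Back-of-the-envelope amortization estimate. The numbers are rounded
# versions of the timings reported later in this essay.
PARSE_TIME = 10.0       # per-run FPS parse time, no cache
BUILD_TIME = 13.0       # one-time parse plus pickle save
CACHED_LOAD_TIME = 0.8  # per-run pickle load time afterwards

def total_load_time(num_runs, use_cache):
    # Cumulative load time across num_runs invocations of the program.
    if use_cache:
        return BUILD_TIME + (num_runs - 1) * CACHED_LOAD_TIME
    return num_runs * PARSE_TIME
```

For one run the cache costs a few extra seconds; for two runs it's already ahead, and every run after that saves nearly the full parse time.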
Python has a few ways to turn a simple data structure into a string or file, and convert it back into the original data structure. The most common (and fastest, in my testing) is the pickle module:
>>> import pickle
>>> data = (1, b"abc", 9.8)
>>> pkl_data = pickle.dumps(data)
>>> pkl_data
b'\x80\x04\x95\x14\x00\x00\x00\x00\x00\x00\x00K\x01C\x03abc\x94G@#\x99\x99\x99\x99\x99\x9a\x87\x94.'
>>> pickle.loads(pkl_data)
(1, b'abc', 9.8)
There are also functions to write to and read from a file opened in binary mode:
>>> with open("datafile.pkl", "wb") as outfile:
...     pickle.dump(data, outfile)
...
>>> with open("datafile.pkl", "rb") as infile:
...     print(pickle.load(infile))
...
(1, b'abc', 9.8)
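One small tweak worth knowing, though the program in this essay doesn't depend on it: pickle.dumps() and pickle.dump() take an optional protocol argument, and pickle.HIGHEST_PROTOCOL asks for the newest protocol your Python supports, which is generally faster and more compact for large bytes objects like the merged fingerprints:

```python
import pickle

# Same round-trip as above, but explicitly requesting the newest pickle
# protocol this Python supports.
data = (1, b"abc", 9.8)
pkl_data = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(pkl_data)
```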
NOTE: Pay attention to this warning from the documentation:
Warning: The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
Luckily for us, that's not a concern for this essay.
Implement a fingerprint pickler
The chembl_bitbound_search_in_C_v2.py program uses about 16 lines (with comments and a blank line) to create the list target_popcount_bins and figure out max_bin_size. Here's the code we're starting off with, indented one extra level so it makes sense in the new program:
    # Create 2049 bins, each with an empty list, ordered by popcount.
    target_popcount_bins = [[] for i in range(2049)]
    for id_fp_pair in read_fingerprints(open(args.targets)):
        # Place each record into the right bin.
        popcount = byte_popcount_256(id_fp_pair[1])
        target_popcount_bins[popcount].append(id_fp_pair)

    # Split the ids and fingerprints into parallel lists then merge the fingerprints.
    max_bin_size = 0
    for popcount, id_fp_pairs in enumerate(target_popcount_bins):
        if id_fp_pairs:
            target_ids, target_fingerprints = list(zip(*id_fp_pairs))
            target_popcount_bins[popcount] = (target_ids, b"".join(target_fingerprints))
            max_bin_size = max(max_bin_size, len(target_ids))
        else:
            target_popcount_bins[popcount] = ((), b"")
I'll save the two-element tuple (max_bin_size, target_popcount_bins) into a pickle file, with the command-line argument --no-cache to disable caching. Assuming cache_filename contains the name of the file to store the cached data, and args.no_cache is True if --no-cache is set, then saving is pretty simple:
if not args.no_cache:
    with open(cache_filename, "wb") as pkl_file:
        pickle.dump((max_bin_size, target_popcount_bins), pkl_file)
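One refinement I'll note but not pursue here: if the program is interrupted mid-write, the cache file is left half-written and a later run would try to unpickle it. The usual fix is to write to a temporary file and rename it into place. A sketch, where the save_cache() helper name is my own, not from the essay's program:

```python
import os
import pickle
import tempfile

def save_cache(cache_filename, max_bin_size, target_popcount_bins):
    # Write to a temporary file in the same directory, then rename it
    # into place. os.replace() is atomic on POSIX systems, so a reader
    # never sees a partially written cache file.
    dirname = os.path.dirname(os.path.abspath(cache_filename))
    fd, tmp_name = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            pickle.dump((max_bin_size, target_popcount_bins), tmp_file)
        os.replace(tmp_name, cache_filename)
    except BaseException:
        os.unlink(tmp_name)
        raise
```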
Reading is a matter of: if the cache file is there, use it; otherwise use the original code to read the fingerprint file, then save to the cache file. That looks like:
cache_filename = "chembl_27.fps.pkl"
if os.path.exists(cache_filename) and not args.no_cache:
    with open(cache_filename, "rb") as pkl_file:
        max_bin_size, target_popcount_bins = pickle.load(pkl_file)
else:
    # Create 2049 bins, each with an empty list, ordered by popcount.
    target_popcount_bins = [[] for i in range(2049)]
    for id_fp_pair in read_fingerprints(open(args.targets)):
        …
Other than a few minor changes, such as importing the os and pickle modules and adding the new command-line option:
parser.add_argument("--no-cache", action="store_true", help="don't read or write the processed fingerprints to 'chembl_27.fps.pkl'")
... we're done!
Timing the new loader
I'll do the default search for caffeine using the new program:
% python chembl_bitbound_search_in_C_v3.py --times
No query specified, using caffeine.
caffeine    CHEMBL113   1.000
caffeine    CHEMBL1232048   0.710
load: 13.21 search: 0.01 #queries: 1 #/sec: 109.08
You can see the load time went up by another couple of seconds. That's the extra time needed to save the 500 MB cache file:
% ls -lh chembl_27.fps.pkl
-rw-r--r--  1 dalke  staff   503M Oct  8 20:36 chembl_27.fps.pkl
Now that the cache file is created, I'll try the search again:
% python chembl_bitbound_search_in_C_v3.py --times
No query specified, using caffeine.
caffeine    CHEMBL113   1.000
caffeine    CHEMBL1232048   0.710
load: 0.78 search: 0.01 #queries: 1 #/sec: 162.16
Well under one second - more than an order of magnitude faster! In fact, it's faster than scanning the FPS file.
Which means if you can take a one-time cost of preparing and storing the 500 MB pickle file then it will be faster to search than the original FPS file, even for a single query.
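One caveat: as written, the program trusts an existing cache file even if chembl_27.fps later changes. A simple guard, my addition rather than something in the essay's program, is to compare modification times and rebuild the cache whenever the FPS file is newer:

```python
import os

def cache_is_fresh(fps_filename, cache_filename):
    # Use the cache only if it exists and is at least as new as the
    # FPS file it was built from.
    try:
        return os.path.getmtime(cache_filename) >= os.path.getmtime(fps_filename)
    except OSError:
        # Either file is missing; treat the cache as stale.
        return False
```

The os.path.exists() test in the loader would then become a cache_is_fresh() test.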
Lastly, a check with the Wikipedia SMILES:
% python chembl_bitbound_search_in_C_v3.py --queries wikipedia_1000.smi --threshold 0.35 --times > /dev/null load: 0.70 search: 30.39 #queries: 994 #/sec: 32.71
Yes, the load time is faster and the search time is still about the same. As expected.
Chemfp's FPB format
Conceptually the FPB format is similar to this pickle format. It contains the fingerprints sorted by popcount to make good use of the BitBound algorithm, and it stores the ids. It also stores a hash table to make it easy to look up a fingerprint given its id, as well as the FPS header metadata, so the actual file is slightly larger than the pickle file you saw earlier:
% time fpcat chembl_27.fps -o chembl_27.fpb
6.949u 1.403s 0:08.91 93.6% 0+0k 0+0io 676pf+0w
% ls -lh chembl_27.fpb
-rw-r--r--  1 dalke  staff   534M Oct  8 21:57 chembl_27.fpb
Most importantly, it's arranged so the data can be loaded with almost no additional processing. If the FPB file is uncompressed then the contents are memory-mapped, meaning the chemfp implementation uses the data as-is. This combines well with the BitBound algorithm because for some searches, like if the threshold is high, the search doesn't even need to read all of the data from disk.
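Chemfp's internals aren't shown here, but the memory-mapping idea itself can be sketched with Python's mmap module: the file's bytes become addressable without an up-front read, and the OS pages them in only when they are actually touched. This is a toy demonstration, not chemfp's code:

```python
import mmap
import os
import tempfile

# Write a small demo file, then memory-map it. A search that skips most
# popcount bins never touches (and so never reads) those bytes at all.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"demo" + b"\x00" * 4096)  # pretend header plus fingerprint data

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic = mm[:4]  # slicing faults in only the page(s) actually touched
    mm.close()

os.unlink(path)
```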
In the following caffeine search, the load time is reported as 0.00 simply because there aren't enough digits in the output to show it; it's under 10 milliseconds. (I reordered the output manually so the times, which are written to stderr, appear at the end, after the search results written to stdout.)
% simsearch --query 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C' --threshold 0.7 chembl_27.fpb --times
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=all threshold=0.7
#software=chemfp/3.4.1
#targets=chembl_27.fpb
2   Query1  CHEMBL113   1.00000 CHEMBL1232048   0.70968
open 0.00 read 0.00 search 0.04 output 0.00 total 0.04
With the full 1000 Wikipedia SMILES you can see the load time is a bit higher, but still quite small:
% simsearch --queries wikipedia_1000.smi --threshold 0.35 chembl_27.fpb --times > /dev/null
open 0.03 read 0.26 search 15.55 output 1.19 total 17.02
(When you have that many hits, simply formatting the scores to print them takes significant time - 7% of the total time in this case!)
chemfp is available with academic and commercial licenses. Academics can even get a no-cost license key to use all of the features in the pre-compiled package for Linux-based OSes. See chemfp's download page for installation instructions, then go to the license page to request an evaluation license key for your license of choice.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB