Cache and reuse popcount-sorted ChEMBL fingerprints
This is part of a series of essays on how to write a similarity search program for the RDKit Morgan fingerprints distributed by ChEMBL.
- Simple FPS fingerprint similarity search: variations on a theme
- Simple k-NN FPS Tanimoto and cosine similarity search
- Simple in-memory ChEMBL similarity search
- Simple BitBound ChEMBL similarity search
- Using RDKit BulkTanimotoSimilarity
- Faster in-memory ChEMBL search by using more C
- Faster BitBound ChEMBL search by using more C
- Even faster in-memory search with intersection popcount
- Cache and reuse popcount-sorted ChEMBL fingerprints
The program I wrote in the first essay of this series of ChEMBL fingerprint Tanimoto search algorithms took 2 seconds per query. I wrote a series of new programs which had a higher startup cost in exchange for faster per-query searches. The final program, in yesterday's essay, took 10 seconds to load but could process 35-90 queries per second, depending on the threshold.
In this short essay I'll use Python's pickle module to cache the popcount-sorted fingerprint data to a file, so later runs can skip the processing step and load in well under a second.
If you want to jump right to the code:
- Download and uncompress chembl_27.fps.gz.
- Download and run popc.py if you haven't already (it's unchanged from yesterday).
- Download chembl_bitbound_search_in_C_v3.py.
- Run python chembl_bitbound_search_in_C_v3.py --times to do the default caffeine search and see the internal timings. This will cache the fingerprint data to ./chembl_27.fps.pkl.
- Run python chembl_bitbound_search_in_C_v3.py --times again to repeat the default caffeine search using the pickle file. The load time should be much faster and the search time about the same.
Amortize the load time
The initial load takes about 10 seconds because the program processes each fingerprint to place it into the right bin, based on its popcount, then merges all fingerprints in the same popcount bin into a single byte string.
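As a reminder of what that organization looks like, here is a rough sketch of the binning step. Note that byte_popcount() and bin_by_popcount() are my stand-ins for the program's C-based helpers, not code from the essay's programs:

```python
# A rough sketch of the popcount-binning step. byte_popcount() here is a
# slow pure-Python stand-in for the C-based byte_popcount_256() in popc.py.
def byte_popcount(fp):
    # Count the 1 bits in a fingerprint byte string.
    return bin(int.from_bytes(fp, "big")).count("1")

def bin_by_popcount(id_fp_pairs, num_bits):
    # One bin per possible popcount, 0 through num_bits inclusive
    # (hence 2049 bins for 2048-bit fingerprints).
    bins = [[] for _ in range(num_bits + 1)]
    for id, fp in id_fp_pairs:
        bins[byte_popcount(fp)].append((id, fp))
    return bins

# Tiny 8-bit demonstration.
bins = bin_by_popcount([("A", b"\x00"), ("B", b"\x03"), ("C", b"\xff")], 8)
```

The real program then merges each bin's fingerprints into a single byte string, which is the part that makes the load slow.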
This organization is done every time you run the search program.
However, if you plan to run the program many times then it's worthwhile to see if the organized data can be stored (in this case, to a file) in some way which is faster to load when needed. If so, then the overhead of organizing and storing the data need only be paid once. In computer science terms, the overhead cost is amortized across all of the times the data is needed, assuming the new load time is low enough.
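To make that concrete, here's a back-of-the-envelope sketch using round numbers from the timings later in this essay (about 10 seconds to parse the FPS file each run, about 13 seconds for the one-time parse-and-save, about 0.8 seconds per cached load). By the second run the cache has already paid for itself:

```python
# Back-of-the-envelope amortization estimate. The numbers are rounded
# versions of the timings reported later in this essay.
PARSE_TIME = 10.0       # per-run FPS parse time, no cache
BUILD_TIME = 13.0       # one-time parse plus pickle save
CACHED_LOAD_TIME = 0.8  # per-run pickle load time afterwards

def total_load_time(num_runs, use_cache):
    # Cumulative load time across num_runs invocations of the program.
    if use_cache:
        return BUILD_TIME + (num_runs - 1) * CACHED_LOAD_TIME
    return num_runs * PARSE_TIME
```

For one run the cache costs a few extra seconds; for two runs it's already ahead, and every run after that saves nearly the full parse time.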
Python has a few ways to turn a simple data structure into a string or file, and convert it back into the original data structure. The most common (and fastest, in my testing) is the pickle module:
>>> import pickle
>>> data = (1, b"abc", 9.8)
>>> pkl_data = pickle.dumps(data)
>>> pkl_data
b'\x80\x04\x95\x14\x00\x00\x00\x00\x00\x00\x00K\x01C\x03abc\x94G@#\x99\x99\x99\x99\x99\x9a\x87\x94.'
>>> pickle.loads(pkl_data)
(1, b'abc', 9.8)
There are also functions to write to and read from a file opened in binary mode:
>>> with open("datafile.pkl", "wb") as outfile:
...     pickle.dump(data, outfile)
...
>>> with open("datafile.pkl", "rb") as infile:
...     print(pickle.load(infile))
...
(1, b'abc', 9.8)
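One small tweak worth knowing, though the program in this essay doesn't depend on it: pickle.dumps() and pickle.dump() take an optional protocol argument, and pickle.HIGHEST_PROTOCOL asks for the newest protocol your Python supports, which is generally faster and more compact for large bytes objects like the merged fingerprints:

```python
import pickle

# Same round-trip as above, but explicitly requesting the newest pickle
# protocol this Python supports.
data = (1, b"abc", 9.8)
pkl_data = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(pkl_data)
```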
NOTE: Pay attention to this warning from the documentation:
Warning: The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
Luckily for us, that's not a concern for this essay.
Implement a fingerprint pickler
The chembl_bitbound_search_in_C_v2.py program uses about 16 lines (with comments and a blank line) to create the list target_popcount_bins and figure out max_bin_size. Here's the code we're starting off with, indented one extra level so it makes sense in the new program:
    # Create 2049 bins, each with an empty list, ordered by popcount.
    target_popcount_bins = [[] for i in range(2049)]
    for id_fp_pair in read_fingerprints(open(args.targets)):
        # Place each record into the right bin.
        popcount = byte_popcount_256(id_fp_pair[1])
        target_popcount_bins[popcount].append(id_fp_pair)

    # Split the ids and fingerprints into parallel lists then merge the fingerprints.
    max_bin_size = 0
    for popcount, id_fp_pairs in enumerate(target_popcount_bins):
        if id_fp_pairs:
            target_ids, target_fingerprints = list(zip(*id_fp_pairs))
            target_popcount_bins[popcount] = (target_ids, b"".join(target_fingerprints))
            max_bin_size = max(max_bin_size, len(target_ids))
        else:
            target_popcount_bins[popcount] = ((), b"")
I'll save the two-element tuple (max_bin_size, target_popcount_bins) into a pickle file, with the command-line argument --no-cache to disable caching. Assuming cache_filename contains the name of the file to store the cached data, and args.no_cache is True if --no-cache is set, then saving is pretty simple:
if not args.no_cache:
    with open(cache_filename, "wb") as pkl_file:
        pickle.dump((max_bin_size, target_popcount_bins), pkl_file)
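One refinement I'll note but not pursue here: if the program is interrupted mid-write, the cache file is left half-written and a later run would try to unpickle it. The usual fix is to write to a temporary file and rename it into place. A sketch, where the save_cache() helper name is my own, not from the essay's program:

```python
import os
import pickle
import tempfile

def save_cache(cache_filename, max_bin_size, target_popcount_bins):
    # Write to a temporary file in the same directory, then rename it
    # into place. os.replace() is atomic on POSIX systems, so a reader
    # never sees a partially written cache file.
    dirname = os.path.dirname(os.path.abspath(cache_filename))
    fd, tmp_name = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            pickle.dump((max_bin_size, target_popcount_bins), tmp_file)
        os.replace(tmp_name, cache_filename)
    except BaseException:
        os.unlink(tmp_name)
        raise
```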
Reading is a matter of: if the cache file is there, use it; otherwise use the original code to read the fingerprint file, then save to the cache file. That looks like:
cache_filename = "chembl_27.fps.pkl"
if os.path.exists(cache_filename) and not args.no_cache:
    with open(cache_filename, "rb") as pkl_file:
        max_bin_size, target_popcount_bins = pickle.load(pkl_file)
else:
    # Create 2049 bins, each with an empty list, ordered by popcount.
    target_popcount_bins = [[] for i in range(2049)]
    for id_fp_pair in read_fingerprints(open(args.targets)):
        …
Other than a few minor changes, such as importing the os and pickle modules and adding the new command-line option:
parser.add_argument("--no-cache", action="store_true", help="don't read or write the processed fingerprints to 'chembl_27.fps.pkl'")
... we're done!
Timing the new loader
I'll do the default search for caffeine using the new program:
% python chembl_bitbound_search_in_C_v3.py --times
No query specified, using caffeine.
caffeine    CHEMBL113   1.000
caffeine    CHEMBL1232048   0.710
load: 13.21 search: 0.01 #queries: 1 #/sec: 109.08
You can see the load time went up by another couple of seconds. That's the extra time needed to save the 500 MB cache file:
% ls -lh chembl_27.fps.pkl
-rw-r--r--  1 dalke  staff   503M Oct  8 20:36 chembl_27.fps.pkl
Now that the cache file is created, I'll try the search again:
% python chembl_bitbound_search_in_C_v3.py --times
No query specified, using caffeine.
caffeine    CHEMBL113   1.000
caffeine    CHEMBL1232048   0.710
load: 0.78 search: 0.01 #queries: 1 #/sec: 162.16
Well under one second - more than an order of magnitude faster! In fact, it's faster than scanning the FPS file.
Which means if you can take a one-time cost of preparing and storing the 500 MB pickle file then it will be faster to search than the original FPS file, even for a single query.
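One caveat: as written, the program trusts an existing cache file even if chembl_27.fps later changes. A simple guard, my addition rather than something in the essay's program, is to compare modification times and rebuild the cache whenever the FPS file is newer:

```python
import os

def cache_is_fresh(fps_filename, cache_filename):
    # Use the cache only if it exists and is at least as new as the
    # FPS file it was built from.
    try:
        return os.path.getmtime(cache_filename) >= os.path.getmtime(fps_filename)
    except OSError:
        # Either file is missing; treat the cache as stale.
        return False
```

The os.path.exists() test in the loader would then become a cache_is_fresh() test.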
Lastly, a check with the Wikipedia SMILES:
% python chembl_bitbound_search_in_C_v3.py --queries wikipedia_1000.smi --threshold 0.35 --times > /dev/null load: 0.70 search: 30.39 #queries: 994 #/sec: 32.71
Yes, the load time is faster and the search time is still about the same. As expected.
Chemfp's FPB format
Conceptually the FPB format is similar to this pickle format. It contains the fingerprints sorted by popcount to make good use of the BitBound algorithm, and it stores the ids. It also stores a hash table to make it easy to look up a fingerprint given its id, as well as the FPS header metadata, so the actual file is slightly larger than the pickle file you saw earlier:
% time fpcat chembl_27.fps -o chembl_27.fpb
6.949u 1.403s 0:08.91 93.6% 0+0k 0+0io 676pf+0w
% ls -lh chembl_27.fpb
-rw-r--r--  1 dalke  staff   534M Oct  8 21:57 chembl_27.fpb
Most importantly, it's arranged so the data can be loaded with almost no additional processing. If the FPB file is uncompressed then the contents are memory-mapped, meaning the chemfp implementation uses the data as-is. This combines well with the BitBound algorithm because for some searches, like if the threshold is high, the search doesn't even need to read all of the data from disk.
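Chemfp's internals aren't shown here, but the memory-mapping idea itself can be sketched with Python's mmap module: the file's bytes become addressable without an up-front read, and the OS pages them in only when they are actually touched. This is a toy demonstration, not chemfp's code:

```python
import mmap
import os
import tempfile

# Write a small demo file, then memory-map it. A search that skips most
# popcount bins never touches (and so never reads) those bytes at all.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"demo" + b"\x00" * 4096)  # pretend header plus fingerprint data

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic = mm[:4]  # slicing faults in only the page(s) actually touched
    mm.close()

os.unlink(path)
```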
In the following caffeine search, the load time is reported as 0.00 simply because there aren't enough digits in the output to show it; it's under 10 milliseconds. (I reordered the output manually so the times, which are written to stderr, appear at the end, after the search results written to stdout.)
% simsearch --query 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C' --threshold 0.7 chembl_27.fpb --times
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=all threshold=0.7
#software=chemfp/3.4.1
#targets=chembl_27.fpb
2   Query1  CHEMBL113   1.00000 CHEMBL1232048   0.70968
open 0.00 read 0.00 search 0.04 output 0.00 total 0.04
With the full 1000 Wikipedia SMILES you can see the load time is a bit higher, but still quite small:
% simsearch --queries wikipedia_1000.smi --threshold 0.35 chembl_27.fpb --times > /dev/null
open 0.03 read 0.26 search 15.55 output 1.19 total 17.02
(When you have that many hits, simply formatting the scores to print them takes significant time - 7% of the total time in this case!)
chemfp is available with academic and commercial licenses. Academics can even get a no-cost license key to use all of the features in the pre-compiled package for Linux-based OSes. See chemfp's download page for installation instructions, then go to the license page to request an evaluation license key for your license of choice.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB