Simple kNN FPS Tanimoto and cosine similarity search
Yesterday
I developed an simple program to search chembl_27.fps.gz
for records with a Tanimoto similarity of at least 0.7 to caffeine. I
started by mentioning the 1986 paper by Willet, Winterman, and Bawden
Implementation
of nearestneighbor searching in an online chemical structure search
system
. As you can read from the title, that paper does a
knearest neighbor search (kNN search
, also called a topN
search
), and compares a Tanimoto similarity search to a cosine
similarity search. So really, I'm only halfway there. In this
essay I'll go the other half and implement the nearest neighbor
search. As before, this code will do almost no error handling, and
will only work for the uncompressed chembl_27.fps in the local
directory.
If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:
 Download chembl_27.fps.gz.
 Download popc.py.
 Run
python popc.py
to generate the _popc module.  Download chembl_knn_search.py.
 Run
python chembl_knn_search.py
to do the default caffeine search.  Run
python chembl_knn_search.py help
to see the commandline help, then try your own queries.
Priority queues and heaps
If you need to find the largest element in an unordered list then you can start by assuming the first item is the largest, then test each remaining item. If it's larger than the currently known largest value then use it as the new largest known value. Once done, you have the largest element. Here's an example of what that algorithm looks like:
data = [5, 8, 5, 9, 8, 7, 4] largest = data[0] for value in data[1:]: if value > largest: largest = value
What if you want to find the two largest elements? It's not hard, though it is tedious and errorprone, to extend the algorithm:
data = [5, 8, 5, 9, 8, 7, 4] largest = data[0] second_largest = data[1] if second_largest > largest : largest, second_largest = second_largest, largest for value in data[2:]: if value > second_largest: if value > largest: second_largest, largest = largest, value else: second_largest = value
A manual implementation along these lines rapidly becomes infeasible. Instead, this is a wellknown topic. The goto solution is to use a priority queue, often implemented as a heap.
Python's heqpq module
Python doesn't implement a heap data type. Instead, the heapq module
implements the heap operations using the list data type. More
specifically, it implements a min heap
, which in practice makes
it faster and easier to get the smallest element rather than the top;
the smallest item is always the first element of the list. Here's an
example where I add the elements 5, 8, 4, 7, and 3 to the heap; you
can see that heap[0] is always the smallest, though the ordering of
the other elements is less obvious:
>>> import heapq >>> heap = [] >>> heapq.heappush(heap, 5) >>> heap [5] >>> heapq.heappush(heap, 8) >>> heap [5, 8] >>> heapq.heappush(heap, 4) >>> heap [4, 8, 5] >>> heapq.heappush(heap, 7) >>> heap [4, 7, 5, 8] >>> heapq.heappush(heap, 3) >>> heap [3, 4, 5, 8, 7]
Next, I'll iteratively remove the smallest item from the list, and show that the results are from smallest to largest:
>>> heapq.heappop(heap) 3 >>> heap [4, 7, 5, 8] >>> heapq.heappop(heap) 4 >>> heap [5, 7, 8] >>> heapq.heappop(heap) 5 >>> heap [7, 8] >>> heapq.heappop(heap) 7 >>> heap [8] >>> heapq.heappop(heap) 8 >>> heap []
Topn largest elements in a list
Python's heapq functions can be used to find the largest k elements of a list. Initialize the heap to the first k values of the input. For each of the other values in the input, the value is larger than the current smallest value in the heap, then remove that smallest value and add the new, larger input value. (Python's heapreplace() is more efficient than a heappop() followed by a heappush().)
For example, the following program prints 8 9
, which
are the two largest values in the data
list:
import heapq data = [5, 8, 5, 9, 8, 7, 4]w # Initialize the heap with the first two elements heap = [] heapq.heappush(heap, data[0]) heapq.heappush(heap, data[1]) for value in data[2:]: # If a bigger value is found, remove the smallest value # in the heap and add the new value. if value > heap[0]: heapq.heapreplace(heap, value) print(heapq.heappop(heap), heapq.heappop(heap))
Reorganization
A nearest neighbor search is a bit more complicated than a threshold search because there are two stages: 1) filling the heap with k elements, and 2) adding larger elements to the heap.
To help focus on the nearest neighbor search I've reorganized and modified yesterday's example. There are now two functions to help process the inputs:
 get_query_fp(smiles)  parse the SMILES string into an RDKit molecule, generate the Morgan fingerprint with the appropriate values, and return the fingerprint as a byte string;
 read_fingerprints(infile)  parse the FPSformatted input file. Skip the header section and return an iterator of the (id, fingerprint) pairs, with the fingerprint as a hexdecoded byte string.
Here's they are:
import heapq from _popc.lib import byte_tanimoto_256 from rdkit import Chem from rdkit.Chem import rdMolDescriptors from rdkit import DataStructs # Convert the SMILES into Morgan fingerprint byte string def get_query_fp(smiles): mol = Chem.MolFromSmiles(smiles) if mol is None: raise SystemExit(f"Cannot parse SMILES {smiles!r}") query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect( mol, radius=2, nBits=2048, useChirality=0, useBondTypes=1, useFeatures=0) return DataStructs.BitVectToBinaryText(query_rd_fp) # Read (id, byte fingerprint) from a filelike object # containing an FPSformatted fingerprints. def read_fingerprints(infile): # Skip the header for line in infile: if line[:1] != "#": break else: # Reached the end of file. Stop. return # Process the first fingerprint line target_hex_fp, target_id = line.split() assert len(target_hex_fp) == 512, "Hardcoded for 2048bit fingerprints" yield target_id, bytes.fromhex(target_hex_fp) # Process the rest of the lines for line in infile: target_hex_fp, target_id = line.split() yield target_id, bytes.fromhex(target_hex_fp)
Find the 10 most similar fingerprints (Tanimoto)
I'll walk through the main function which does the search. First is some initialization code, which converts the query SMILES into a fingerprint and which reads the target fingerprints:
def main(): ## Input parameters smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" k = 10 target_filename = "chembl_27.fps" query_fp = get_query_fp(smiles) with open("chembl_27.fps") as infile: reader = read_fingerprints(infile)
Initialize the heap with k elements
Next I'll initialize the heap with the first k elements from the input file. (The zip() stops when the first of range(k) or reader finishes, so this will read at most k elements.
# Get up to the first k elements and add them to the heap heap = [] for _, (target_id, target_fp) in zip(range(k), reader): score = byte_tanimoto_256(query_fp, target_fp) heapq.heappush(heap, (score, target_id))
You'll see that I added a 2element tuple to the heap. Because of the way Python works, these heap terms will be compared elementwise, first on the smallest score, with ties broken by comparing the ChEMBL id ASCIIbetically. This will tend to bias the results toward newer ChEMBL records. Another option might be to use the record's position in the input, with a preference either for earlier occurring or later occurring records. Since the records are mostly ASCIIbetically already, the above code is similar to a preference for records which occur later in the file.
Figure out the minimum threshold
Now that I have (up to) k elements in the heap, I know that any new record I find must be larger than the smallest record in the heap. I'll keep track of that. Also, if the smallest record has a score of 1.0 then it's impossible to find a higher score, so I can stop the search:
# Any new match must be larger than the score of the smallest heap element. threshold = heap[0][0] # No need for a further search if all elements match exactly. if threshold == 1.0: pass else:
Search for more similar matches
Here's the heart of the algorithm. Compare each fingerprint. If it's larger than the current smallest score in the heap then remove the smallest term in the heap and add the new one. This might set a new, higher threshold, and if that new threshold is 1.0 then we can stop:
# Test the rest of the fingerprints for target_id, target_fp in reader: score = byte_tanimoto_256(query_fp, target_fp) if score > threshold: heapq.heapreplace(heap, (score, target_id)) # Update the threshold and see if we can stop threshold = heap[0][0] if threshold == 1.0: break
Get the hits
The final step is to print the list of matches, sorted from largest to smallest. What I'll do is heappop() each item into a list of hits. Those will be sorted from smallest to largest, so I'll reverse them, before printing the results (with the score to 3 decimal places):
# Reverse the list because heappop() returns the items from smallest to largest. hits = [heapq.heappop(heap) for _ in range(len(heap))] hits.reverse() # Print the results from largest to smallest for score, target_id in hits: print(f"{target_id}\t{score:.3f}")
__main__
Lastly, I need some way to call the main() function, which I'll do in the typical Python way by only calling it if this code is being run as a program:
if __name__ == "__main__": main()
The 10 most similar matches (Tanimoto)
Now for what you've been waiting for  the 10 most similar matches to caffeine, based on Tanimoto similarity:
CHEMBL113 1.000 CHEMBL1232048 0.710 CHEMBL446784 0.677 CHEMBL2058173 0.667 CHEMBL1738791 0.667 CHEMBL417654 0.647 CHEMBL286680 0.647 CHEMBL2058174 0.647 CHEMBL1200569 0.641 CHEMBL26455 0.618
Find the 10 most similar fingerprints (cosine)
I'm not done yet. The Willett, Winterman, and Bawden paper compared the results of a Tanimoto similarity search to the results of a cosine similarity search. I want to reimplement that sort of comparison.
cosine similarity
Since I decided to not use RDKit (which implements CosineSimilarity()) for my search, I'll need to implement it in my _popc module. I'll show just the new C code here and have you download popc.py for the full details:
static double byte_cosine_256(const unsigned char *fp1, const unsigned char *fp2) { int num_words = 2048 / 64; int dot_product = 0, popcnt1 = 0, popcnt2 = 0; /* Interpret as 64bit integers and assume possible misalignment is okay. */ uint64_t *fp1_64 = (uint64_t *) fp1, *fp2_64 = (uint64_t *) fp2; for (int i=0; i<num_words; i++) { dot_product += __builtin_popcountll(fp1_64[i] & fp2_64[i]); popcnt1 += __builtin_popcountll(fp1_64[i]); popcnt2 += __builtin_popcountll(fp2_64[i]); } if (popcnt1 == 0  popcnt2 == 0) { return 0.0; } return (dot_product / sqrt((double) (popcnt1 * popcnt2))); }
You'll need to (re)generate the shared library by running
python popc.py
.
Check the cosine implementation
If you think I wrote this correctly in one go, then thank you, but your estimation of my programming skills is too high. I used both RDKit and my own Python code to double and triplecheck the implementation. Here's my check code, which generates a large number of fingerprints and random, computes the cosine similarity in three different ways, and compares the results:
from _popc.lib import byte_cosine_256 from rdkit import DataStructs from chemfp import bitops import random N = 2048 bitnos = range(N) def near(x, y): return abs(xy) < 0.000000001 for i in range(10000): n1 = random.randrange(N+1) n2 = random.randrange(N+1) bitlist1 = random.sample(bitnos, n1) bitlist2 = random.sample(bitnos, n2) rd1 = DataStructs.ExplicitBitVect(2048) rd1.SetBitsFromList(bitlist1) rd2 = DataStructs.ExplicitBitVect(2048) rd2.SetBitsFromList(bitlist2) rd_score = DataStructs.CosineSimilarity(rd1, rd2) bytes1 = bitops.byte_from_bitlist(bitlist1, 2048) bytes2 = bitops.byte_from_bitlist(bitlist2, 2048) popc_score = byte_cosine_256(bytes1, bytes2) dot_product = len((set(bitlist1).intersection(bitlist2))) if n1 and n2: expected_score = dot_product / (n1 * n2) ** 0.5 else: expected_score = 0.0 if not near(popc_score, rd_score) or not near(popc_score, expected_score): bitlist1.sort() bitlist2.sort() print(bitlist1) print(bitlist2) print("dot_product:", dot_product) print(rd_score) print(popc_score) print(expected_score) raise AssertionError
The 10 most similar matches (cosine)
Only a few changes are needed to change the code to use cosine similarity instead of Tanimoto similarity. In short, import the new function:
from _popc.lib import byte_tanimoto_256, byte_cosine_256
and change the two occurences of:
score = byte_tanimoto_256(query_fp, target_fp)
to:
score = byte_cosine_256(query_fp, target_fp)
With that in place, the results are:
CHEMBL113 1.000 CHEMBL1232048 0.832 CHEMBL446784 0.808 CHEMBL2058173 0.803 CHEMBL1738791 0.803 CHEMBL1200569 0.801 CHEMBL417654 0.790 CHEMBL286680 0.790 CHEMBL2058174 0.790 CHEMBL26455 0.767
That list is almost identical to the Tanimoto results. This shouldn't be unexpected. Quoting Willett, Winterman, and Bawden's 1986 paper:
The difference between the two types of coefficient had not been apparent in the earlier property prediction experiments,^{10} where the data sets were fairly homogeneous in character, with many of them being sets of analogues; in a typical structure file, conversely, a wide range of structural types will be present, and noncommon fragments will play a much larger role in discriminating between structures when a ranking method is used.
HTML output with images
It's a bit difficult to compare the Tanimoto and cosine results. I want to see structures as well. We can embed the image directly from ChEMBL's servers, so I'll add an option for that.
That proved to be complicated and uninteresting, so I'm not going to review it here. Instead, I took all of the above code and created a program to do a k nearest neighbor search of the ChEMBL data set.
The final code
I went a bit overboard, which isn't that unusual for me. I added support for arguments (using Python's argparse module) so you can specify the input SMILES string, the value of k to use, select Tanimoto or cosine search, and select text or HTML output.
It's entirely too complicated to show here. Instead, you can download
popc.py
(use python popc.py
to have cffi generate the
_popc
module) and chembl_knn_search.py
and look at all the implementation details.
For your edification, here's the sidebyside comparison of the k=20 nearest neighbor search for caffeine.


Conclusion
An efficient nearest neighbor search is harder than a threshold search because managing a priority queue requires some thought. With Python we have the advantage of being able to use a builtin module that someone else had developed and debugged. And we can use Python, and packages like cffi which make it easy to pull in C code for performance.
Things were a bit different 30+ years ago, when Willett, Winterman, and Bawden were first doing this sort of study. While priority queues were well known, it may have required pulling out a text book. Or, perhaps they didn't even use a priority queue? Another option is to simply compute all of the similarity scores, sort the results, and display the largest. This might be slower and more memory intensive for large data sets, but it's easier to implement.
I tried reading the 1986 paper and its citations (and some of their citations) to find this detail, but none explicitly said how they implemented ranking.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 20012020 Andrew Dalke Scientific AB