Fingerprint set similarity
This is part 1 of a three-part series in generating fingerprint-based set similarities using chemfp. Read part 2 where I figure out how to use the ChEMBL bioactivity data and part 3 where I put everything together and visualize the results.
Someone recently asked how I might use chemfp to compute the similarity between two sets of fingerprints. This would be used for something like SEA (Similarity Ensemble Approach), which "relates proteins based on the set-wise chemical similarity among their ligands. It can be used to rapidly search large compound databases and to build cross-target similarity maps."
The underlying techniques of document set similarity are used in many fields. Bioinformatics, for example, uses it to analyze gene expression microarrays to find correlations between genes which might indicate they affect similar pathways. That said, I am no expert on the topic. I implemented a SEA-like algorithm for one client, but I don't know enough to judge the effectiveness of one approach over another, nor do I know the underlying statistics.
But that's okay. I'll assume that you know these details, and just want to know how to use chemfp to get the raw numbers to feed into your scoring function.
The first step is to create and load fingerprints. I have some pre-computed fingerprints for ChEMBL 21, so I'll use that as my example. I'll use the 2048-bit RDKit fingerprints. To generate them yourself, can use the command-line tool rdkit2fps, like this:
% rdkit2fps chembl_21.sdf.gz -o chembl_21.rdkit2048.fps.gz
The FPS file is a text-based, line-oriented format which is easy to for both humans and software to read and write. Easy, however, does not mean fast. It takes about 20 seconds to load this fps.gz file:
>>> import time >>> import chemfp >>> filename = "/Users/dalke/databases/chembl_21.rdkit.rdkit2048.fps.gz" >>> t1 = time.time(); chemfp.load_fingerprints(filename);t2=time.time() <chemfp.arena.FingerprintArena object at 0x1021aef90> >>> t2-t1 19.388361930847168
Chemfp 2.0 introduced the FPB binary file format, which is more structured around the chemfp internal data structures. It's very fast for chemfp to load, but much more complicated. All of the chemfp tools know how to read and write the FPB format, so you could create the FPB file like this:
% rdkit2fps chembl_21.sdf.gz -o chembl_21.rdkit2048.fpbIf you already have an FPS (or gzip-ed FPS) file then use the fpcat tool to convert it to an FPB file:
% fpbcat chembl_21.rdkit2048.fps.gz -o chembl_21.rdkit2048.fpbThe conversion took about 32 seconds on my 7 year old laptop.
I'll switch back to Python and show why I developed the FPB format. The first time I loaded it took 0.4 seconds:
>>> t1=time.time(); chemfp.load_fingerprints("chembl_21.rdkit2048.fpb");t2=time.time() <chemfp.fpb_io.FPBFingerprintArena object at 0x113809790> >>> t2-t1 0.3760688304901123Much of that time is likely waiting for the hard disk. When I open it again, the data is already in disk cache, so the load time is now under 0.1 seconds:
>>> t1=time.time(); chemfp.load_fingerprints("chembl_21.rdkit2048.fpb");t2=time.time() <chemfp.fpb_io.FPBFingerprintArena object at 0x113809650> >>> t2-t1 0.06340503692626953That's much better than the 20 seconds it took to read from the gzipped FPS file. For what it's worth, for the RDKit 2048-bit fingerprint, the fps.gz file is slightly more compact than the .fpb file:
% ls -l chembl_21.rdkit2048.fpb chembl_21.rdkit2048.fps.gz -rw-r--r-- 1 dalke staff 416031861 May 12 2016 chembl_21.rdkit2048.fps.gz -rw-r--r-- 1 dalke staff 457103145 Mar 18 02:36 chembl_21.rdkit2048.fpb
Select a subset
In chemfp, a set of fingerprints is called an "arena". (I struggled for a while to come up with a good name. "A 'hand' of fingerprints" was humorous, but in the end I re-used the name from memory management.) I want to use a subset of the arena. The best solution is to use the arena.copy() method. By itself it makes a copy of the entire data set:
>>> import chemfp >>> arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb") >>> subarena = arena.copy() >>> len(arena), len(subarena) (1583839, 1583839)
Arenas act like an array of (id, fingerprint) pairs. The first element is term [0], the second [1], and so on. For example:
>>> arena[1234] (u'CHEMBL79699', '\x04\x00\x00\x08 ... many bytes omitted ... \x00\xc0') >>> arena[5678] (u'CHEMBL291214', '\x00\x00\x00\x08 ... many bytes omitted ... \x00 \x80')
I'll make another copy, but this time ask chemfp to only copy indices 1234 and 5678:
>>> subarena = arena.copy(indices=[1234, 5678]) >>> len(subarena) 2 >>> subarena.ids [u'CHEMBL79699', u'CHEMBL291214'] >>> subarena[0] (u'CHEMBL79699', '\x04\x00\x00\x08 ... many bytes omitted ... \x00\xc0')(Notes: by default the subarena may rearrange the fingerprints for better search performance. If you want the new arena to have the fingerprint in the same order as the list of indices then also pass in "reorder=False". Also, the u'' here shows that I'm using chemfp 3.0 on Python 2.7. Earlier versions of chemfp return both the id and fingerprint as byte strings, while chemfp 3.0 returns the id as a Unicode string and the fingerprint as a byte string. Under Python 2, Unicode strings are shown using the 'u' prefix. Under Python 3, byte strings are represented with the 'b' prefix.)
Make a random subset
I need some subsets so I can compare them. To get started I'll just use randomly selected elements. This is pleasently easy, using a technique I learned from reading the Python documentation for random.sample:
To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).Note: the xrange() object is Python 2 specific. If you are using Python 3 then replace "xrange" with "range".
Here are some examples where I sample 4 values (without replacement) from 0...99:
>>> import random >>> random.sample(xrange(100), 4) [68, 49, 79, 38] >>> random.sample(xrange(100), 4) [77, 57, 33, 43] >>> random.sample(xrange(100), 4) [32, 70, 13, 53]I'll change this to get 4 randomly indices in the arena:
>>> random.sample(xrange(len(arena)), 4) [1089497, 196869, 1474590, 376331] >>> random.sample(xrange(len(arena)), 4) [161904, 1136119, 1455737, 518548] >>> random.sample(xrange(len(arena)), 4) [1092929, 218864, 1436330, 1357672]
Here's a function to create a subarena arena of size n, in easy-to-copy form, which I'll then execute interactively:
import random def make_subset(arena, n): indices = random.sample(xrange(len(arena)), n) return arena.copy(indices=indices) >>> a = make_subset(arena, 4) >>> len(a) 4 >>> a.ids [u'CHEMBL1448181', u'CHEMBL1532656', u'CHEMBL3480613', u'CHEMBL1765556']
Jaccard index between two sets
The Jaccard index is a widely used scoring function. (In cheminformatics we know it better as the Tanimoto, for historical reasons tied more to how knowledge diffuses than to strict date priority.) Simply put, it's defined as:
number of elements common to both A and B Jaccard(A, B) = ----------------------------------------- number of unique elements in A or BThe harder part is to turn this into something meaningful. What does it mean for an fingerprint record to be common to both A and B?
One possibility is to look at the identifier, and say that a record is common to A and B if and only if the record's identifier is in both sets. However, it would be nice to capture some sense of chemical similarity. We can instead say that a record is in common if and only if they have the same fingerprints, but 100% similarity is very strict.
Let's lower the threshold a bit, and say that a record 'a' in A is common to B if and only if there is at least one record 'b' in B such that the fingerprint Tanimoto(a, b) >= 0.8.
If you think at this for a bit you'll see a couple of things. First off, this definition is not symmetric. If there is at least one similar 'b' for some 'a' then there is at least one similar 'a' for that 'b', but 'a' might be the nearest neighbor of two different 'b's. (The asymmetry could be fixed by also including the results of a search of B against A. I'm not going to consider that approach.) Second, "has a neighbor in B" can be implemented as a k=1 nearest neighbor search of A against B, which is a built-in chemfp function.
I'll make two subsets with 1,000 elements each, randomly sampled from the ChEMBL-21 arena:
>>> subset1 = make_subset(arena, 1000) >>> subset2 = make_subset(arena, 1000)then use chemfp to do the k=1 nearest-neighbor search between two arenas, with a minimum required score of 0.8:
>>> from chemfp import search >>> results = search.knearest_tanimoto_search_arena(subset1, subset2, k=1, threshold=0.8)
Most randomly chosen fingerprints are below 0.8 threshold, so they find 0 neighbors, but some of them have 1 neighbor:
>>> len(results[0]) 0 >>> len(results[1]) 0 >>> len(results[26]) 1There are a few ways to count the number of element in common. You can use normal Python functions, like:
>>> sum(len(result) for result in results) 24A faster way is to let chemfp count all of the hits for you:
>>> results.count_all() 24
Okay, so Tanimoto similarity gives the numerator to the Jaccard set similarity. What about the denominator, which is "the number of unique elements in A or B"?
For that I'll count the total number of unique identifiers in both A and B. You saw earlier that arena.ids gives the list of identifiers for the arena. I'll use those to make a Python set made of the union of the two ids, then get the size of the resulting set:
>>> len(set(subset1.ids).union(subset2.ids)) 2000Thus, the Jaccard index of the two sets is 24/2000, or 0.012. They are not that similar.
A slighly more interesting case
The denominator of 2000 is rather boring, as it's 1000+1000, that is, the sum of the individual set sizes. That's because the two randomly chosen sets have no elements in common:
>>> len(set(subset1.ids).intersection(subset2.ids)) 0I'll make some random samples until there are at least 3 elements in common:
>>> while len(set(subset1.ids).intersection(subset2.ids)) < 3: ... subset2 = make_subset(arena, 1000) ... >>> len(set(subset1.ids).intersection(subset2.ids)) 4 >>> set(subset1.ids).intersection(subset2.ids) set([u'CHEMBL64050', u'CHEMBL129835', u'CHEMBL255712', u'CHEMBL88792']) >>> len(set(subset1.ids).union(subset2.ids)) 1996These two subsets have 40 elements in common:
>>> search.knearest_tanimoto_search_arena(subset1, subset2, k=1, threshold=0.8).count_all() 40given them a Jaccard index of 40/1996, or 0.02. That's better than 0.012, but still not very similar.
Estimating background similarity
I want to get some idea of the average similarity between two randomly chosen sets, at different sizes and threshold scores. I'll define a 'jaccard' function to help with the calculation:
from __future__ import division # for Python 2.x, make 1/2 be 0.5, not 0 from chemfp import search # Note: this isn't quite the Jaccard similarity as the non-reflexive # numerator means that often jaccard(A, B) != jaccard(B, A) def jaccard(arena1, arena2, threshold=0.8): numerator = search.knearest_tanimoto_search_arena( arena1, arena2, k=1, threshold=threshold).count_all() denominator = len(set(arena1.ids).union(arena2.ids)) return numerator / denominator >>> jaccard(subset1, subset2) 0.02004008016032064 >>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000)) 0.017508754377188594 >>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000)) 0.021 >>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000)) 0.0155
It varies quite a bit, so I'll compute 100 scores and get some information about the min, max, average, and standard deviation:
>>> scores = [] >>> for i in range(50): ... scores.append(jaccard(make_subset(arena, 1000), make_subset(arena, 1000))) ... >>> min(scores), max(scores) (0.008, 0.022)
What about the average and standard deviation? Python 3.4 introduced the statistics module, which has those as built-in functions. While chemfp 3.0 supports Python 3.5 and greater, I want this blog post to work on both Python 2.7 and 3.3, so I'll import some functionality from numpy:
>>> import numpy as np >>> np.mean(scores) 0.014819138037618857 >>> np.std(scores) 0.0025880786854096645(One obvious question is, is standard deviation an appropriate characterization, or am I hammering a square block into a round hole? I'll let someone more knowledgeable about statistics figure that out.)
I then wrote a function to show the mean and standard deviation of 100 Jaccard scores, across the NxM cross product of some set sizes, that is size 10x10, size 10x20, ... 10x2000, 20x10, ... up to size 2000x2000. Here's the function, which is mostly taken up by code that tries to format things nicely:
def compute_score_table(arena): sizes = (10, 20, 50, 100, 200, 500, 1000, 2000) print(" " + "".join(str(size).center(12) for size in sizes)) for size1 in sizes: output1 = str(size1).rjust(4) + " " output2 = " " for size2 in sizes: scores = [] for i in range(100): score = jaccard(make_subset(arena, size1), make_subset(arena, size2)) scores.append(score) output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) output2 += (" %.5f" % (np.std(scores),)).ljust(12) print(output1) print(output2)
The output shows the average and standard deviations on alternate lines, so I could make better use of vertical space:
>>> compute_score_table(arena) 10 20 50 100 200 500 1000 2000 10 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.00000 0.00000 0.00233 0.00127 0.00121 0.00059 0.00036 0.00033 20 0.000 +/- 0.000 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.00332 0.00000 0.00345 0.00266 0.00199 0.00115 0.00068 0.00054 50 0.000 +/- 0.001 +/- 0.001 +/- 0.002 +/- 0.002 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.00233 0.00371 0.00271 0.00318 0.00322 0.00146 0.00097 0.00067 100 0.000 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.003 +/- 0.003 +/- 0.003 +/- 0.002 +/- 0.00127 0.00198 0.00256 0.00256 0.00382 0.00193 0.00174 0.00121 200 0.000 +/- 0.001 +/- 0.002 +/- 0.002 +/- 0.004 +/- 0.005 +/- 0.005 +/- 0.005 +/- 0.00121 0.00276 0.00316 0.00311 0.00305 0.00286 0.00208 0.00119 500 0.000 +/- 0.001 +/- 0.002 +/- 0.003 +/- 0.006 +/- 0.009 +/- 0.010 +/- 0.010 +/- 0.00110 0.00108 0.00236 0.00298 0.00314 0.00318 0.00251 0.00177 1000 0.000 +/- 0.001 +/- 0.002 +/- 0.004 +/- 0.006 +/- 0.012 +/- 0.014 +/- 0.016 +/- 0.00061 0.00132 0.00225 0.00280 0.00291 0.00275 0.00272 0.00254 2000 0.001 +/- 0.001 +/- 0.002 +/- 0.004 +/- 0.007 +/- 0.014 +/- 0.019 +/- 0.024 +/- 0.00128 0.00130 0.00192 0.00269 0.00293 0.00303 0.00309 0.00250(By the way, if all of the values are 0.0 then you are likely using Python 2.7 and have omitted the "from __future__ import division" line when you defined the jaccard() function. I did that a few times when developing this essay.)
My jaccard() function is not reflexive, which you can verify by comparing terms i by j with j by i, as in 1000x2000 vs 2000x100. Their respective values are 0.019+/-0.00309 and 0.016+/-0.00254. The averages are multiple standard deviations away from each other.
The scores are also a function of set size, which is not what you want for a good scoring function. The 2000x2000 term has an average score of 0.024 while the 1000x1000 has an average score 0.014. The elements in the sets are randomly selected, so I expected the two comparisons to have about the same score. This bias towards larger sets is built into my similarity definition, so I conclud that my jaccard() implementation is not a good scoring function for sets.
There are ways to improve the scoring function, for example by normalizing to the background average/standard deviation to give a Z-score, or by better use of statistics. That is "left to the student as an exercise."
Here's the final code in a single block so you can use it more easily:
from __future__ import division import random import numpy as np import chemfp from chemfp import search # Support both Python 2 and Python 3 try: xrange # Check if this is Python 2 except NameError: xrange = range # Use 'range' for Python 3 def make_subset(arena, n): indices = random.sample(xrange(len(arena)), n) return arena.copy(indices=indices) # Note: this isn't quite the Jaccard similarity as the non-reflexive # numerator means that often jaccard(A, B) != jaccard(B, A) def jaccard(arena1, arena2, threshold=0.8): numerator = search.knearest_tanimoto_search_arena( arena1, arena2, k=1, threshold=threshold).count_all() denominator = len(set(arena1.ids).union(arena2.ids)) return numerator / denominator def compute_score_table(arena): sizes = (10, 20, 50, 100, 200, 500, 1000, 2000) print(" " + "".join(str(size).center(12) for size in sizes)) for size1 in sizes: output1 = str(size1).rjust(4) + " " output2 = " " for size2 in sizes: scores = [] for i in range(100): score = jaccard(make_subset(arena, size1), make_subset(arena, size2)) scores.append(score) output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) output2 += (" %.5f" % (np.std(scores),)).ljust(12) print(output1) print(output2) arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb") compute_score_table(arena)
Details about the best match
While it isn't needed for the set-to-set comparison, you might be interested in knowing more about the closest match in the k=1 nearest search. It's easy to get the list of ids and scores for the non-empty rows:
>>> subset1 = make_subset(arena, 1000) >>> subset2 = make_subset(arena, 1000) >>> results = search.knearest_tanimoto_search_arena( ... subset1, subset2, k=1, threshold=0.8) ... >>> for result in results: ... if result: ... print("%-13s %.3f" % result.get_ids_and_scores()[0]) ... CHEMBL386477 0.926 CHEMBL386477 1.000 CHEMBL275879 0.954 CHEMBL2138462 0.921 CHEMBL83932 0.813 CHEMBL2403669 0.805 CHEMBL2372083 0.819 CHEMBL1213903 0.988 CHEMBL1574001 0.901 CHEMBL2337012 0.913 CHEMBL370254 0.806 CHEMBL555410 0.908 CHEMBL2337012 0.897 CHEMBL1683180 0.822 CHEMBL73244 0.801 CHEMBL605903 0.831 CHEMBL1834630 0.806 CHEMBL1475722 0.821 CHEMBL65904 0.912 CHEMBL2106627 0.812 CHEMBL108353 0.816 CHEMBL108353 0.825
It becomes a bit more complicated if you also want information about the query and its fingerprint, so let's do that. I'll also print the query id, the number of 1-bits in the query and target fingerprint (better known as the popcount), and the number of 1-bits in their fingerprint intersection, that is, the number of bits in common. Here's the code:
import random import numpy as np import chemfp from chemfp import search, bitops # Support both Python 2 and Python 3 try: xrange # Check if this is Python 2 except NameError: xrange = range # Use 'range' for Python 3 def make_subset(arena, n): indices = random.sample(xrange(len(arena)), n) return arena.copy(indices=indices) arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb") subset1 = make_subset(arena, 1000) subset2 = make_subset(arena, 1000) results = search.knearest_tanimoto_search_arena( subset1, subset2, k=1, threshold=0.8) print(" query_id #bits target_id #bits common score") for i, result in enumerate(results): if not result: continue # skip elements with no matches query_id, query_fp = subset1[i] num_query_bits = bitops.byte_popcount(query_fp) j, score = result.get_indices_and_scores()[0] target_id, target_fp = subset2[j] num_target_bits = bitops.byte_popcount(target_fp) in_common = bitops.byte_intersect_popcount(query_fp, target_fp) print("%-13s %5d %-13s %5d %5d %.4f" % (query_id, num_query_bits, target_id, num_target_bits, in_common, score))It generates the following output:
query_id #bits target_id #bits common score CHEMBL340793 710 CHEMBL443821 657 657 0.9254 CHEMBL2115272 710 CHEMBL1096675 758 667 0.8327 CHEMBL262239 761 CHEMBL411281 761 761 1.0000 CHEMBL529191 821 CHEMBL3434725 843 784 0.8909 CHEMBL389943 868 CHEMBL414961 887 807 0.8513 CHEMBL350381 1008 CHEMBL162288 899 899 0.8919 CHEMBL1407603 1042 CHEMBL1610293 1001 955 0.8778 CHEMBL484872 1058 CHEMBL1610293 1001 967 0.8855 CHEMBL381366 1066 CHEMBL426130 1060 985 0.8633 CHEMBL2058350 1098 CHEMBL2058354 1005 1005 0.9153 CHEMBL1898269 1112 CHEMBL1893417 1049 982 0.8329 CHEMBL132434 1123 CHEMBL15324 1158 1079 0.8977 CHEMBL332868 1128 CHEMBL77685 1160 1052 0.8511 CHEMBL2143647 1188 CHEMBL1893417 1049 1008 0.8202 CHEMBL1711444 1189 CHEMBL1310059 1231 1102 0.8361 CHEMBL3249400 1232 CHEMBL2371463 1128 1062 0.8182 CHEMBL1367025 1242 CHEMBL1352709 1345 1213 0.8828 CHEMBL77799 1290 CHEMBL1081652 1218 1135 0.8267 CHEMBL2303747 1296 CHEMBL15324 1158 1096 0.8071 CHEMBL451899 1656 CHEMBL477245 1551 1433 0.8078 CHEMBL362938 1708 CHEMBL1731136 1966 1641 0.8072 CHEMBL525425 1730 CHEMBL1731136 1966 1658 0.8135 CHEMBL516074 1742 CHEMBL3113501 1876 1691 0.8775 CHEMBL2029145 1764 CHEMBL2304149 1713 1706 0.9633 CHEMBL298265 1865 CHEMBL1731136 1966 1788 0.8752 CHEMBL373910 1906 CHEMBL1731136 1966 1830 0.8962
Cumulative value scoring methods
I earlier pointed out two big problems with the jaccard() implementation I proposed. It isn't symmetric (jaccard(a, b) is rarely equal to jaccard(b, a)), and it has a bias towards larger sizes.
There's another problem. Suppose A and B both have n elements, with no overlaps (making the denominator 2n), and suppose also that each element in A has exactly one neighbor with at least a 0.8 similarity. Then the jaccard() method will report a similarity of 1.0.
Now consider a set C which again has n elements, with no overlaps with elements in A. But this time each element of A has 5 neighbors in C, and vice versa. The jaccard() function only checks if there is at least one neighbor, so will still return a score of 1.0, even though sets A and C are more compact and closer to each other than A was to B.
One way to capture that information is to look at the total number of matches between A and B, and between A and C. Instead of doing a k=1 nearest neighbor search, I'll do a threshold search and count the total number of matches between the two:
>>> A = make_subset(arena, 1000) >>> B = make_subset(arena, 1000) >>> C = make_subset(arena, 1000) >>> results_AB = search.threshold_tanimoto_search_arena(A, B, threshold=0.8) >>> results_AB.count_all() 34 >>> results_AC = search.threshold_tanimoto_search_arena(A, C, threshold=0.8) >>> results_AC.count_all() 49This result is also reflexive, that is, comparing (A, B) and (B, A) give the same values.
One possible concern is that the match count assumes that all similarities are of equal importance, so a similarity of 0.8 is just as strong as 1.0. A more subtle method might use the cumulative sum of the scores in AxB and AxC, which is returned by the cumulative_score_all() method:
>>> results_AB.cumulative_score_all() 29.499029322763217 >>> results_AC.cumulative_score_all() 42.23963171858597
I pointed out earlier that my naive jaccard() implementation is biased towards comparing large sets. A similar bias applies when using counts or cumulative scores. With randomly selected elements, the count or cumulative scores will scale as the product of the set sizes:
>>> def get_count(n, m, repeat=100): ... count = 0 ... for i in range(repeat): ... subset1 = make_subset(arena, n) ... subset2 = make_subset(arena, m) ... results = search.threshold_tanimoto_search_arena(subset1, subset2) ... count += results.count_all() ... return count ... >>> get_count(1000, 1000) 29709 >>> get_count(1000, 1000) 29803 >>> get_count(2000, 1000) 58237 >>> get_count(2000, 1000) 61533 >>> get_count(3000, 1000) 91164 >>> get_count(3000, 1000) 91833 >>> get_count(3000, 3000) 265121 >>> get_count(3000, 3000)Change the count_all() to cumulative_score_all() to get similar results using the sum of all the scores.
As I mentioned, an advantage of the count and cumulative score approach over the k=1 nearest-neighbor search is that they are symmetric, or more precisely, reflexive. Compare the following to the (3000, 1000) output of the previous code and you'll see they give about the same values:
>>> get_count(1000, 3000) 91185 >>> get_count(1000, 3000) 90586
The joy of floating point arithmetic
This is an aside. Here are two alternative ways to compute the cumulative sum of all of the scores, the first by computing the cumulative_sum() of each individual result, and the second by computing the sum of the sum of each list of scores using Python.
>>> sum(result.cumulative_score() for result in results_AB) 29.499029322763214 >>> sum(sum(result.get_scores()) for result in results_AB) 29.499029322763214It really bugs me that the cumulative_score_all() and the other two approaches differ by -3.552713678800501e-15, which is the least significant bit of the two representations:
>>> sum(sum(result.get_scores()) for result in results_AB).hex() '0x1.d7fc062bd0356p+4' >>> results_AB.cumulative_score_all().hex() '0x1.d7fc062bd0357p+4'
After a bit of research, I confirmed that it's not due to a chemfp bug, but rather because floating point addition using doubles is not associative. If I sum up all of the scores in order, I get the value from cumulative_score_all():
>>> scores = [] >>> for result in results_AB: ... scores.extend(result.get_scores()) ... >>> sum(scores) 29.499029322763217
If I group the scores so I first sum up each row, I get the other value:
>>> scores = [] >>> for result in results_AB: ... scores.append(sum(result.get_scores())) ... >>> sum(scores) 29.499029322763214
Minor differences in the last significant digit are an all too common consequence of using floating point numbers, so while it annoyed me, it was the first reason I thought of.
Cumulative score-based similarity
In an earlier section I showed that the total count or the cumulative score between two randomly chosen sets scales as the product of the two set sizes. What about normalizing the cumulative score by that product?
There's probably a name for this similarity measure, but I don't know much about this field and don't know the name. I'll call it "sss" for "sum of scores similarity".
# I don't actually use this one def sss(arena1, arena2, threshold=0.8): results = search.threshold_tanimoto_search_arena( arena1, arena2, threshold=threshold) return results.cumulative_score_all() / (len(arena1) * len(arena2))After some experimentation, I decided to scale the score by 300, because that results in two sets with randomly chosen terms having a similarity of 0.01.
# This is the one I use. def sss(arena1, arena2, threshold=0.8): results = search.threshold_tanimoto_search_arena( arena1, arena2, threshold=threshold) # The scaling factor of 300 was chosen so that random ChEMBL # subsets have a score of about 0.01. It doesn't matter for # ranking as it's used as an arbitrary scaling factor. similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2)) return similarityI made a variant of the similarity table I used before, this time with the sss() function as my similarity measure, instead of jaccard():
def compute_sss_table(arena): sizes = (10, 20, 50, 100, 200, 500, 1000, 2000) print(" " + "".join(str(size).center(12) for size in sizes)) for size1 in sizes: output1 = str(size1).rjust(4) + " " output2 = " " for size2 in sizes: scores = [] for i in range(100): score = sss(make_subset(arena, size1), make_subset(arena, size2)) scores.append(score) output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) output2 += (" %.5f" % (np.std(scores),)).ljust(12) print(output1) print(output2)The output table is very close to symmetric under transposition, and with a nearly constant similarity score of 0.011:
10 20 50 100 200 500 1000 2000 10 0.024 +/- 0.012 +/- 0.005 +/- 0.013 +/- 0.013 +/- 0.015 +/- 0.008 +/- 0.011 +/- 0.23916 0.12318 0.05309 0.06569 0.04212 0.03571 0.02265 0.01623 20 0.012 +/- 0.013 +/- 0.008 +/- 0.013 +/- 0.013 +/- 0.010 +/- 0.012 +/- 0.016 +/- 0.12176 0.08863 0.05592 0.03930 0.02858 0.01905 0.02362 0.02322 50 0.005 +/- 0.028 +/- 0.018 +/- 0.010 +/- 0.010 +/- 0.010 +/- 0.011 +/- 0.012 +/- 0.04816 0.09862 0.04181 0.02301 0.02007 0.01040 0.01078 0.01331 100 0.010 +/- 0.006 +/- 0.011 +/- 0.012 +/- 0.013 +/- 0.011 +/- 0.011 +/- 0.011 +/- 0.05136 0.02660 0.02348 0.01906 0.01725 0.00920 0.00927 0.00801 200 0.011 +/- 0.013 +/- 0.012 +/- 0.011 +/- 0.010 +/- 0.010 +/- 0.012 +/- 0.011 +/- 0.05310 0.03826 0.02014 0.01716 0.00940 0.00745 0.00716 0.00496 500 0.013 +/- 0.008 +/- 0.011 +/- 0.012 +/- 0.012 +/- 0.011 +/- 0.012 +/- 0.012 +/- 0.03374 0.01458 0.01471 0.01177 0.00873 0.00597 0.00509 0.00354 1000 0.009 +/- 0.012 +/- 0.010 +/- 0.012 +/- 0.012 +/- 0.012 +/- 0.012 +/- 0.012 +/- 0.02291 0.02247 0.01068 0.00878 0.00639 0.00508 0.00347 0.00300 2000 0.011 +/- 0.011 +/- 0.013 +/- 0.013 +/- 0.011 +/- 0.011 +/- 0.011 +/- 0.011 +/- 0.01723 0.01354 0.01446 0.01031 0.00552 0.00332 0.00275 0.00209This is much more like what I expect from a similarity function!
Effect of threshold on the sss score
I picked a threshold of 0.8 because that's a pretty common value and few will object to it. It might be that a different threshold is better. How does the sss score change as a function of the threshold?
One obvious implementation is to call sss() with a different threshold each time:
>>> arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb") >>> subset1 = make_subset(arena, 1000) >>> subset2 = make_subset(arena, 1000) >>> sss(subset1, subset2, threshold=0.9) 0.0014073436068272435 >>> sss(subset1, subset2, threshold=0.8) 0.006656850064948667 >>> sss(subset1, subset2, threshold=0.7) 0.038875826785241686 >>> sss(subset1, subset2, threshold=0.6) 0.2569320521578928(As you can see, while I managed to normalize the score as a function of set size, it's still threshold dependent.)
The obvious implementation is perfectly reasonable. There's a more computationally efficient way. You'll notice that each calculation ends up redoing the calculation before it, because all compounds with at least 0.9 are also within at least 0.8 similarity, etc.
Chemfp has another way to get the same values. The cumulative_score_all() method takes extra arguments to select the min and max values to use, and if the summation should include or exclude the endpoints. I'll get a SearchResults instance and ask for its help:
>>> results = search.threshold_tanimoto_search_arena( ... subset1, subset2, threshold=0.6) >>> >>> help(results.cumulative_score_all) cumulative_score_all(self, min_score=None, max_score=None, interval='[]') method of chemfp.search.SearchResults instance The sum of all scores in all rows which are between *min_score* and *max_score* Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score. The default *min_score* of None is equivalent to -infinity. The default *max_score* of None is equivalent to +infinity. The *interval* parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported. :param min_score: the minimum score in the range. :type min_score: a float, or None for -infinity :param max_score: the maximum score in the range. :type max_score: a float, or None for +infinity :param interval: specify if the end points are open or closed. :type interval: one of "[]", "()", "(]", "[)" :returns: a floating point countI'll use the min_score parameter. First, I'll find all fingerprints with a similarity threshold of 0.6, which is the lowest score I'm concerned about, and, to reduce complexity later on, I'll merge the "300*" and the product of the sizes into a single scaling factor:
>>> results = search.threshold_tanimoto_search_arena( ... subset1, subset2, threshold=0.6) >>> scaling = 300/len(subset1)/len(subset2)Then I'll compute the scores for each threshold, and get the same results as earlier.
>>> scaling*results.cumulative_score_all(min_score=0.9) 0.0014073436068272435 >>> scaling*results.cumulative_score_all(min_score=0.8) 0.006656850064948667 >>> scaling*results.cumulative_score_all(min_score=0.7) 0.038875826785241686 >>> scaling*results.cumulative_score_all(min_score=0.6) 0.2569320521578928
For fun, I put it in a loop and computed more thresholds:
>>> scaling = 300/len(subset1)/len(subset2) >>> for threshold in (0.99, 0.95, 0.90, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6): ... score = scaling * results.cumulative_score_all(min_score=threshold) ... print("%.2f %7.4f" % (threshold, score)) ... 0.99 0.0003 0.95 0.0006 0.90 0.0014 0.85 0.0030 0.80 0.0067 0.75 0.0166 0.70 0.0389 0.65 0.0921 0.60 0.2569A quick test now shows that computing the similarities once and finding the cumulative sum for the 9 threshold levels is about 4x faster than doing 9 independent threshold calculations.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2020 Andrew Dalke Scientific AB