Fingerprint set similarity

This is part 1 of a three-part series in generating fingerprint-based set similarities using chemfp. Read part 2 where I figure out how to use the ChEMBL bioactivity data and part 3 where I put everything together and visualize the results.

Someone recently asked how I might use chemfp to compute the similarity between two sets of fingerprints. This would be used for something like SEA (Similarity Ensemble Approach), which "relates proteins based on the set-wise chemical similarity among their ligands. It can be used to rapidly search large compound databases and to build cross-target similarity maps."

The underlying techniques of document set similarity are used in many fields. Bioinformatics, for example, uses it to analyze gene expression microarrays to find correlations between genes which might indicate they affect similar pathways. That said, I am no expert on the topic. I implemented a SEA-like algorithm for one client, but I don't know enough to judge the effectiveness of one approach over another, nor do I know the underlying statistics.

But that's okay. I'll assume that *you* know these details, and
just want to know how to use chemfp to get the raw numbers to feed
into your scoring function.

The first step is to create and load fingerprints. I have some
pre-computed fingerprints for ChEMBL 21, so I'll use that as my
example. I'll use the 2048-bit RDKit fingerprints. To generate them
yourself, can use the command-line tool `rdkit2fps`, like this:

%rdkit2fps chembl_21.sdf.gz -o chembl_21.rdkit2048.fps.gz

The FPS file is a text-based, line-oriented format which is easy to for both humans and software to read and write. Easy, however, does not mean fast. It takes about 20 seconds to load this fps.gz file:

>>>import time>>>import chemfp>>>filename = "/Users/dalke/databases/chembl_21.rdkit.rdkit2048.fps.gz">>>t1 = time.time(); chemfp.load_fingerprints(filename);t2=time.time()<chemfp.arena.FingerprintArena object at 0x1021aef90> >>>t2-t119.388361930847168

Chemfp 2.0 introduced the FPB binary file format, which is more structured around the chemfp internal data structures. It's very fast for chemfp to load, but much more complicated. All of the chemfp tools know how to read and write the FPB format, so you could create the FPB file like this:

%If you already have an FPS (or gzip-ed FPS) file then use therdkit2fps chembl_21.sdf.gz -o chembl_21.rdkit2048.fpb

`fpcat`tool to convert it to an FPB file:

%The conversion took about 32 seconds on my 7 year old laptop.fpbcat chembl_21.rdkit2048.fps.gz -o chembl_21.rdkit2048.fpb

I'll switch back to Python and show why I developed the FPB format. The first time I loaded it took 0.4 seconds:

>>>Much of that time is likely waiting for the hard disk. When I open it again, the data is already in disk cache, so the load time is now under 0.1 seconds:t1=time.time(); chemfp.load_fingerprints("chembl_21.rdkit2048.fpb");t2=time.time()<chemfp.fpb_io.FPBFingerprintArena object at 0x113809790> >>>t2-t10.3760688304901123

>>>That's much better than the 20 seconds it took to read from the gzipped FPS file. For what it's worth, for the RDKit 2048-bit fingerprint, the fps.gz file is slightly more compact than the .fpb file:t1=time.time(); chemfp.load_fingerprints("chembl_21.rdkit2048.fpb");t2=time.time()<chemfp.fpb_io.FPBFingerprintArena object at 0x113809650> >>>t2-t10.06340503692626953

%ls -l chembl_21.rdkit2048.fpb chembl_21.rdkit2048.fps.gz-rw-r--r-- 1 dalke staff 416031861 May 12 2016 chembl_21.rdkit2048.fps.gz -rw-r--r-- 1 dalke staff 457103145 Mar 18 02:36 chembl_21.rdkit2048.fpb

## Select a subset

In chemfp, a set of fingerprints is called an "arena". (I struggled
for a while to come up with a good name. "A 'hand' of fingerprints"
was humorous, but in the end I re-used the name from memory
management.) I want to use a subset of the arena. The best
solution is to use the `arena.copy()` method. By itself it
makes a copy of the entire data set:

>>>import chemfp>>>arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")>>>subarena = arena.copy()>>>len(arena), len(subarena)(1583839, 1583839)

Arenas act like an array of (id, fingerprint) pairs. The first element is term [0], the second [1], and so on. For example:

>>>arena[1234](u'CHEMBL79699', '\x04\x00\x00\x08 ... many bytes omitted ... \x00\xc0') >>>arena[5678](u'CHEMBL291214', '\x00\x00\x00\x08 ... many bytes omitted ... \x00 \x80')

I'll make another copy, but this time ask chemfp to only copy indices 1234 and 5678:

>>>(Notes: by default the subarena may rearrange the fingerprints for better search performance. If you want the new arena to have the fingerprint in the same order as the list of indices then also pass in "reorder=False". Also, the u'' here shows that I'm using chemfp 3.0 on Python 2.7. Earlier versions of chemfp return both the id and fingerprint as byte strings, while chemfp 3.0 returns the id as a Unicode string and the fingerprint as a byte string. Under Python 2, Unicode strings are shown using the 'u' prefix. Under Python 3, byte strings are represented with the 'b' prefix.)subarena = arena.copy(indices=[1234, 5678])>>>len(subarena)2 >>>subarena.ids[u'CHEMBL79699', u'CHEMBL291214'] >>>subarena[0](u'CHEMBL79699', '\x04\x00\x00\x08 ... many bytes omitted ... \x00\xc0')

## Make a random subset

I need some subsets so I can compare them. To get started I'll just
use randomly selected elements. This is pleasently easy, using a
technique I learned from reading the Python
documentation for `random.sample`:

To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).Note: the

`xrange()`object is Python 2 specific. If you are using Python 3 then replace "xrange" with "range".

Here are some examples where I sample 4 values (without replacement) from 0...99:

>>>I'll change this to get 4 randomly indices in the arena:import random>>>random.sample(xrange(100), 4)[68, 49, 79, 38] >>>random.sample(xrange(100), 4)[77, 57, 33, 43] >>>random.sample(xrange(100), 4)[32, 70, 13, 53]

>>>random.sample(xrange(len(arena)), 4)[1089497, 196869, 1474590, 376331] >>>random.sample(xrange(len(arena)), 4)[161904, 1136119, 1455737, 518548] >>>random.sample(xrange(len(arena)), 4)[1092929, 218864, 1436330, 1357672]

Here's a function to create a subarena arena of size `n`, in
easy-to-copy form, which I'll then execute interactively:

import random def make_subset(arena, n): indices = random.sample(xrange(len(arena)), n) return arena.copy(indices=indices)>>>a = make_subset(arena, 4)>>>len(a)4 >>>a.ids[u'CHEMBL1448181', u'CHEMBL1532656', u'CHEMBL3480613', u'CHEMBL1765556']

## Jaccard index between two sets

The Jaccard index is a widely used scoring function. (In cheminformatics we know it better as the Tanimoto, for historical reasons tied more to how knowledge diffuses than to strict date priority.) Simply put, it's defined as:

number of elements common to both A and B Jaccard(A, B) = ----------------------------------------- number of unique elements in A or BThe harder part is to turn this into something meaningful. What does it mean for an fingerprint record to be common to both A and B?

One possibility is to look at the identifier, and say that a record is common to A and B if and only if the record's identifier is in both sets. However, it would be nice to capture some sense of chemical similarity. We can instead say that a record is in common if and only if they have the same fingerprints, but 100% similarity is very strict.

Let's lower the threshold a bit, and say that a record 'a' in A is common to B if and only if there is at least one record 'b' in B such that the fingerprint Tanimoto(a, b) >= 0.8.

If you think at this for a bit you'll see a couple of things. First
off, this definition is *not* symmetric. If there is at least one
similar 'b' for some 'a' then there is at least one similar 'a' for
that 'b', but 'a' might be the nearest neighbor of two different
'b's. (The asymmetry could be fixed by also including the results of a
search of B against A. I'm not going to consider that approach.)
Second, "has a neighbor in B" can be implemented as a k=1 nearest
neighbor search of A against B, which is a built-in chemfp function.

I'll make two subsets with 1,000 elements each, randomly sampled from the ChEMBL-21 arena:

>>>then use chemfp to do the k=1 nearest-neighbor search between two arenas, with a minimum required score of 0.8:subset1 = make_subset(arena, 1000)>>>subset2 = make_subset(arena, 1000)

>>>from chemfp import search>>>results = search.knearest_tanimoto_search_arena(subset1, subset2, k=1, threshold=0.8)

Most randomly chosen fingerprints are below 0.8 threshold, so they find 0 neighbors, but some of them have 1 neighbor:

>>>There are a few ways to count the number of element in common. You can use normal Python functions, like:len(results[0])0 >>>len(results[1])0 >>>len(results[26])1

>>>A faster way is to let chemfp count all of the hits for you:sum(len(result) for result in results)24

>>>results.count_all()24

Okay, so Tanimoto similarity gives the numerator to the Jaccard set similarity. What about the denominator, which is "the number of unique elements in A or B"?

For that I'll count the total number of unique identifiers in both A
and B. You saw earlier that `arena.ids` gives the list of
identifiers for the arena. I'll use those to make a Python
`set` made of the union of the two ids, then get the size of
the resulting set:

>>>Thus, the Jaccard index of the two sets is 24/2000, or 0.012. They are not that similar.len(set(subset1.ids).union(subset2.ids))2000

### A slighly more interesting case

The denominator of 2000 is rather boring, as it's 1000+1000, that is, the sum of the individual set sizes. That's because the two randomly chosen sets have no elements in common:

>>>I'll make some random samples until there are at least 3 elements in common:len(set(subset1.ids).intersection(subset2.ids))0

>>>These two subsets have 40 elements in common:while len(set(subset1.ids).intersection(subset2.ids)) < 3:...subset2 = make_subset(arena, 1000)... >>>len(set(subset1.ids).intersection(subset2.ids))4 >>>set(subset1.ids).intersection(subset2.ids)set([u'CHEMBL64050', u'CHEMBL129835', u'CHEMBL255712', u'CHEMBL88792']) >>>len(set(subset1.ids).union(subset2.ids))1996

>>>given them a Jaccard index of 40/1996, or 0.02. That's better than 0.012, but still not very similar.search.knearest_tanimoto_search_arena(subset1, subset2, k=1, threshold=0.8).count_all()40

## Estimating background similarity

I want to get some idea of the average similarity between two randomly
chosen sets, at different sizes and threshold scores. I'll define a
'`jaccard`' function to help with the calculation:

from __future__ import division # for Python 2.x, make 1/2 be 0.5, not 0 from chemfp import search# Note: this isn't quite the Jaccard similarity as the non-reflexive # numerator means that often jaccard(A, B) != jaccard(B, A)def jaccard(arena1, arena2, threshold=0.8): numerator = search.knearest_tanimoto_search_arena( arena1, arena2, k=1, threshold=threshold).count_all() denominator = len(set(arena1.ids).union(arena2.ids)) return numerator / denominator>>>jaccard(subset1, subset2)0.02004008016032064 >>>jaccard(make_subset(arena, 1000), make_subset(arena, 1000))0.017508754377188594 >>>jaccard(make_subset(arena, 1000), make_subset(arena, 1000))0.021 >>>jaccard(make_subset(arena, 1000), make_subset(arena, 1000))0.0155

It varies quite a bit, so I'll compute 100 scores and get some information about the min, max, average, and standard deviation:

>>>scores = []>>>for i in range(50):...scores.append(jaccard(make_subset(arena, 1000), make_subset(arena, 1000)))... >>>min(scores), max(scores)(0.008, 0.022)

What about the average and standard deviation? Python 3.4 introduced
the `statistics`
module, which has those as built-in functions. While chemfp 3.0
supports Python 3.5 and greater, I want this blog post to work on both
Python 2.7 and 3.3, so I'll import some functionality from numpy:

>>>(One obvious question is, is standard deviation an appropriate characterization, or am I hammering a square block into a round hole? I'll let someone more knowledgeable about statistics figure that out.)import numpy as np>>>np.mean(scores)0.014819138037618857 >>>np.std(scores)0.0025880786854096645

I then wrote a function to show the mean and standard deviation of 100 Jaccard scores, across the NxM cross product of some set sizes, that is size 10x10, size 10x20, ... 10x2000, 20x10, ... up to size 2000x2000. Here's the function, which is mostly taken up by code that tries to format things nicely:

def compute_score_table(arena): sizes = (10, 20, 50, 100, 200, 500, 1000, 2000) print(" " + "".join(str(size).center(12) for size in sizes)) for size1 in sizes: output1 = str(size1).rjust(4) + " " output2 = " " for size2 in sizes: scores = [] for i in range(100): score = jaccard(make_subset(arena, size1), make_subset(arena, size2)) scores.append(score) output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) output2 += (" %.5f" % (np.std(scores),)).ljust(12) print(output1) print(output2)

The output shows the average and standard deviations on alternate lines, so I could make better use of vertical space:

>>>(By the way, if all of the values are 0.0 then you are likely using Python 2.7 and have omitted the "from __future__ import division" line when you defined the jaccard() function. I did that a few times when developing this essay.)compute_score_table(arena)10 20 50 100 200 500 1000 2000 10 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.000 +/- 0.00000 0.00000 0.00233 0.00127 0.00121 0.00059 0.00036 0.00033 20 0.000 +/- 0.000 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.00332 0.00000 0.00345 0.00266 0.00199 0.00115 0.00068 0.00054 50 0.000 +/- 0.001 +/- 0.001 +/- 0.002 +/- 0.002 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.00233 0.00371 0.00271 0.00318 0.00322 0.00146 0.00097 0.00067 100 0.000 +/- 0.001 +/- 0.001 +/- 0.001 +/- 0.003 +/- 0.003 +/- 0.003 +/- 0.002 +/- 0.00127 0.00198 0.00256 0.00256 0.00382 0.00193 0.00174 0.00121 200 0.000 +/- 0.001 +/- 0.002 +/- 0.002 +/- 0.004 +/- 0.005 +/- 0.005 +/- 0.005 +/- 0.00121 0.00276 0.00316 0.00311 0.00305 0.00286 0.00208 0.00119 500 0.000 +/- 0.001 +/- 0.002 +/- 0.003 +/- 0.006 +/- 0.009 +/- 0.010 +/- 0.010 +/- 0.00110 0.00108 0.00236 0.00298 0.00314 0.00318 0.00251 0.00177 1000 0.000 +/- 0.001 +/- 0.002 +/- 0.004 +/- 0.006 +/- 0.012 +/- 0.014 +/- 0.016 +/- 0.00061 0.00132 0.00225 0.00280 0.00291 0.00275 0.00272 0.00254 2000 0.001 +/- 0.001 +/- 0.002 +/- 0.004 +/- 0.007 +/- 0.014 +/- 0.019 +/- 0.024 +/- 0.00128 0.00130 0.00192 0.00269 0.00293 0.00303 0.00309 0.00250

My `jaccard()` function is not reflexive, which you can verify
by comparing terms i by j with j by i, as in 1000x2000 vs
2000x100. Their respective values are 0.019+/-0.00309 and
0.016+/-0.00254. The averages are multiple standard deviations away
from each other.

The scores are also a function of set size, which is not what you want
for a good scoring function. The 2000x2000 term has an average score
of 0.024 while the 1000x1000 has an average score 0.014. The elements
in the sets are randomly selected, so I expected the two comparisons
to have about the same score. This bias towards larger sets is built
into my similarity definition, so I conclud that my
`jaccard()` implementation is not a good scoring function for
sets.

There are ways to improve the scoring function, for example by normalizing to the background average/standard deviation to give a Z-score, or by better use of statistics. That is "left to the student as an exercise."

Here's the final code in a single block so you can use it more easily:

from __future__ import division import random import numpy as np import chemfp from chemfp import search # Support both Python 2 and Python 3 try: xrange # Check if this is Python 2 except NameError: xrange = range # Use 'range' for Python 3 def make_subset(arena, n): indices = random.sample(xrange(len(arena)), n) return arena.copy(indices=indices) # Note: this isn't quite the Jaccard similarity as the non-reflexive # numerator means that often jaccard(A, B) != jaccard(B, A) def jaccard(arena1, arena2, threshold=0.8): numerator = search.knearest_tanimoto_search_arena( arena1, arena2, k=1, threshold=threshold).count_all() denominator = len(set(arena1.ids).union(arena2.ids)) return numerator / denominator def compute_score_table(arena): sizes = (10, 20, 50, 100, 200, 500, 1000, 2000) print(" " + "".join(str(size).center(12) for size in sizes)) for size1 in sizes: output1 = str(size1).rjust(4) + " " output2 = " " for size2 in sizes: scores = [] for i in range(100): score = jaccard(make_subset(arena, size1), make_subset(arena, size2)) scores.append(score) output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) output2 += (" %.5f" % (np.std(scores),)).ljust(12) print(output1) print(output2) arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb") compute_score_table(arena)

## Details about the best match

While it isn't needed for the set-to-set comparison, you might be interested in knowing more about the closest match in the k=1 nearest search. It's easy to get the list of ids and scores for the non-empty rows:

>>>subset1 = make_subset(arena, 1000)>>>subset2 = make_subset(arena, 1000)>>>results = search.knearest_tanimoto_search_arena(...subset1, subset2, k=1, threshold=0.8)... >>>for result in results:...if result:...print("%-13s %.3f" % result.get_ids_and_scores()[0])... CHEMBL386477 0.926 CHEMBL386477 1.000 CHEMBL275879 0.954 CHEMBL2138462 0.921 CHEMBL83932 0.813 CHEMBL2403669 0.805 CHEMBL2372083 0.819 CHEMBL1213903 0.988 CHEMBL1574001 0.901 CHEMBL2337012 0.913 CHEMBL370254 0.806 CHEMBL555410 0.908 CHEMBL2337012 0.897 CHEMBL1683180 0.822 CHEMBL73244 0.801 CHEMBL605903 0.831 CHEMBL1834630 0.806 CHEMBL1475722 0.821 CHEMBL65904 0.912 CHEMBL2106627 0.812 CHEMBL108353 0.816 CHEMBL108353 0.825

It becomes a bit more complicated if you also want information about the query and its fingerprint, so let's do that. I'll also print the query id, the number of 1-bits in the query and target fingerprint (better known as the popcount), and the number of 1-bits in their fingerprint intersection, that is, the number of bits in common. Here's the code:

import random import numpy as np import chemfp from chemfp import search, bitops # Support both Python 2 and Python 3 try: xrange # Check if this is Python 2 except NameError: xrange = range # Use 'range' for Python 3 def make_subset(arena, n): indices = random.sample(xrange(len(arena)), n) return arena.copy(indices=indices) arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb") subset1 = make_subset(arena, 1000) subset2 = make_subset(arena, 1000) results = search.knearest_tanimoto_search_arena( subset1, subset2, k=1, threshold=0.8) print(" query_id #bits target_id #bits common score") for i, result in enumerate(results): if not result: continue # skip elements with no matches query_id, query_fp = subset1[i] num_query_bits = bitops.byte_popcount(query_fp) j, score = result.get_indices_and_scores()[0] target_id, target_fp = subset2[j] num_target_bits = bitops.byte_popcount(target_fp) in_common = bitops.byte_intersect_popcount(query_fp, target_fp) print("%-13s %5d %-13s %5d %5d %.4f" % (query_id, num_query_bits, target_id, num_target_bits, in_common, score))It generates the following output:

query_id #bits target_id #bits common score CHEMBL340793 710 CHEMBL443821 657 657 0.9254 CHEMBL2115272 710 CHEMBL1096675 758 667 0.8327 CHEMBL262239 761 CHEMBL411281 761 761 1.0000 CHEMBL529191 821 CHEMBL3434725 843 784 0.8909 CHEMBL389943 868 CHEMBL414961 887 807 0.8513 CHEMBL350381 1008 CHEMBL162288 899 899 0.8919 CHEMBL1407603 1042 CHEMBL1610293 1001 955 0.8778 CHEMBL484872 1058 CHEMBL1610293 1001 967 0.8855 CHEMBL381366 1066 CHEMBL426130 1060 985 0.8633 CHEMBL2058350 1098 CHEMBL2058354 1005 1005 0.9153 CHEMBL1898269 1112 CHEMBL1893417 1049 982 0.8329 CHEMBL132434 1123 CHEMBL15324 1158 1079 0.8977 CHEMBL332868 1128 CHEMBL77685 1160 1052 0.8511 CHEMBL2143647 1188 CHEMBL1893417 1049 1008 0.8202 CHEMBL1711444 1189 CHEMBL1310059 1231 1102 0.8361 CHEMBL3249400 1232 CHEMBL2371463 1128 1062 0.8182 CHEMBL1367025 1242 CHEMBL1352709 1345 1213 0.8828 CHEMBL77799 1290 CHEMBL1081652 1218 1135 0.8267 CHEMBL2303747 1296 CHEMBL15324 1158 1096 0.8071 CHEMBL451899 1656 CHEMBL477245 1551 1433 0.8078 CHEMBL362938 1708 CHEMBL1731136 1966 1641 0.8072 CHEMBL525425 1730 CHEMBL1731136 1966 1658 0.8135 CHEMBL516074 1742 CHEMBL3113501 1876 1691 0.8775 CHEMBL2029145 1764 CHEMBL2304149 1713 1706 0.9633 CHEMBL298265 1865 CHEMBL1731136 1966 1788 0.8752 CHEMBL373910 1906 CHEMBL1731136 1966 1830 0.8962

## Cumulative value scoring methods

I earlier pointed out two big problems with the `jaccard()`
implementation I proposed. It isn't symmetric (`jaccard(a, b)`
is rarely equal to `jaccard(b, a)`), and it has a bias towards
larger sizes.

There's another problem. Suppose A and B both have *n* elements,
with no overlaps (making the denominator *2n*), and suppose also
that each element in A has exactly one neighbor with at least a 0.8
similarity. Then the `jaccard()` method will report a
similarity of 1.0.

Now consider a set C which again has *n* elements, with no
overlaps with elements in A. But this time each element of A has 5
neighbors in C, and vice versa. The `jaccard()` function only
checks if there is at least one neighbor, so will still return a score
of 1.0, even though sets A and C are more compact and closer to each
other than A was to B.

One way to capture that information is to look at the total number of matches between A and B, and between A and C. Instead of doing a k=1 nearest neighbor search, I'll do a threshold search and count the total number of matches between the two:

>>>This result is also reflexive, that is, comparing (A, B) and (B, A) give the same values.A = make_subset(arena, 1000)>>>B = make_subset(arena, 1000)>>>C = make_subset(arena, 1000)>>>results_AB = search.threshold_tanimoto_search_arena(A, B, threshold=0.8)>>>results_AB.count_all()34 >>>results_AC = search.threshold_tanimoto_search_arena(A, C, threshold=0.8)>>>results_AC.count_all()49

One possible concern is that the match count assumes that all
similarities are of equal importance, so a similarity of 0.8 is just
as strong as 1.0. A more subtle method might use the cumulative sum of
the scores in AxB and AxC, which is returned by the
`cumulative_score_all()` method:

>>>results_AB.cumulative_score_all()29.499029322763217 >>>results_AC.cumulative_score_all()42.23963171858597

I pointed out earlier that my naive `jaccard()` implementation
is biased towards comparing large sets. A similar bias applies when
using counts or cumulative scores. With randomly selected elements,
the count or cumulative scores will scale as the product of the set
sizes:

>>>Change thedef get_count(n, m, repeat=100):...count = 0...for i in range(repeat):...subset1 = make_subset(arena, n)...subset2 = make_subset(arena, m)...results = search.threshold_tanimoto_search_arena(subset1, subset2)...count += results.count_all()...return count... >>>get_count(1000, 1000)29709 >>>get_count(1000, 1000)29803 >>>get_count(2000, 1000)58237 >>>get_count(2000, 1000)61533 >>>get_count(3000, 1000)91164 >>>get_count(3000, 1000)91833 >>>get_count(3000, 3000)265121 >>>get_count(3000, 3000)

`count_all()`to

`cumulative_score_all()`to get similar results using the sum of all the scores.

As I mentioned, an advantage of the count and cumulative score approach over the k=1 nearest-neighbor search is that they are symmetric, or more precisely, reflexive. Compare the following to the (3000, 1000) output of the previous code and you'll see they give about the same values:

>>>get_count(1000, 3000)91185 >>>get_count(1000, 3000)90586

### The joy of floating point arithmetic

This is an aside. Here are two alternative ways to compute the
cumulative sum of all of the scores, the first by computing the
`cumulative_sum()` of each individual result, and the second by
computing the sum of the sum of each list of scores using Python.

>>>It really bugs me that thesum(result.cumulative_score() for result in results_AB)29.499029322763214 >>>sum(sum(result.get_scores()) for result in results_AB)29.499029322763214

`cumulative_score_all()`and the other two approaches differ by -3.552713678800501e-15, which is the least significant bit of the two representations:

>>>sum(sum(result.get_scores()) for result in results_AB).hex()'0x1.d7fc062bd0356p+4' >>>results_AB.cumulative_score_all().hex()'0x1.d7fc062bd0357p+4'

After a bit of research, I confirmed that it's not due to a chemfp
bug, but rather because floating point addition using doubles is not
associative. If I sum up all of the scores in order, I get the value
from `cumulative_score_all()`:

>>>scores = []>>>for result in results_AB:...scores.extend(result.get_scores())... >>>sum(scores)29.499029322763217

If I group the scores so I first sum up each row, I get the other value:

>>>scores = []>>>for result in results_AB:...scores.append(sum(result.get_scores()))... >>>sum(scores)29.499029322763214

Minor differences in the last significant digit are an all too common consequence of using floating point numbers, so while it annoyed me, it was the first reason I thought of.

## Cumulative score-based similarity

In an earlier section I showed that the total count or the cumulative score between two randomly chosen sets scales as the product of the two set sizes. What about normalizing the cumulative score by that product?

There's probably a name for this similarity measure, but I don't know much about this field and don't know the name. I'll call it "sss" for "sum of scores similarity".

# I don't actually use this one def sss(arena1, arena2, threshold=0.8): results = search.threshold_tanimoto_search_arena( arena1, arena2, threshold=threshold) return results.cumulative_score_all() / (len(arena1) * len(arena2))After some experimentation, I decided to scale the score by 300, because that results in two sets with randomly chosen terms having a similarity of 0.01.

# This is the one I use. def sss(arena1, arena2, threshold=0.8): results = search.threshold_tanimoto_search_arena( arena1, arena2, threshold=threshold) # The scaling factor of 300 was chosen so that random ChEMBL # subsets have a score of about 0.01. It doesn't matter for # ranking as it's used as an arbitrary scaling factor. similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2)) return similarityI made a variant of the similarity table I used before, this time with the

`sss()`function as my similarity measure, instead of

`jaccard()`:

def compute_sss_table(arena): sizes = (10, 20, 50, 100, 200, 500, 1000, 2000) print(" " + "".join(str(size).center(12) for size in sizes)) for size1 in sizes: output1 = str(size1).rjust(4) + " " output2 = " " for size2 in sizes: scores = [] for i in range(100): score = sss(make_subset(arena, size1), make_subset(arena, size2)) scores.append(score) output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) output2 += (" %.5f" % (np.std(scores),)).ljust(12) print(output1) print(output2)The output table is very close to symmetric under transposition, and with a nearly constant similarity score of 0.011:

10 20 50 100 200 500 1000 2000 10 0.024 +/- 0.012 +/- 0.005 +/- 0.013 +/- 0.013 +/- 0.015 +/- 0.008 +/- 0.011 +/- 0.23916 0.12318 0.05309 0.06569 0.04212 0.03571 0.02265 0.01623 20 0.012 +/- 0.013 +/- 0.008 +/- 0.013 +/- 0.013 +/- 0.010 +/- 0.012 +/- 0.016 +/- 0.12176 0.08863 0.05592 0.03930 0.02858 0.01905 0.02362 0.02322 50 0.005 +/- 0.028 +/- 0.018 +/- 0.010 +/- 0.010 +/- 0.010 +/- 0.011 +/- 0.012 +/- 0.04816 0.09862 0.04181 0.02301 0.02007 0.01040 0.01078 0.01331 100 0.010 +/- 0.006 +/- 0.011 +/- 0.012 +/- 0.013 +/- 0.011 +/- 0.011 +/- 0.011 +/- 0.05136 0.02660 0.02348 0.01906 0.01725 0.00920 0.00927 0.00801 200 0.011 +/- 0.013 +/- 0.012 +/- 0.011 +/- 0.010 +/- 0.010 +/- 0.012 +/- 0.011 +/- 0.05310 0.03826 0.02014 0.01716 0.00940 0.00745 0.00716 0.00496 500 0.013 +/- 0.008 +/- 0.011 +/- 0.012 +/- 0.012 +/- 0.011 +/- 0.012 +/- 0.012 +/- 0.03374 0.01458 0.01471 0.01177 0.00873 0.00597 0.00509 0.00354 1000 0.009 +/- 0.012 +/- 0.010 +/- 0.012 +/- 0.012 +/- 0.012 +/- 0.012 +/- 0.012 +/- 0.02291 0.02247 0.01068 0.00878 0.00639 0.00508 0.00347 0.00300 2000 0.011 +/- 0.011 +/- 0.013 +/- 0.013 +/- 0.011 +/- 0.011 +/- 0.011 +/- 0.011 +/- 0.01723 0.01354 0.01446 0.01031 0.00552 0.00332 0.00275 0.00209This is much more like what I expect from a similarity function!

## Effect of threshold on the sss score

I picked a threshold of 0.8 because that's a pretty common value and few will object to it. It might be that a different threshold is better. How does the sss score change as a function of the threshold?

One obvious implementation is to call sss() with a different threshold each time:

>>>(As you can see, while I managed to normalize the score as a function of set size, it's still threshold dependent.)arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")>>>subset1 = make_subset(arena, 1000)>>>subset2 = make_subset(arena, 1000)>>>sss(subset1, subset2, threshold=0.9)0.0014073436068272435 >>>sss(subset1, subset2, threshold=0.8)0.006656850064948667 >>>sss(subset1, subset2, threshold=0.7)0.038875826785241686 >>>sss(subset1, subset2, threshold=0.6)0.2569320521578928

The obvious implementation is perfectly reasonable. There's a more computationally efficient way. You'll notice that each calculation ends up redoing the calculation before it, because all compounds with at least 0.9 are also within at least 0.8 similarity, etc.

Chemfp has another way to get the same values. The
`cumulative_score_all()` method takes extra arguments to select
the min and max values to use, and if the summation should include or
exclude the endpoints. I'll get a SearchResults instance and ask for
its help:

>>>I'll use theresults = search.threshold_tanimoto_search_arena(...subset1, subset2, threshold=0.6)>>> >>>help(results.cumulative_score_all)cumulative_score_all(self, min_score=None, max_score=None, interval='[]') method of chemfp.search.SearchResults instance The sum of all scores in all rows which are between *min_score* and *max_score* Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score. The default *min_score* of None is equivalent to -infinity. The default *max_score* of None is equivalent to +infinity. The *interval* parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported. :param min_score: the minimum score in the range. :type min_score: a float, or None for -infinity :param max_score: the maximum score in the range. :type max_score: a float, or None for +infinity :param interval: specify if the end points are open or closed. :type interval: one of "[]", "()", "(]", "[)" :returns: a floating point count

`min_score`parameter. First, I'll find all fingerprints with a similarity threshold of 0.6, which is the lowest score I'm concerned about, and, to reduce complexity later on, I'll merge the "300*" and the product of the sizes into a single scaling factor:

>>>Then I'll compute the scores for each threshold, and get the same results as earlier.results = search.threshold_tanimoto_search_arena(...subset1, subset2, threshold=0.6)>>>scaling = 300/len(subset1)/len(subset2)

>>>scaling*results.cumulative_score_all(min_score=0.9)0.0014073436068272435 >>>scaling*results.cumulative_score_all(min_score=0.8)0.006656850064948667 >>>scaling*results.cumulative_score_all(min_score=0.7)0.038875826785241686 >>>scaling*results.cumulative_score_all(min_score=0.6)0.2569320521578928

For fun, I put it in a loop and computed more thresholds:

>>>A quick test now shows that computing the similarities once and finding the cumulative sum for the 9 threshold levels is about 4x faster than doing 9 independent threshold calculations.scaling = 300/len(subset1)/len(subset2)>>>for threshold in (0.99, 0.95, 0.90, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6):...score = scaling * results.cumulative_score_all(min_score=threshold)...print("%.2f %7.4f" % (threshold, score))... 0.99 0.0003 0.95 0.0006 0.90 0.0014 0.85 0.0030 0.80 0.0067 0.75 0.0166 0.70 0.0389 0.65 0.0921 0.60 0.2569

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2020 Andrew Dalke Scientific AB