Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2017/03/20/fingerprint_set_similarity

Fingerprint set similarity

Someone recently asked how I might use chemfp to compute the similarity between two sets of fingerprints. This would be used for something like SEA (Similarity Ensemble Approach), which "relates proteins based on the set-wise chemical similarity among their ligands. It can be used to rapidly search large compound databases and to build cross-target similarity maps."

The underlying techniques of document set similarity are used in many fields. Bioinformatics, for example, uses it to analyze gene expression microarrays to find correlations between genes which might indicate they affect similar pathways. That said, I am no expert on the topic. I implemented a SEA-like algorithm for one client, but I don't know enough to judge the effectiveness of one approach over another, nor do I know the underlying statistics.

But that's okay. I'll assume that you know these details, and just want to know how to use chemfp to get the raw numbers to feed into your scoring function.

The first step is to create and load fingerprints. I have some pre-computed fingerprints for ChEMBL 21, so I'll use that as my example. I'll use the 2048-bit RDKit fingerprints. To generate them yourself, can use the command-line tool rdkit2fps, like this:

% rdkit2fps chembl_21.sdf.gz -o chembl_21.rdkit2048.fps.gz

The FPS file is a text-based, line-oriented format which is easy to for both humans and software to read and write. Easy, however, does not mean fast. It takes about 20 seconds to load this fps.gz file:

>>> import time
>>> import chemfp
>>> filename = "/Users/dalke/databases/chembl_21.rdkit.rdkit2048.fps.gz"
>>> t1 = time.time(); chemfp.load_fingerprints(filename);t2=time.time()
<chemfp.arena.FingerprintArena object at 0x1021aef90>
>>> t2-t1
19.388361930847168

Chemfp 2.0 introduced the FPB binary file format, which is more structured around the chemfp internal data structures. It's very fast for chemfp to load, but much more complicated. All of the chemfp tools know how to read and write the FPB format, so you could create the FPB file like this:

% rdkit2fps chembl_21.sdf.gz -o chembl_21.rdkit2048.fpb
If you already have an FPS (or gzip-ed FPS) file then use the fpcat tool to convert it to an FPB file:
% fpbcat chembl_21.rdkit2048.fps.gz -o chembl_21.rdkit2048.fpb
The conversion took about 32 seconds on my 7 year old laptop.

I'll switch back to Python and show why I developed the FPB format. The first time I loaded it took 0.4 seconds:

>>> t1=time.time(); chemfp.load_fingerprints("chembl_21.rdkit2048.fpb");t2=time.time()
<chemfp.fpb_io.FPBFingerprintArena object at 0x113809790>
>>> t2-t1
0.3760688304901123
Much of that time is likely waiting for the hard disk. When I open it again, the data is already in disk cache, so the load time is now under 0.1 seconds:
>>> t1=time.time(); chemfp.load_fingerprints("chembl_21.rdkit2048.fpb");t2=time.time()
<chemfp.fpb_io.FPBFingerprintArena object at 0x113809650>
>>> t2-t1
0.06340503692626953
That's much better than the 20 seconds it took to read from the gzipped FPS file. For what it's worth, for the RDKit 2048-bit fingerprint, the fps.gz file is slightly more compact than the .fpb file:
% ls -l chembl_21.rdkit2048.fpb chembl_21.rdkit2048.fps.gz
-rw-r--r--  1 dalke  staff  416031861 May 12  2016 chembl_21.rdkit2048.fps.gz
-rw-r--r--  1 dalke  staff  457103145 Mar 18 02:36 chembl_21.rdkit2048.fpb

Select a subset

In chemfp, a set of fingerprints is called an "arena". (I struggled for a while to come up with a good name. "A 'hand' of fingerprints" was humorous, but in the end I re-used the name from memory management.) I want to use a subset of the arena. The best solution is to use the arena.copy() method. By itself it makes a copy of the entire data set:

>>> import chemfp
>>> arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
>>> subarena = arena.copy()
>>> len(arena), len(subarena)
(1583839, 1583839)

Arenas act like an array of (id, fingerprint) pairs. The first element is term [0], the second [1], and so on. For example:

>>> arena[1234]
(u'CHEMBL79699', '\x04\x00\x00\x08 ... many bytes omitted ... \x00\xc0')
>>> arena[5678]
(u'CHEMBL291214', '\x00\x00\x00\x08 ... many bytes omitted ... \x00 \x80')

I'll make another copy, but this time ask chemfp to only copy indices 1234 and 5678:

>>> subarena = arena.copy(indices=[1234, 5678])
>>> len(subarena)
2
>>> subarena.ids
[u'CHEMBL79699', u'CHEMBL291214']
>>> subarena[0]
(u'CHEMBL79699', '\x04\x00\x00\x08 ... many bytes omitted ... \x00\xc0')
(Notes: by default the subarena may rearrange the fingerprints for better search performance. If you want the new arena to have the fingerprint in the same order as the list of indices then also pass in "reorder=False". Also, the u'' here shows that I'm using chemfp 3.0 on Python 2.7. Earlier versions of chemfp return both the id and fingerprint as byte strings, while chemfp 3.0 returns the id as a Unicode string and the fingerprint as a byte string. Under Python 2, Unicode strings are shown using the 'u' prefix. Under Python 3, byte strings are represented with the 'b' prefix.)

Make a random subset

I need some subsets so I can compare them. To get started I'll just use randomly selected elements. This is pleasently easy, using a technique I learned from reading the Python documentation for random.sample:

To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).
Note: the xrange() object is Python 2 specific. If you are using Python 3 then replace "xrange" with "range".

Here are some examples where I sample 4 values (without replacement) from 0...99:

>>> import random
>>> random.sample(xrange(100), 4)
[68, 49, 79, 38]
>>> random.sample(xrange(100), 4)
[77, 57, 33, 43]
>>> random.sample(xrange(100), 4)
[32, 70, 13, 53]
I'll change this to get 4 randomly indices in the arena:
>>> random.sample(xrange(len(arena)), 4)
[1089497, 196869, 1474590, 376331]
>>> random.sample(xrange(len(arena)), 4)
[161904, 1136119, 1455737, 518548]
>>> random.sample(xrange(len(arena)), 4)
[1092929, 218864, 1436330, 1357672]

Here's a function to create a subarena arena of size n, in easy-to-copy form, which I'll then execute interactively:

import random
def make_subset(arena, n):
    indices = random.sample(xrange(len(arena)), n)
    return arena.copy(indices=indices)

>>> a = make_subset(arena, 4)
>>> len(a)
4
>>> a.ids
[u'CHEMBL1448181', u'CHEMBL1532656', u'CHEMBL3480613', u'CHEMBL1765556']

Jaccard index between two sets

The Jaccard index is a widely used scoring function. (In cheminformatics we know it better as the Tanimoto, for historical reasons tied more to how knowledge diffuses than to strict date priority.) Simply put, it's defined as:

                number of elements common to both A and B
Jaccard(A, B) = -----------------------------------------
                    number of unique elements in A or B
The harder part is to turn this into something meaningful. What does it mean for an fingerprint record to be common to both A and B?

One possibility is to look at the identifier, and say that a record is common to A and B if and only if the record's identifier is in both sets. However, it would be nice to capture some sense of chemical similarity. We can instead say that a record is in common if and only if they have the same fingerprints, but 100% similarity is very strict.

Let's lower the threshold a bit, and say that a record 'a' in A is common to B if and only if there is at least one record 'b' in B such that the fingerprint Tanimoto(a, b) >= 0.8.

If you think at this for a bit you'll see a couple of things. First off, this definition is not symmetric. If there is at least one similar 'b' for some 'a' then there is at least one similar 'a' for that 'b', but 'a' might be the nearest neighbor of two different 'b's. (The asymmetry could be fixed by also including the results of a search of B against A. I'm not going to consider that approach.) Second, "has a neighbor in B" can be implemented as a k=1 nearest neighbor search of A against B, which is a built-in chemfp function.

I'll make two subsets with 1,000 elements each, randomly sampled from the ChEMBL-21 arena:

>>> subset1 = make_subset(arena, 1000)
>>> subset2 = make_subset(arena, 1000)
then use chemfp to do the k=1 nearest-neighbor search between two arenas, with a minimum required score of 0.8:
>>> from chemfp import search
>>> results = search.knearest_tanimoto_search_arena(subset1, subset2, k=1, threshold=0.8)

Most randomly chosen fingerprints are below 0.8 threshold, so they find 0 neighbors, but some of them have 1 neighbor:

>>> len(results[0])
0
>>> len(results[1])
0
>>> len(results[26])
1
There are a few ways to count the number of element in common. You can use normal Python functions, like:
>>> sum(len(result) for result in results)
24
A faster way is to let chemfp count all of the hits for you:
>>> results.count_all()
24

Okay, so Tanimoto similarity gives the numerator to the Jaccard set similarity. What about the denominator, which is "the number of unique elements in A or B"?

For that I'll count the total number of unique identifiers in both A and B. You saw earlier that arena.ids gives the list of identifiers for the arena. I'll use those to make a Python set made of the union of the two ids, then get the size of the resulting set:

>>> len(set(subset1.ids).union(subset2.ids))
2000
Thus, the Jaccard index of the two sets is 24/2000, or 0.012. They are not that similar.

A slighly more interesting case

The denominator of 2000 is rather boring, as it's 1000+1000, that is, the sum of the individual set sizes. That's because the two randomly chosen sets have no elements in common:

>>> len(set(subset1.ids).intersection(subset2.ids))
0
I'll make some random samples until there are at least 3 elements in common:
>>> while len(set(subset1.ids).intersection(subset2.ids)) < 3:
...   subset2 = make_subset(arena, 1000)
... 
>>> len(set(subset1.ids).intersection(subset2.ids))
4
>>> set(subset1.ids).intersection(subset2.ids)
set([u'CHEMBL64050', u'CHEMBL129835', u'CHEMBL255712', u'CHEMBL88792'])
>>> len(set(subset1.ids).union(subset2.ids))
1996
These two subsets have 40 elements in common:
>>> search.knearest_tanimoto_search_arena(subset1, subset2, k=1, threshold=0.8).count_all()
40
given them a Jaccard index of 40/1996, or 0.02. That's better than 0.012, but still not very similar.

Estimating background similarity

I want to get some idea of the average similarity between two randomly chosen sets, at different sizes and threshold scores. I'll define a 'jaccard' function to help with the calculation:

from __future__ import division  # for Python 2.x, make 1/2 be 0.5, not 0
from chemfp import search

# Note: this isn't quite the Jaccard similarity as the non-reflexive
# numerator means that often jaccard(A, B) != jaccard(B, A)
def jaccard(arena1, arena2, threshold=0.8):
  numerator = search.knearest_tanimoto_search_arena(
                 arena1, arena2, k=1, threshold=threshold).count_all()
  denominator = len(set(arena1.ids).union(arena2.ids))
  return numerator / denominator

>>> jaccard(subset1, subset2) 
0.02004008016032064 
>>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000))
0.017508754377188594
>>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000))
0.021
>>> jaccard(make_subset(arena, 1000), make_subset(arena, 1000))
0.0155

It varies quite a bit, so I'll compute 100 scores and get some information about the min, max, average, and standard deviation:

>>> scores = []
>>> for i in range(50):
...   scores.append(jaccard(make_subset(arena, 1000), make_subset(arena, 1000)))
... 
>>> min(scores), max(scores)
(0.008, 0.022)

What about the average and standard deviation? Python 3.4 introduced the statistics module, which has those as built-in functions. While chemfp 3.0 supports Python 3.5 and greater, I want this blog post to work on both Python 2.7 and 3.3, so I'll import some functionality from numpy:

>>> import numpy as np 
>>> np.mean(scores)
0.014819138037618857
>>> np.std(scores)
0.0025880786854096645
(One obvious question is, is standard deviation an appropriate characterization, or am I hammering a square block into a round hole? I'll let someone more knowledgeable about statistics figure that out.)

I then wrote a function to show the mean and standard deviation of 100 Jaccard scores, across the NxM cross product of some set sizes, that is size 10x10, size 10x20, ... 10x2000, 20x10, ... up to size 2000x2000. Here's the function, which is mostly taken up by code that tries to format things nicely:

def compute_score_table(arena):
  sizes = (10, 20, 50, 100, 200, 500, 1000, 2000)
  print("     " + "".join(str(size).center(12) for size in sizes))
  for size1 in sizes:
    output1 = str(size1).rjust(4) + "   "
    output2 = "       "
    for size2 in sizes:
      scores = []
      for i in range(100):
          score = jaccard(make_subset(arena, size1), make_subset(arena, size2))
          scores.append(score)
      output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) 
      output2 += ("  %.5f" % (np.std(scores),)).ljust(12)
    print(output1) 
    print(output2)

The output shows the average and standard deviations on alternate lines, so I could make better use of vertical space:

>>> compute_score_table(arena)
          10          20          50         100         200         500         1000        2000    
  10   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   0.000 +/-   
         0.00000     0.00000     0.00233     0.00127     0.00121     0.00059     0.00036     0.00033   
  20   0.000 +/-   0.000 +/-   0.001 +/-   0.001 +/-   0.001 +/-   0.001 +/-   0.001 +/-   0.001 +/-   
         0.00332     0.00000     0.00345     0.00266     0.00199     0.00115     0.00068     0.00054   
  50   0.000 +/-   0.001 +/-   0.001 +/-   0.002 +/-   0.002 +/-   0.001 +/-   0.001 +/-   0.001 +/-   
         0.00233     0.00371     0.00271     0.00318     0.00322     0.00146     0.00097     0.00067   
 100   0.000 +/-   0.001 +/-   0.001 +/-   0.001 +/-   0.003 +/-   0.003 +/-   0.003 +/-   0.002 +/-   
         0.00127     0.00198     0.00256     0.00256     0.00382     0.00193     0.00174     0.00121   
 200   0.000 +/-   0.001 +/-   0.002 +/-   0.002 +/-   0.004 +/-   0.005 +/-   0.005 +/-   0.005 +/-   
         0.00121     0.00276     0.00316     0.00311     0.00305     0.00286     0.00208     0.00119   
 500   0.000 +/-   0.001 +/-   0.002 +/-   0.003 +/-   0.006 +/-   0.009 +/-   0.010 +/-   0.010 +/-   
         0.00110     0.00108     0.00236     0.00298     0.00314     0.00318     0.00251     0.00177   
1000   0.000 +/-   0.001 +/-   0.002 +/-   0.004 +/-   0.006 +/-   0.012 +/-   0.014 +/-   0.016 +/-   
         0.00061     0.00132     0.00225     0.00280     0.00291     0.00275     0.00272     0.00254   
2000   0.001 +/-   0.001 +/-   0.002 +/-   0.004 +/-   0.007 +/-   0.014 +/-   0.019 +/-   0.024 +/-   
         0.00128     0.00130     0.00192     0.00269     0.00293     0.00303     0.00309     0.00250   
(By the way, if all of the values are 0.0 then you are likely using Python 2.7 and have omitted the "from __future__ import division" line when you defined the jaccard() function. I did that a few times when developing this essay.)

My jaccard() function is not reflexive, which you can verify by comparing terms i by j with j by i, as in 1000x2000 vs 2000x100. Their respective values are 0.019+/-0.00309 and 0.016+/-0.00254. The averages are multiple standard deviations away from each other.

The scores are also a function of set size, which is not what you want for a good scoring function. The 2000x2000 term has an average score of 0.024 while the 1000x1000 has an average score 0.014. The elements in the sets are randomly selected, so I expected the two comparisons to have about the same score. This bias towards larger sets is built into my similarity definition, so I conclud that my jaccard() implementation is not a good scoring function for sets.

There are ways to improve the scoring function, for example by normalizing to the background average/standard deviation to give a Z-score, or by better use of statistics. That is "left to the student as an exercise."

Here's the final code in a single block so you can use it more easily:

from __future__ import division

import random
import numpy as np
import chemfp
from chemfp import search

# Support both Python 2 and Python 3
try:
  xrange # Check if this is Python 2
except NameError:
  xrange = range # Use 'range' for Python 3 

def make_subset(arena, n):
  indices = random.sample(xrange(len(arena)), n)
  return arena.copy(indices=indices)

# Note: this isn't quite the Jaccard similarity as the non-reflexive
# numerator means that often jaccard(A, B) != jaccard(B, A)
def jaccard(arena1, arena2, threshold=0.8):
  numerator = search.knearest_tanimoto_search_arena(
                 arena1, arena2, k=1, threshold=threshold).count_all()
  denominator = len(set(arena1.ids).union(arena2.ids))
  return numerator / denominator

def compute_score_table(arena):
  sizes = (10, 20, 50, 100, 200, 500, 1000, 2000)
  print("     " + "".join(str(size).center(12) for size in sizes))
  for size1 in sizes:
    output1 = str(size1).rjust(4) + "   "
    output2 = "       "
    for size2 in sizes:
      scores = []
      for i in range(100):
          score = jaccard(make_subset(arena, size1), make_subset(arena, size2))
          scores.append(score)
      output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) 
      output2 += ("  %.5f" % (np.std(scores),)).ljust(12)
    print(output1) 
    print(output2) 


arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
compute_score_table(arena)

Details about the best match

While it isn't needed for the set-to-set comparison, you might be interested in knowing more about the closest match in the k=1 nearest search. It's easy to get the list of ids and scores for the non-empty rows:

>>> subset1 = make_subset(arena, 1000)
>>> subset2 = make_subset(arena, 1000)
>>> results = search.knearest_tanimoto_search_arena(
...     subset1, subset2, k=1, threshold=0.8)
...
>>> for result in results:
...   if result:
...     print("%-13s %.3f" % result.get_ids_and_scores()[0])
... 
CHEMBL386477  0.926
CHEMBL386477  1.000
CHEMBL275879  0.954
CHEMBL2138462 0.921
CHEMBL83932   0.813
CHEMBL2403669 0.805
CHEMBL2372083 0.819
CHEMBL1213903 0.988
CHEMBL1574001 0.901
CHEMBL2337012 0.913
CHEMBL370254  0.806
CHEMBL555410  0.908
CHEMBL2337012 0.897
CHEMBL1683180 0.822
CHEMBL73244   0.801
CHEMBL605903  0.831
CHEMBL1834630 0.806
CHEMBL1475722 0.821
CHEMBL65904   0.912
CHEMBL2106627 0.812
CHEMBL108353  0.816
CHEMBL108353  0.825

It becomes a bit more complicated if you also want information about the query and its fingerprint, so let's do that. I'll also print the query id, the number of 1-bits in the query and target fingerprint (better known as the popcount), and the number of 1-bits in their fingerprint intersection, that is, the number of bits in common. Here's the code:

import random
import numpy as np
import chemfp
from chemfp import search, bitops

# Support both Python 2 and Python 3
try:
  xrange # Check if this is Python 2
except NameError:
  xrange = range # Use 'range' for Python 3 

def make_subset(arena, n):
    indices = random.sample(xrange(len(arena)), n)
    return arena.copy(indices=indices)

arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")

subset1 = make_subset(arena, 1000)
subset2 = make_subset(arena, 1000)
results = search.knearest_tanimoto_search_arena(
  subset1, subset2, k=1, threshold=0.8)

print("  query_id     #bits    target_id    #bits common  score")
for i, result in enumerate(results):
    if not result:
        continue   # skip elements with no matches
    query_id, query_fp = subset1[i]
    num_query_bits = bitops.byte_popcount(query_fp)
    j, score = result.get_indices_and_scores()[0]
    target_id, target_fp = subset2[j]
    num_target_bits = bitops.byte_popcount(target_fp)
    in_common = bitops.byte_intersect_popcount(query_fp, target_fp)
    print("%-13s %5d   %-13s %5d  %5d   %.4f" %
          (query_id, num_query_bits, target_id, num_target_bits, in_common, score))
It generates the following output:
  query_id     #bits    target_id    #bits common  score
CHEMBL340793    710   CHEMBL443821    657    657   0.9254
CHEMBL2115272   710   CHEMBL1096675   758    667   0.8327
CHEMBL262239    761   CHEMBL411281    761    761   1.0000
CHEMBL529191    821   CHEMBL3434725   843    784   0.8909
CHEMBL389943    868   CHEMBL414961    887    807   0.8513
CHEMBL350381   1008   CHEMBL162288    899    899   0.8919
CHEMBL1407603  1042   CHEMBL1610293  1001    955   0.8778
CHEMBL484872   1058   CHEMBL1610293  1001    967   0.8855
CHEMBL381366   1066   CHEMBL426130   1060    985   0.8633
CHEMBL2058350  1098   CHEMBL2058354  1005   1005   0.9153
CHEMBL1898269  1112   CHEMBL1893417  1049    982   0.8329
CHEMBL132434   1123   CHEMBL15324    1158   1079   0.8977
CHEMBL332868   1128   CHEMBL77685    1160   1052   0.8511
CHEMBL2143647  1188   CHEMBL1893417  1049   1008   0.8202
CHEMBL1711444  1189   CHEMBL1310059  1231   1102   0.8361
CHEMBL3249400  1232   CHEMBL2371463  1128   1062   0.8182
CHEMBL1367025  1242   CHEMBL1352709  1345   1213   0.8828
CHEMBL77799    1290   CHEMBL1081652  1218   1135   0.8267
CHEMBL2303747  1296   CHEMBL15324    1158   1096   0.8071
CHEMBL451899   1656   CHEMBL477245   1551   1433   0.8078
CHEMBL362938   1708   CHEMBL1731136  1966   1641   0.8072
CHEMBL525425   1730   CHEMBL1731136  1966   1658   0.8135
CHEMBL516074   1742   CHEMBL3113501  1876   1691   0.8775
CHEMBL2029145  1764   CHEMBL2304149  1713   1706   0.9633
CHEMBL298265   1865   CHEMBL1731136  1966   1788   0.8752
CHEMBL373910   1906   CHEMBL1731136  1966   1830   0.8962

Cumulative value scoring methods

I earlier pointed out two big problems with the jaccard() implementation I proposed. It isn't symmetric (jaccard(a, b) is rarely equal to jaccard(b, a)), and it has a bias towards larger sizes.

There's another problem. Suppose A and B both have n elements, with no overlaps (making the denominator 2n), and suppose also that each element in A has exactly one neighbor with at least a 0.8 similarity. Then the jaccard() method will report a similarity of 1.0.

Now consider a set C which again has n elements, with no overlaps with elements in A. But this time each element of A has 5 neighbors in C, and vice versa. The jaccard() function only checks if there is at least one neighbor, so will still return a score of 1.0, even though sets A and C are more compact and closer to each other than A was to B.

One way to capture that information is to look at the total number of matches between A and B, and between A and C. Instead of doing a k=1 nearest neighbor search, I'll do a threshold search and count the total number of matches between the two:

>>> A = make_subset(arena, 1000)
>>> B = make_subset(arena, 1000)
>>> C = make_subset(arena, 1000)
>>> results_AB = search.threshold_tanimoto_search_arena(A, B, threshold=0.8)
>>> results_AB.count_all()
34
>>> results_AC = search.threshold_tanimoto_search_arena(A, C, threshold=0.8)
>>> results_AC.count_all()
49
This result is also reflexive, that is, comparing (A, B) and (B, A) give the same values.

One possible concern is that the match count assumes that all similarities are of equal importance, so a similarity of 0.8 is just as strong as 1.0. A more subtle method might use the cumulative sum of the scores in AxB and AxC, which is returned by the cumulative_score_all() method:

>>> results_AB.cumulative_score_all()
29.499029322763217
>>> results_AC.cumulative_score_all()
42.23963171858597

I pointed out earlier that my naive jaccard() implementation is biased towards comparing large sets. A similar bias applies when using counts or cumulative scores. With randomly selected elements, the count or cumulative scores will scale as the product of the set sizes:

>>> def get_count(n, m, repeat=100):
...   count = 0
...   for i in range(repeat):
...     subset1 = make_subset(arena, n)
...     subset2 = make_subset(arena, m)
...     results = search.threshold_tanimoto_search_arena(subset1, subset2)
...     count += results.count_all()
...   return count
... 
>>> get_count(1000, 1000)
29709
>>> get_count(1000, 1000)
29803
>>> get_count(2000, 1000)
58237
>>> get_count(2000, 1000)
61533
>>> get_count(3000, 1000)
91164
>>> get_count(3000, 1000)
91833
>>> get_count(3000, 3000)
265121
>>> get_count(3000, 3000)
Change the count_all() to cumulative_score_all() to get similar results using the sum of all the scores.

As I mentioned, an advantage of the count and cumulative score approach over the k=1 nearest-neighbor search is that they are symmetric, or more precisely, reflexive. Compare the following to the (3000, 1000) output of the previous code and you'll see they give about the same values:

>>> get_count(1000, 3000)
91185
>>> get_count(1000, 3000)
90586

The joy of floating point arithmetic

This is an aside. Here are two alternative ways to compute the cumulative sum of all of the scores, the first by computing the cumulative_sum() of each individual result, and the second by computing the sum of the sum of each list of scores using Python.

>>> sum(result.cumulative_score() for result in results_AB)
29.499029322763214
>>> sum(sum(result.get_scores()) for result in results_AB)
29.499029322763214
It really bugs me that the cumulative_score_all() and the other two approaches differ by -3.552713678800501e-15, which is the least significant bit of the two representations:
>>> sum(sum(result.get_scores()) for result in results_AB).hex()
'0x1.d7fc062bd0356p+4'
>>> results_AB.cumulative_score_all().hex()
'0x1.d7fc062bd0357p+4'

After a bit of research, I confirmed that it's not due to a chemfp bug, but rather because floating point addition using doubles is not associative. If I sum up all of the scores in order, I get the value from cumulative_score_all():

>>> scores = []
>>> for result in results_AB:
...   scores.extend(result.get_scores())
... 
>>> sum(scores)
29.499029322763217

If I group the scores so I first sum up each row, I get the other value:

>>> scores = []
>>> for result in results_AB:
...   scores.append(sum(result.get_scores()))
... 
>>> sum(scores)
29.499029322763214

Minor differences in the last significant digit are an all too common consequence of using floating point numbers, so while it annoyed me, it was the first reason I thought of.

Cumulative score-based similarity

In an earlier section I showed that the total count or the cumulative score between two randomly chosen sets scales as the product of the two set sizes. What about normalizing the cumulative score by that product?

There's probably a name for this similarity measure, but I don't know much about this field and don't know the name. I'll call it "sss" for "sum of scores similarity".

# I don't actually use this one
def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  return results.cumulative_score_all() / (len(arena1) * len(arena2))
After some experimentation, I decided to scale the score by 300, because that results in two sets with randomly chosen terms having a similarity of 0.01.
# This is the one I use.
def sss(arena1, arena2, threshold=0.8):
  results = search.threshold_tanimoto_search_arena(
                 arena1, arena2, threshold=threshold)
  # The scaling factor of 300 was chosen so that random ChEMBL
  # subsets have a score of about 0.01. It doesn't matter for
  # ranking as it's used as an arbitrary scaling factor.
  similarity = 300 * results.cumulative_score_all() / (len(arena1) * len(arena2))
  return similarity
I made a variant of the similarity table I used before, this time with the sss() function as my similarity measure, instead of jaccard():
def compute_sss_table(arena):
  sizes = (10, 20, 50, 100, 200, 500, 1000, 2000)
  print("     " + "".join(str(size).center(12) for size in sizes))
  for size1 in sizes:
    output1 = str(size1).rjust(4) + "   "
    output2 = "       "
    for size2 in sizes:
      scores = []
      for i in range(100):
          score = sss(make_subset(arena, size1), make_subset(arena, size2))
          scores.append(score)
      output1 += ("%.3f +/-" % (np.mean(scores),)).ljust(12) 
      output2 += ("  %.5f" % (np.std(scores),)).ljust(12)
    print(output1) 
    print(output2) 
The output table is very close to symmetric under transposition, and with a nearly constant similarity score of 0.011:
          10          20          50         100         200         500         1000        2000    
  10   0.024 +/-   0.012 +/-   0.005 +/-   0.013 +/-   0.013 +/-   0.015 +/-   0.008 +/-   0.011 +/-   
         0.23916     0.12318     0.05309     0.06569     0.04212     0.03571     0.02265     0.01623   
  20   0.012 +/-   0.013 +/-   0.008 +/-   0.013 +/-   0.013 +/-   0.010 +/-   0.012 +/-   0.016 +/-   
         0.12176     0.08863     0.05592     0.03930     0.02858     0.01905     0.02362     0.02322   
  50   0.005 +/-   0.028 +/-   0.018 +/-   0.010 +/-   0.010 +/-   0.010 +/-   0.011 +/-   0.012 +/-   
         0.04816     0.09862     0.04181     0.02301     0.02007     0.01040     0.01078     0.01331   
 100   0.010 +/-   0.006 +/-   0.011 +/-   0.012 +/-   0.013 +/-   0.011 +/-   0.011 +/-   0.011 +/-   
         0.05136     0.02660     0.02348     0.01906     0.01725     0.00920     0.00927     0.00801   
 200   0.011 +/-   0.013 +/-   0.012 +/-   0.011 +/-   0.010 +/-   0.010 +/-   0.012 +/-   0.011 +/-   
         0.05310     0.03826     0.02014     0.01716     0.00940     0.00745     0.00716     0.00496   
 500   0.013 +/-   0.008 +/-   0.011 +/-   0.012 +/-   0.012 +/-   0.011 +/-   0.012 +/-   0.012 +/-   
         0.03374     0.01458     0.01471     0.01177     0.00873     0.00597     0.00509     0.00354   
1000   0.009 +/-   0.012 +/-   0.010 +/-   0.012 +/-   0.012 +/-   0.012 +/-   0.012 +/-   0.012 +/-   
         0.02291     0.02247     0.01068     0.00878     0.00639     0.00508     0.00347     0.00300   
2000   0.011 +/-   0.011 +/-   0.013 +/-   0.013 +/-   0.011 +/-   0.011 +/-   0.011 +/-   0.011 +/-   
         0.01723     0.01354     0.01446     0.01031     0.00552     0.00332     0.00275     0.00209   
This is much more like what I expect from a similarity function!

Effect of threshold on the sss score

I picked a threshold of 0.8 because that's a pretty common value and few will object to it. It might be that a different threshold is better. How does the sss score change as a function of the threshold?

One obvious implementation is to call sss() with a different threshold each time:

>>> arena = chemfp.load_fingerprints("chembl_21.rdkit2048.fpb")
>>> subset1 = make_subset(arena, 1000)
>>> subset2 = make_subset(arena, 1000)
>>> sss(subset1, subset2, threshold=0.9)
0.0014073436068272435
>>> sss(subset1, subset2, threshold=0.8)
0.006656850064948667
>>> sss(subset1, subset2, threshold=0.7)
0.038875826785241686
>>> sss(subset1, subset2, threshold=0.6)
0.2569320521578928
(As you can see, while I managed to normalize the score as a function of set size, it's still threshold dependent.)

The obvious implementation is perfectly reasonable. There's a more computationally efficient way. You'll notice that each calculation ends up redoing the calculation before it, because all compounds with at least 0.9 are also within at least 0.8 similarity, etc.

Chemfp has another way to get the same values. The cumulative_score_all() method takes extra arguments to select the min and max values to use, and if the summation should include or exclude the endpoints. I'll get a SearchResults instance and ask for its help:

>>> results = search.threshold_tanimoto_search_arena(
...                  subset1, subset2, threshold=0.6)
>>> 
>>> help(results.cumulative_score_all)
cumulative_score_all(self, min_score=None, max_score=None, interval='[]')
method of chemfp.search.SearchResults instance
    The sum of all scores in all rows which are between *min_score* and *max_score*
    
    Using the default parameters this returns the sum of all of
    the scores in all of the results. With a specified range this
    returns the sum of all of the scores in that range. The
    cumulative score is also known as the raw score.
    
    The default *min_score* of None is equivalent to -infinity.
    The default *max_score* of None is equivalent to +infinity.
    
    The *interval* parameter describes the interval end
    conditions. The default of "[]" uses a closed interval,
    where min_score <= score <= max_score. The interval "()"
    uses the open interval where min_score < score < max_score.
    The half-open/half-closed intervals "(]" and "[)" are
    also supported.
    
    :param min_score: the minimum score in the range.
    :type min_score: a float, or None for -infinity
    :param max_score: the maximum score in the range.
    :type max_score: a float, or None for +infinity
    :param interval: specify if the end points are open or closed.
    :type interval: one of "[]", "()", "(]", "[)"
    :returns: a floating point count
I'll use the min_score parameter. First, I'll find all fingerprints with a similarity threshold of 0.6, which is the lowest score I'm concerned about, and, to reduce complexity later on, I'll merge the "300*" and the product of the sizes into a single scaling factor:
>>> results = search.threshold_tanimoto_search_arena(
...                  subset1, subset2, threshold=0.6)
>>> scaling = 300/len(subset1)/len(subset2) 
Then I'll compute the scores for each threshold, and get the same results as earlier.
>>> scaling*results.cumulative_score_all(min_score=0.9)
0.0014073436068272435
>>> scaling*results.cumulative_score_all(min_score=0.8)
0.006656850064948667
>>> scaling*results.cumulative_score_all(min_score=0.7)
0.038875826785241686
>>> scaling*results.cumulative_score_all(min_score=0.6)
0.2569320521578928

For fun, I put it in a loop and computed more thresholds:

>>> scaling = 300/len(subset1)/len(subset2) 
>>> for threshold in (0.99, 0.95, 0.90, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6):
...   score = scaling * results.cumulative_score_all(min_score=threshold)
...   print("%.2f %7.4f" % (threshold, score))
... 
0.99  0.0003
0.95  0.0006
0.90  0.0014
0.85  0.0030
0.80  0.0067
0.75  0.0166
0.70  0.0389
0.65  0.0921
0.60  0.2569
A quick test now shows that computing the similarities once and finding the cumulative sum for the 9 threshold levels is about 4x faster than doing 9 independent threshold calculations.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB