Block Tanimoto calculations

[ previous | newer ] /home/writings/diary/archive/2008/06/28/block_tanimoto_calculations

Block Tanimoto calculations

This is part 4 of an on-going series dealing with molecular fingerprints:

One query, many targets

In part 3 I wrote a C function to compute the Tanimoto similarity score between two fingerprints. The Tanimoto is most often used when searching a large set of target compounds. Some questions are "what's the Tanimoto similarity for every compound in the data set", "what are the 10 target compounds that are most similar to the query, sorted by similarity, and what are their similarity values?" and "how many compounds are at least some specified similarity threshold to the query compound?"

With the Tanimo.similarity function extension I defined earlier, it's simple to implement these algorithms. Here's one which prints the Tanimoto score for every compound in a given ".binary_fp" file using the first compound as the query.

# print_similarity.py
import sys
import tanimoto

filename = "drugs.binary_fp"
if len(sys.argv) > 1:
    filename = sys.argv[1]

# Get the query fingerprint
f = open(filename, "rb")
query_fp = f.read(128)  # 1024 bits


f.seek(0)
while 1:
    target_fp = f.read(128)   # 1024 bits
    if not target_fp:
        # reached the end of file
        break
    print tanimoto.similarity(query_fp, target_fp)

Here's some output to show you what it writes:

% python print_similarity.py nci.binary_fp | head
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
0.686813186813
0.132701421801
0.132701421801
0.602564102564
0.459330143541
Traceback (most recent call last):
  File "print_similarity.py", line 19, in 
    print tanimoto.similarity(query_fp, target_fp)
IOError: [Errno 32] Broken pipe

How fast is this code? I took the PubChem SD file for compounds "09425001 - 09450000" and converted it into a Babel fingerprint file, and thence into a ".binary_fp" file.

% babel Compound_09425001_09450000.sdf.gz nci_09425001_09450000.fpt -xh
24323 molecules converted
364134 audit log messages 
% python fpt_to_binary.py nci_09425001_09450000.fpt
% ls -l nci_09425001_09450000.*
-rw-r--r--   1 dalke  staff  3113344 Jun  9 16:30 nci_09425001_09450000.binary_fp
-rw-r--r--   1 dalke  staff  8215091 Jun  9 16:20 nci_09425001_09450000.fpt
% python -c 'print 3113344/24323.'
128.0

I concatenated 50 copies of the binary file together to get a single large file, called "nci.binary_fp":

% ls -l nci.binary_fp
-rw-r--r--   1 dalke  staff  155667200 Jun  9 16:46 nci.binary_fp
%

I had to make it this large because I wanted it to be somewhat realistic. This corresponds to a fingerprint database of 1,216,150 compounds.

How long does it take to print the Tanimoto scores for the entire data set?

% time python print_similarity.py nci.binary_fp > /dev/null
9.419u 0.442s 0:09.87 99.7%     0+0k 0+0io 0pf+0w
%

Decent. It means that PubChem (19,218,991 compounds) would take about 2.5 minutes, which is several times faster than the 7 minutes I estimated in pure Python, but still quite a bit slower than the 20 seconds I estimated was the limit using pure C.

Push blocks of targets to the C code

Where is the extra time being spent? Probably because I'm doing a lot of work in Python in order to do a very little bit of work in C. Every Tanimoto calculation requires a Python function, plus another to read 1024 bits, plus a test to see if it's the end of file. Python function calls take a lot of time compared to C function calls.

I could do some profiling to figure out if this guess is correct but it's about as easy to implement another solution and compare the timings. Rather than computing a single Tanimoto at a time I'm going to pass many fingerprints to the C function, and get a list of Tanimoto values back.

One of the points I've been bring up in these writings is that there are many solutions to a problem. In this case I'm going to pass a set of fingerprints to a C function, but I get to decide how I want to pass the data in. I could use a Python list of strings, or a Numpy array of unsigned integers, or many more possibilities. The solutions I'm exploring are based on experience but it doesn't mean I'm right, nor does it mean I can get to a good solution without some exploration.

tanimoto.block_similarity - my desired API

For my solution now I'm going to pass the set of query fingerprints as a single string, with the first fingerprint as the first 128 bytes, the second as the next 128 bytes, and so on. I'll call this single string which contains many fingerprints a "block" so my new function will be called "block_similarity." I want the result to be something like a Python list, where I can iterate over the results for each fingerprint, in order. In code, I want the main loop of the calcultion to act like the following interface:

while 1:
    target_fps = f.read(128*1024)   # 1024 fingerprints of 1024 bits each
    if not target_fps:
        # reached the end of file
        break
    for score in tanimoto.block_similarity(query_fp, target_fps):
        print score

I want the results returned as a list but C functions can only return a simple data type like an integer or a pointer. Again, there are several possible solutions. I'll have my Python wrapper module allocate space for the results and pass a pointer to that space to the C function, which assigns the correct values.

libtanimoto.c:block_similarity

With most of the hard work of allocating results handled by some yet-to-be-written Python code, the C code isn't that hard:

/* This is in libtanimoto.c */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;

  /* precompute popcount(query) */
  for (i=0; i<num_bytes; i++) {
    a += _popcount_counts[query[i]];
  }

  /* For each fingerprint in the block */
  while (num_targets-- > 0) {
    /* compute popcount(target) and popcount(target&query) */
	b=0; c=0;
    for (i=0; i<num_bytes; i++) {
      b += _popcount_counts[targets[i]];
      c += _popcount_counts[query[i]&targets[i]];
    }
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets += num_bytes;
  }
}

Because I expect to do a Tanimoto calculation using many targets I could do a bit of optimization. The Tanimoto function is "c/(a+b-c)" where "a" is the number of on-bits in the query fingerprint. That's a constant so I compute it once.

tanimoto.py:block_similarity

The hard work of allocating memory goes into the 'tanimoto.py' file. I decided to use a numpy array to store the results because they are designed to work with ctypes. I can pass the underlying data storage area to a C function with only a 'cast'. Numpy also supports functions like 'numpy.sort' so working with numpy arrays can be more convenient than working with a memory block allocated directly by ctypes, which is my other likely possibility. On the other hand it does put a third-party dependency in the library.

Here's the code:

# This is added to 'tanimoto.py'
_block_similarity = libtanimoto.block_similarity
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
                              ctypes.c_int, ctypes.c_char_p,
                              ctypes.POINTER(ctypes.c_double)]
_block_similarity.restype = None

import numpy
def block_similarity(query, targets):
    n = len(query)
    m = len(targets)
    if m % n != 0:
        raise AssertionError("length mismatch")
    num_targets = m//n
    result = numpy.zeros(num_targets, numpy.double)
    result_ctypes = ctypes.cast(result.ctypes.data,
                                ctypes.POINTER(ctypes.c_double))
    _block_similarity(n, query, num_targets, targets, result_ctypes)
    return result

Looks easy, right? You'll be surprised at how long it took to get that working. I figured that because the numpy array is a 'numpy.double' then ctypes would see it as a ctypes.c_double array, without needing the cast. Instead, without the case ctypes complains:

ctypes.ArgumentError: argument 5: <type 'exceptions.TypeError'>:
 expected LP_c_double instance instead of numpy.ndarray

Strange. But what I did seems to work, so hopefully others who are confused will search on the error message and find this essay useful.

After I compiled the modified libtanimoto shared library (using 'scons') and updated the 'tanimoto.py' wrapper library I did a quick test:

>>> tanimoto.block_similarity("A", "AaB")
array([ 1.        ,  0.66666667,  0.33333333])
>>> 
>>>  [hex(ord(c)) for c in "AaB"]
['0x41', '0x61', '0x42']
>>>

In this case the query fingerprint is 8 bits long (the character 'A", which is the bit pattern 01000001) and the target fingerprint block contains three 8 bit fingerprints (01000001, 11000001 and 01000010). The result should be three Tanimoto scores, of values 1.0, 2/3 and 1/3 each. Which I got. It seems to work.

Not shown here is the step where I compared the results of the "block_similarity" function to one based on the older "similaity" function, to make sure they gave the same results.

Computing all Tanimoto scores using block_similarity

Once I was convinced I got the right answers, I wrote a "print_similarity2.py" function which works the same as "print_similarity.py" (it calculates and prints all of the Tanimoto scores for the given fingerprint file) except it uses the block-based similarity calculations.

# print_similarity2.py
import sys
import tanimoto

filename = "drugs.binary_fp"
if len(sys.argv) > 1:
    filename = sys.argv[1]

# Get the query fingerprint
f = open(filename, "rb")
query_fp = f.read(128)  # 1024 bits


f.seek(0)
while 1:
    target_fps = f.read(128*1024)   # 1024 fingerprints of 1024 bits each
    if not target_fps:
        # reached the end of file
        break
    for score in tanimoto.block_similarity(query_fp, target_fps):
        print score

This code takes about 4.1 seconds to run on my 'nci.binary_fp' data set:

% time python print_similarity2.py nci.binary_fp > /dev/null
3.571u 0.546s 0:04.12 99.7%     0+0k 0+33io 0pf+0w

which is quite a bit better than the 9.9 seconds of "print_similarity.py". Searching PubChem this way should take about 65 seconds, which is about 1/3rd of the speed I expect from C code.

The effects of block size

The code uses a tunable parameter, which is the number of fingerprints to include in a block. I timed a variety of block sizes to figure out what is optimal for my machine:

  number
  of fps    time
 in block  (in s)
      1    42.06
      2    23.41
      5    11.85
     10     8.31
     25     5.68
     50     4.82
    100     4.46
   1000     4.17
   5000     4.17
  10000     4.22
 100000     5.15

It looks like my chosen size of 1024 is a good one. It gives good performance while minimizing memory use. Your machine might be different. (Note that I used my computer to listen to music during this, which is why the numbers are slightly slower than 4.12 seconds.)

My data set is "only" 155,667,200 bytes long (149 MB). It easily fits into memory. I was curious on how much time could be saved if I didn't do any I/O, that is, if I read the entire data file into a string and only computed the time needed to run "block_similarity" on that string. Result: 3.30s, or an estimated PubChem search time of 52 seconds. About 0.9 seconds is being spent in Python startup time and reading the data file.

Why's my code so slow?

Almost all of my time is in C code, yet I'm still at 1/3rd the time I expect from the fastest possible C code. What's going on? There's still a few differences between my code and the "fast" code I wrote.

The fast code repeated the same calculations many times, which meant it didn't need to fetch data from disk and memory to the CPU.
My block_similarity wrapper code allocated memory for the results
I iterate over the results, and print them to /dev/null
My popcount algorithm uses a lookup table, not __builtin_popcount()

Of these, I'm guessing "print score" probably takes the longest. Each number is converted (using __str__) into a string and sent to stdout, only to be redirected to /dev/null. To test that guess I changed "print_similarity2.py" so the inner loop was:

    for score in tanimoto.block_similarity(query_fp, target_fp):
        #print score
        pass

I no longer print out the score. When I run the code I get:

% time python print_similarity2.py nci.binary_fp
1.301u 0.538s 0:01.84 99.4%     0+0k 0+0io 0pf+0w
%

4.12 to 1.84 seconds is quite a difference! The estimated time to compute the Tanimoto for all of PubChem (but not print out the results) is now 30 seconds, which is only 4 times slower than my expected fastest possible performance (using this approach) of 8 seconds.

Of course, if I'm not going to print the values then there's no need to iterate over the return values. This iteration is slower than usual in Python because the data is stored in Numpy array, which hold C doubles, not Python floats. The iteration has to convert each C double into a Python float, which also takes time.

I removed iteration from my code so the main loop becomes:

while 1:
    target_fp = f.read(128*4096)   # 1024 of 1024 bits
    if not target_fp:
        # reached the end of file
        break
    tanimoto.block_similarity(query_fp, target_fp)

and the total time for my test set is 1.6 seconds,

% time python print_similarity2.py nci.binary_fp
1.033u 0.536s 0:01.57 99.3%     0+0k 0+11io 0pf+0w

or an estimated 26 seconds for all of PubChem. I feel better now, as I wasn't sure why my code was so much slower than it should be. I could go back and rewrite my text so you would never see my confusion and wonderings about this, but where's the fun in that? In this way you can see some of how I got to where I did.

mmap - memory-mapped files

Earlier I mentioned that the block size was a tuneable parameter. I don't like tuneable parameters. I prefer self-tuning systems or using a better algorithm. It turns out there is another solution - a memory-mapped file. This is an operating system mechanism to treat a a file as a string in memory instead of using file operations like 'read()'. In theory I would mmap the file to a block of memory then pass the block directly to to the "block_similarity" function.

Sadly, it doesn't work as easily as I thought it would. Python has a mmap module which handles making a memory-mapped file, but the mmap object isn't compatible with the 'ctypes' interface. It seems that perhaps Python 2.6 makes this easier, but I'm not sure.

Instead I bypassed Python's mmap module and called the mmap function directly from libc, using ctypes. I ran into some difficulties, which ended up being because the "off_t" C type is a 64 bit integer, and not a 32 bit one like I expected. Getting the data types correct is one of the hard parts of using ctypes.

Here's the code for "print_similarity3.py". It's using the same libtanimoto:block_similarity function as "print_similarity2.py" but because the entire file is mapped to a string, I don't have to worry about reading a block at a time - I can pass the entire mmap'ed file to the function. (This will create in the PubChem case a 152 MByte Numpy array of doubles, which may be unexpected.)

# print_similarity3.py
import sys
import os
import mmap
import ctypes, ctypes.util
from ctypes import c_void_p, c_int, c_char, c_int64

import tanimoto

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.mmap.restype = c_void_p
libc.mmap.argtypes = (c_int, c_int, c_int, c_int, c_int, c_int64)

filename = "drugs.binary_fp"
if len(sys.argv) > 1:
    filename = sys.argv[1]

# Get the query fingerprint and reset
f = open(filename, "rb")
query_fp = f.read(128)  # 1024 bits
f.seek(0)

size = os.path.getsize(filename)

# mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);
ptr = libc.mmap(0, size, mmap.PROT_READ, mmap.MAP_SHARED,  f.fileno(), 0)

targets = ctypes.cast(ptr, ctypes.POINTER(c_char*size))[0]

#  Don't iterate and print the scores; that takes a long time
#  and I want to focus on the computation part.
#for score in tanimoto.block_similarity(query_fp, targets):
#    print score
tanimoto.block_similarity(query_fp, targets)

Getting timing numbers for this has been a bit tricky. The first time I ran this took a few tens of seconds as my Mac swapped pages around. I had a lot of apps using virtual memory. The second time it went faster. I've found in general that using mmap'ed files gives performance profiles that are harder to predict, so I tend to shy away from them. I also have read strange things about using them with big files (>1GB, or ast least > main memory) but I don't have the experience to judge.

On the plus side, here's the timings:

% time python print_similarity3.py nci.binary_fp
0.950u 0.318s 0:01.27 99.2%     0+0k 0+0io 0pf+0w

PubChem's fingerprint file would be 2.3 GB. If the mmap access scales to that file size then I can compute all its Tanimoto values in about 20 seconds, which is is impressively fast.

Python and NumPy startup costs

My reference C code could in theory calculate Tanimoto values for PubChem in 8 seconds so my Python program timings, even with mmap, seems rather slow. Then again, these aren't computing the same things. For one thing, by simply scaling the time I'm exaggerating the overhead of starting Python. That should not be affected by the number of compounds to compute.

I changed the last few lines of the "print_similarity3.py" file so I could see how much time was spent in running "block_similarity"

  ...
import time
t1 = time.time()
tanimoto.block_similarity(query_fp, targets)
t2 = time.time()
print >>sys.stderr, t2-t1

It was 0.90 seconds out of a total time of 1.28 seconds. That means 0.4 seconds is being spend on startup costs? That's unexpected. Of that, Python's startup time is 0.05 seconds

% time python -c 'a=1'
0.014u 0.040s 0:00.05 100.0%    0+0k 0+0io 0pf+0w

I added some print statements and calls to time.time() and found the time was spent in importing 'numpy'! You can see that in the following:

% time python -c 'import numpy'
0.141u 0.221s 0:00.37 97.2%     0+0k 0+0io 0pf+0w

For whatever reason, 'import numpy' also imports a huge number of modules, about half of which are submodules.

>>> import sys
>>> len(sys.modules)
28
>>> import numpy
>>> len(sys.modules)
225
>>> len([s for s in sorted(sys.modules) if 'numpy' in s])
127
>>>

Timing just 'libtanimoto.block_similarity' gave me 0.9 seconds. Of that, the difference between

    for (i=0; i<num_bytes; i++) {
      b += _popcount_counts[targets[i]];
      c += _popcount_counts[query[i]&targets[i]];
    }

and the code without the table lookup

	for (i=0; i<num_bytes; i++) {
      b += targets[i];
      c += query[i]&targets[i];
    }

is 0.88 seconds vs. 0.50 seconds (or estimated 14s for PubChem vs. 7.9s). Okay, I'm convinced now that a lot time is now being spent on the Tanimoto calculation.

I also tried replacing the "i<num_bytes" with "i<128" and it turns out that reduces the time from 0.88 seconds to 0.84 seconds (13 seconds for PubChem). I'll work on more performance optimizations in the next article.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me