
Computing Tanimoto scores, quickly

This is part 3 of an ongoing series dealing with molecular fingerprints:

  1. Molecular fingerprints, background
  2. Generating molecular fingerprints with OpenBabel
  3. Computing Tanimoto scores, quickly
  4. Block Tanimoto calculations
  5. Faster block Tanimoto calculations
  6. HAKMEM 169 and other popcount implementations
  7. Testing the bitslice algorithm for popcount
  8. 2011 update: Faster population counts


For now my driving example will be to print the Tanimoto values for each compound in a data set, in order. I get to control where the data comes from and the format that it's in, and I can precompute whatever I want about the target data set. I deliberately don't care about the identifiers, only the Tanimoto scores, so I can focus on the search algorithm. In this essay I'll focus on computing the Tanimoto between two fingerprints. The next one will compute the Tanimoto between a single query fingerprint and a large number of targets.
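
As a reminder, the Tanimoto I'm computing is the Daylight-style one: if a is the number of bits set in the query fingerprint, b is the number set in the target, and c is the number set in both, then the score is c/(a+b-c).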

Converting Babel's "fpt" format to "binary_fp" format

I'll start by converting the hex fingerprints from the "fpt" file into a new binary file format, and I'll assume that there are 1024 bits in the fingerprint. This new format will be very simple: each fingerprint takes up 1024/8 = 128 bytes, and the fingerprints are stored consecutively, in the same order as the input file. This is not a good format for general use; I'll talk about why later. Because it's a bad format, I'm using an ugly and long filename extension. Don't use this format for real work!
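
One consequence of the fixed record size (and the reason I can get away with having no separators) is that getting at any one fingerprint is just a seek. Here's a minimal sketch; the "read_fingerprint" helper is only for illustration and isn't used elsewhere in this essay:

# Fixed-size records allow direct random access.
def read_fingerprint(infile, i, num_bytes=1024//8):
    infile.seek(i * num_bytes)
    return infile.read(num_bytes)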

I decided to use Python for this conversion rather than C because while my eventual goal is to do fast Tanimoto searches, this is a preprocessing step where the performance is not critical. Plus, I usually don't like parsing in C.

Parsing the "fpt" format turns out to be a bit of a pain, in almost the same way that parsing a FASTA file is a pain. There only way to know you are done with one record is to read the first line of the next record, or get to the end of the file. I read the header line (which is different for the first record vs. the other records in the file) and merge the hex lines into a single hex fingerprint string.

(Earlier I made it easier for myself by assuming I could read the entire file into memory, but because I'm going to parse a large fingerprint file soon, I can't make that assumption now.)

After I have the fingerprint as a string of hex digits I need to convert it into the raw byte values. I thought this would be tedious until I remembered there's a "hex" encoding in Python which does that for me, and which makes this step trivial.

>>> "Hello!".encode("hex")
'48656c6c6f21'
>>> "48656c6c6f21".decode("hex")
'Hello!'
>>> 

Here's the conversion code, which I named "fpt_to_binary_fp.py" for boringly obvious reasons.

# fpt_to_binary_fp.py
# Convert from an OpenBabel "fpt" format with hex fingerprints (use
# -xf to generate the fingerprints) into a binary fingerprint file.

# Fingerprints are written consecutively, with no title information or
# separator between the fingerprints.  This is a very raw format and
# should not be used for anything other than benchmarking and
# bootstrapping into a better format.

# Parsing formats with no explicit end-of-record marker is a nuisance.
def read_records(infile):
    # Parse the header line for the first record, which looks like:
    #   >Strychnine   183 bits set 
    # I'm not splitting on the " " because the title might have a space.
    # Using a regular expression proved a bit too cumbersome.
    line = infile.readline()
    assert line[0] == ">"
    # Turn ">Strychnine   183 bits set " into
    #     [">Strychnine", "183", "bits", "set"]
    words = line.rsplit(None, 3)
    assert words[-2:] == ["bits", "set"]
    title = words[0][1:]

    while 1:
        
        hex_blocks = []
        for line in infile:
            if line.startswith("Possible"):  # "Possible superstructure"
                continue
            if line[:1] == ">":
                break
            hex_blocks.append("".join(line.split()))
        else:
            # End of input
            yield title, "".join(hex_blocks)
            break

        # "line" contains a title line for the next record
        yield title, "".join(hex_blocks)

        # Parse the header line for the other records, which look like:
        #   >cocaine   Tanimoto from Strychnine = 0.353234 
        assert line[0] == ">"
        i = line.rindex("Tanimoto from")
        title = line[1:i].strip()
    
import sys, os

input_filename = sys.argv[1]
output_filename = os.path.splitext(input_filename)[0] + ".binary_fp"

infile = open(input_filename)
outfile = open(output_filename, "wb")
for title, hexdata in read_records(infile):
    #print repr(title)
    outfile.write(hexdata.decode("hex"))

Does it work? I'll convert the "drugs.fpt" file and compare its contents with the newly created "drugs.binary_fp" file.

% python fpt_to_binary_fp.py drugs.fpt
% head drugs.fpt
>Strychnine   183 bits set 
00054062 81009600 00100102 81010700 200c0000 00202850 
03102000 28000404 10200882 001849e4 483c0024 50039002 
1402801a 01b90000 00020540 010c100a 22a4c000 82000300 
2ac02018 00002201 60102120 183c9630 2100a080 81500019 
00041401 80008c00 00484810 90001080 040c8202 0006081b 
24020080 a2042610 
>cocaine   Tanimoto from Strychnine = 0.353234
00010000 01000800 00100100 00010700 000c0050 00204010 
00000000 20000000 00000100 010008c0 00200000 50019000 
% od -t xC drugs.binary_fp | head
0000000    00  05  40  62  81  00  96  00  00  10  01  02  81  01  07  00
0000020    20  0c  00  00  00  20  28  50  03  10  20  00  28  00  04  04
0000040    10  20  08  82  00  18  49  e4  48  3c  00  24  50  03  90  02
0000060    14  02  80  1a  01  b9  00  00  00  02  05  40  01  0c  10  0a
0000100    22  a4  c0  00  82  00  03  00  2a  c0  20  18  00  00  22  01
0000120    60  10  21  20  18  3c  96  30  21  00  a0  80  81  50  00  19
0000140    00  04  14  01  80  00  8c  00  00  48  48  10  90  00  10  80
0000160    04  0c  82  02  00  06  08  1b  24  02  00  80  a2  04  26  10
0000200    00  01  00  00  01  00  08  00  00  10  01  00  00  01  07  00
0000220    00  0c  00  50  00  20  40  10  00  00  00  00  20  00  00  00
% 
The "od -t cH" displays the hex values for the bytes of the "drugs.binary_fp" file. You can see that the output file is simply the input hex digits, expressed directly in binary. (You might have to stare at it for a bit.) So yes, it does work.

Computing the popcount using only Python

Onwards to computing the Tanimoto scores! This will require computing the popcount of the intersection of two fingerprints. Normal Python is really quite bad at that. I could convert the 1024 bit/128 byte fingerprint into a Python integer (like I did earlier). Computing the bit intersection between two integers is easy with the '&' operator, but finding the popcount is a bit complicated. (See below for elaboration.)

I could keep everything as a byte string. Finding the popcount of the fingerprint is the same as the sum of the popcount of each byte. There are only 256 possible bytes, so that's a small lookup table. The problem is computing the bits in common to both fingerprints. Python doesn't support bit operations on characters, so I would have to convert the bytes to a number before doing the &. This is completely doable, but there's a somewhat better solution.
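
To make that "completely doable" version concrete, here's what it looks like as a sketch. It uses the 256-entry popcount_in_byte table I define just below, and I didn't benchmark this variant:

# Convert each character to an integer with ord() before the "&".
def common_popcount(fp1, fp2):
    return sum(popcount_in_byte[ord(c1) & ord(c2)]
               for c1, c2 in zip(fp1, fp2))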

I can use the 'array' module and create a byte array. This is a seldom-used Python module, but it seems just the thing for this case. (I'll use the typecode "B", for unsigned bytes, so the values range from 0 to 255 and can index a 256-entry lookup table directly.)

>>> import array
>>> array.array("B", "This is a string")
array('B', [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103])
>>> 

This converted the input string into an array of bytes, stored as integers. To compute the total fingerprint popcount I'll sum the popcount for each byte in the fingerprint. I can get that by looking it up in a table of precomputed values for each possible byte. To compute the number of bits in common between the two fingerprints I'll compute the sum of the number of bits in common between the corresponding bytes in the fingerprints.

import array

# Create a lookup table from byte value to 'on' bit count
# chr(0) has 0 'on' bits
# chr(1) has 1 'on' bit
# chr(2) has 1 'on' bit
# chr(3) has 2 'on' bits
#  ...
# chr(255) has 8 'on' bits
popcount_in_byte = (
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
    )
fp = array.array('B', [84, 104, 105, 115, 32, 105, 115, 32,
                       97, 32, 115, 116, 114, 105, 110, 103])
print "popcount =", sum(popcount_in_byte[byte] for byte in fp)

If you copy and paste that into a Python terminal (or save it and run it), you'll get:

popcount = 57

which is the number of 'on' bits in the string "This is a string". To verify that, here's a completely different (and slower) method based on simple bit counting:
>>> "This is a string".encode("hex")
'54686973206973206120737472696e67'
>>> int("This is a string".encode("hex"), 16)
112197289293498629157862941796076187239L
>>> x = int("This is a string".encode("hex"), 16)
>>> popcount = 0
>>> while x:
...   if x & 1:
...     popcount += 1
...   x >>= 1
... 
>>> popcount
57
>>> 
The two methods agree, so it's very likely that the code works, although there could still be mistakes in the lookup table.
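
If I were worried about the table, it's easy to cross-check by regenerating it with the same sort of bit-by-bit counting; this check is an aside and not part of the final code:

# Rebuild the 256-entry table using simple bit counting and
# compare it against the hand-written version.
computed = []
for i in range(256):
    n = 0
    x = i
    while x:
        if x & 1:
            n += 1
        x >>= 1
    computed.append(n)
assert tuple(computed) == popcount_in_byte, "table mismatch"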

Computing the Tanimoto using only Python

Now that I have a popcount it's easy to compute the Tanimoto. In the following program I'll print the Tanimoto for every fingerprint in the specified binary_fp file, using the first fingerprint as the query fingerprint. I'll include computing the Tanimoto of the query fingerprint against itself.

# compute_tanimoto.py
import sys
import array

# Create a lookup table from byte value to 'on' bit count
# chr(0) has 0 'on' bits
# chr(1) has 1 'on' bit
# chr(2) has 1 'on' bit
# chr(3) has 2 'on' bits
#  ...
# chr(255) has 8 'on' bits
popcount_in_byte = (
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
    )


def tanimoto(query, target):
    a = b = c = 0
    for query_byte, target_byte in zip(query, target):
        a += popcount_in_byte[query_byte]
        b += popcount_in_byte[target_byte]
        c += popcount_in_byte[query_byte & target_byte]
    # If neither fingerprint has any set bits then a == b == c == 0
    # and the equation would compute 0/0, raising a ZeroDivisionError.
    # I'll define the Tanimoto to be 0 in that case.  Testing only c
    # is enough: when c == 0 but a+b > 0, the answer is 0 anyway.
    if not c:
        return 0
    return float(c)/(a+b-c)

if len(sys.argv) > 1:
    filename = sys.argv[1]
else:
    filename = "drugs.binary_fp"

# Use the first fingerprint as the query
infile = open(filename, "rb")
s = infile.read(1024//8)  # 128 bytes in a fingerprint
query = array.array("B", s)  # "B" = unsigned bytes, values 0-255

# Reopen and compute the Tanimoto against all fingerprints
# including itself.
infile = open(filename, "rb")

while 1:
    s = infile.read(1024//8)
    if not s:
        # End of file
        break

    target = array.array("B", s)
    print tanimoto(query, target)

The output from this program on the previously created "drugs.binary_fp" file is:

% python compute_tanimoto.py drugs.binary_fp
1.0
0.353233830846
0.330316742081
0.401486988848
0.332046332046
0.303571428571
0.317406143345
0.22009569378
0.124444444444
0.1
%

How do I verify those are correct? OpenBabel also computed the Tanimoto scores when it generated the ".fpt" file. You can go back and compare (as I did) and see that they agree. The only difference is that I'm printing the results to double precision while OpenBabel prints them to single precision. Since the denominator is at most 1024, single precision is more than accurate enough. In fact, double precision implies an accuracy that isn't warranted.
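
For example, Python's "%.6f" formatting reproduces the value OpenBabel printed for cocaine:

>>> print "%.6f" % 0.353233830846
0.353234
>>> 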

Can I do this faster without writing C code?

There are ways to make the popcount code go faster using standard Python or available 3rd party extensions. I'll review some of them but won't use them outside this section.

The standard Python distribution is not a good platform for doing fast Tanimoto calculations. There is no builtin function for it, and while I can build one in several ways, they are all relatively slow. That's why I'll shortly develop a C extension for doing this calculation. Before doing that, just what other options exist?

I could convert the 1024 bit/128 byte fingerprint into a Python integer. The first problem is converting the raw fingerprint bytes into an integer. I don't know a direct way, but int(s.encode("hex"), 16) works. Computing the bit intersection is simple - just use "&". The problem is computing the popcount.
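
For example, here's the conversion for a four-byte string (the first four bytes of the strychnine fingerprint):

>>> int("\x00\x05\x40\x62".encode("hex"), 16)
344162
>>> hex(344162)
'0x54062'
>>> 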

The simple popcount algorithm is to see if the number has the 1 bit set, then the 2 bit, then the 4 bit, ...

# This computes about 14,000 popcounts per second
all_bits = [1<<bitno for bitno in range(1024)]

def popcount_1024(x):
    n = 0
    for bit in all_bits:
        if x & bit:
            n += 1
    return n

That's really quite slow. I could instead use a clever bit-manipulation algorithm, which I extended from the Wikipedia article on Hamming weight. Here it is:
# Compute the popcount of a Python long, up to 1024 bits.
# My MacBook Pro only does about 47,000 per second.

m1 = int("0x" + "5" * (1024//4), 16)
m2 = int("0x" + "3" * (1024//4), 16)
m3 = int("0x" + "0f" * (1024//8), 16)
m4 = int("0x" + "00ff" * (1024//16), 16)
m5 = int("0x" + "0000ffff" * (1024//32), 16)
m6 = int("0x" + ("0"*8+"f"*8) * (1024//64), 16)
m7 = int("0x" + ("0"*16+"f"*16) * (1024//128), 16)
m8 = int("0x" + ("0"*32+"f"*32) * (1024//256), 16)
m9 = int("0x" + ("0"*64+"f"*64) * (1024//512), 16)
m10 = int("0x" + ("0"*128+"f"*128), 16)
    

def popcount_1024(x):
    x = (x & m1 ) + ((x >>   1) & m1 )
    x = (x & m2 ) + ((x >>   2) & m2 )
    x = (x & m3 ) + ((x >>   4) & m3 )
    x = (x & m4 ) + ((x >>   8) & m4 )
    x = (x & m5 ) + ((x >>  16) & m5 )
    x = (x & m6 ) + ((x >>  32) & m6 )
    x = (x & m7 ) + ((x >>  64) & m7 )
    x = (x & m8 ) + ((x >> 128) & m8 )
    x = (x & m9 ) + ((x >> 256) & m9 )
    x = (x & m10) + ((x >> 512) & m10)
    return x

By being more clever I could make this a bit faster, but I'm not even going to try.
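
As a sanity check, the bit-twiddling version gets the same answer as before for "This is a string" (the trailing "L" is because the masks are Python longs, so the result is one too):

>>> popcount_1024(int("This is a string".encode("hex"), 16))
57L
>>> 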

During proofreading I remembered another way to make the popcount of the ANDed fingerprints faster - I could use 16-bit integers for everything and look up the popcounts in a 65,536-entry table. This doesn't work that well in plain Python: the array module does have a 16-bit typecode ("H"), but indexing the lookup table still happens one element at a time in a Python loop, which is where the time goes. For a vectorized table lookup you would have to use one of the NumPy/Numeric/numarray packages.
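
Here's a sketch of that idea using NumPy. I haven't benchmarked it, so treat it as an illustration of the vectorized lookup rather than a measured result:

import numpy as np

# Build the 65,536-entry popcount table once:
#   popcount(i) = popcount(i >> 1) + (lowest bit of i)
table16 = np.zeros(2**16, dtype=np.uint8)
for i in xrange(1, 2**16):
    table16[i] = table16[i >> 1] + (i & 1)

def common_popcount(fp1, fp2):
    # View each 128-byte fingerprint string as 64 16-bit words,
    # AND the words together, and sum the table lookups.
    w1 = np.frombuffer(fp1, dtype=np.uint16)
    w2 = np.frombuffer(fp2, dtype=np.uint16)
    return int(table16[w1 & w2].sum())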

Another way to compute the popcount of an integer is to use Scott David Daniels' 'bits' module. It's a C extension and is quite fast. I used it a couple of years ago when I was doing some experiments with fingerprint searches.

If you're going to use a third-party C extension then you might instead consider using GMPY. It's based on the GNU Multiple Precision Library and includes a popcount function. Digging in the source code I see that the popcount function is implemented in hand-crafted assembly for different platforms.
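
From memory (so check the gmpy documentation for the details), using it looks about like this:

>>> import gmpy
>>> gmpy.popcount(2**1024 - 1)
1024
>>> 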

Yet another possibility for computing the popcount of a fingerprint is to convert the integer into a hex string, or generate a binary pickle string, and work with the resulting string using the same lookup table approach I used earlier. This has the advantage of working with any length of Python integer, but has a lot of overhead.

# Compute the popcount of a Python long
# For 1024 bit fingerprints my MacBook Pro does about 
# 24,000 per second.

popcount_lookup_table = {}
for i, n in enumerate( (
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
    ) ):
    popcount_lookup_table[chr(i)] = n

# "pickle" does 15,000/second
# "cPickle" does 24,000/second
import cPickle as pickle

def popcount(x):
    # A protocol-2 pickle of a long this size is a 2-byte protocol
    # header, the LONG1 opcode, a 1-byte length, the little-endian
    # payload bytes, and a trailing '.'.  The slice keeps the payload.
    s = pickle.dumps(x, 2)[4:-1]
    L = popcount_lookup_table
    return sum(L[c] for c in s)

Or I could stick with the raw byte string. The hard part there was computing the number of bits in common between two characters. The easiest solution is to convert the characters into integers, do the bitwise-and, and convert back to a character, but that's a lot of Python function calls for something that's a single assembly instruction. I could perhaps use a lookup table where the key is the pair (query_byte, target_byte) and the value is the number of bits in common. I don't like that solution. For one, it requires 256*256 = 65,536 entries.
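
For completeness, here's what that pair table would look like as a sketch, reusing the popcount_in_byte table from earlier; I didn't benchmark it:

# Precompute the number of bits in common for every pair of bytes:
# 65,536 entries, keyed by a (query_char, target_char) tuple.
pair_popcount = {}
for i in range(256):
    for j in range(256):
        pair_popcount[(chr(i), chr(j))] = popcount_in_byte[i & j]

def common_popcount(fp1, fp2):
    # fp1 and fp2 are raw byte strings of the same length.
    return sum(pair_popcount[pair] for pair in zip(fp1, fp2))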

Computing the Tanimoto in C, using GCC extensions

Computing the Tanimoto score is much easier in C, and even easier with gcc because the latter includes a built-in function for computing the popcount of an integer, a long, and a long long. I'll use gcc for now so I don't need to define a popcount just yet. (I'll do that in the next section.)

In the following I assume that integers are 32 bits in length, so there are 32 integer parts to a 1024 bit fingerprint. I'll write each unsigned integer in hex, so the number starts with "0x" (to mean "hex") and ends with "u" (to mean "unsigned").

/* simple_tanimoto.c */
#include <stdio.h>

/* This uses a non-portable builtin function for gcc */
#define popcount(x) __builtin_popcount(x)

unsigned int query[] = {
  /* >Strychnine   183 bits set  */
  0x00054062u, 0x81009600u, 0x00100102u, 0x81010700u, 0x200c0000u, 0x00202850u,
  0x03102000u, 0x28000404u, 0x10200882u, 0x001849e4u, 0x483c0024u, 0x50039002u,
  0x1402801au, 0x01b90000u, 0x00020540u, 0x010c100au, 0x22a4c000u, 0x82000300u,
  0x2ac02018u, 0x00002201u, 0x60102120u, 0x183c9630u, 0x2100a080u, 0x81500019u,
  0x00041401u, 0x80008c00u, 0x00484810u, 0x90001080u, 0x040c8202u, 0x0006081bu,
  0x24020080u, 0xa2042610u};

unsigned int target[] = {
  /* >cocaine   Tanimoto from Strychnine = 0.353234 */
  0x00010000u, 0x01000800u, 0x00100100u, 0x00010700u, 0x000c0050u, 0x00204010u,
  0x00000000u, 0x20000000u, 0x00000100u, 0x010008c0u, 0x00200000u, 0x50019000u,
  0x1c00801au, 0x01890000u, 0x00000000u, 0x02081008u, 0x00a20000u, 0x02000000u,
  0x01c02010u, 0x00000201u, 0x00000020u, 0x18020a30u, 0x01000080u, 0x00500014u,
  0x00001401u, 0x80002000u, 0x00088000u, 0x10001000u, 0x00000200u, 0x0006000eu,
  0xc0020000u, 0x02002200u};

int main(void) {
  int A=0, B=0, c=0;
  int i;
  for (i=0; i<32; i++) {
    /*printf("  %lx %lx\n", fp1[i], fp2[i]);*/
    A += popcount(query[i]);
    B += popcount(target[i]);
    c += popcount(query[i]&target[i]);
  }
  /* Should print "183 89 71 -> 0.353234" */
  printf("%d %d %d -> %f\n", A, B, c, ((double)c)/(A+B-c));
  return 0;
}

I'll compile and run it:
% cc simple_tanimoto.c -O3 -o simple_tanimoto
% ./simple_tanimoto
183 89 71 -> 0.353234
%
so you can see that it gave the same Tanimoto score that OpenBabel computed.

To me this bit of C is much easier to understand than the same Python code. Part of the reason is that I'm using gcc's __builtin_popcount, but the real reason is that in C things like byte arrays, integer arrays, and the bytes and bits in an integer are very fluid; it's easy to go from one form to another. Even if I have to implement the bit counting myself, as in the following section, I can compute over 1 million Tanimoto similarities per second, which is more than 20 times faster than my fastest Python code. Note though that this computes the score between the same pair over and over. Computing Tanimoto scores for a large number of targets will be slower because bringing that data to the CPU takes additional time, but then there are other ways to improve the performance. I'll talk about this in a bit.

Computing the Tanimoto in C, without GCC extensions

Here's the equivalent C code using an 8-bit lookup table to compute the popcount for a 32 bit integer.

/* simple_tanimoto2.c */
#include <stdio.h>

/* Implement popcount(unsigned int) myself. */
/* Source based on http://popcnt.org/2007/09/magic-popcount-popcnt-command.html */
const int _popcount_counts[] = {
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};
/* This assumes that 'unsigned int' is 32 bits */
int popcount(unsigned int x) {
  return (_popcount_counts[x       & 0xff] +
          _popcount_counts[x >>  8 & 0xff] +
          _popcount_counts[x >> 16 & 0xff] +
          _popcount_counts[x >> 24 & 0xff]);
}

unsigned int query[] = {
  /* Strychnine   183 bits set */
  0x00054062u, 0x81009600u, 0x00100102u, 0x81010700u, 0x200c0000u, 0x00202850u,
  0x03102000u, 0x28000404u, 0x10200882u, 0x001849e4u, 0x483c0024u, 0x50039002u,
  0x1402801au, 0x01b90000u, 0x00020540u, 0x010c100au, 0x22a4c000u, 0x82000300u,
  0x2ac02018u, 0x00002201u, 0x60102120u, 0x183c9630u, 0x2100a080u, 0x81500019u,
  0x00041401u, 0x80008c00u, 0x00484810u, 0x90001080u, 0x040c8202u, 0x0006081bu,
  0x24020080u, 0xa2042610u};

unsigned int target[] = {
  /* cocaine  Tanimoto from Strychnine = 0.353234 */
  0x00010000u, 0x01000800u, 0x00100100u, 0x00010700u, 0x000c0050u, 0x00204010u,
  0x00000000u, 0x20000000u, 0x00000100u, 0x010008c0u, 0x00200000u, 0x50019000u,
  0x1c00801au, 0x01890000u, 0x00000000u, 0x02081008u, 0x00a20000u, 0x02000000u,
  0x01c02010u, 0x00000201u, 0x00000020u, 0x18020a30u, 0x01000080u, 0x00500014u,
  0x00001401u, 0x80002000u, 0x00088000u, 0x10001000u, 0x00000200u, 0x0006000eu,
  0xc0020000u, 0x02002200u};

int main(void) {
  int A=0, B=0, c=0;
  int i;
  for (i=0; i<sizeof(query)/sizeof(unsigned int); i++) {
    /*printf("  %lx %lx\n", fp1[i], fp2[i]);*/
    A += popcount(query[i]);
    B += popcount(target[i]);
    c += popcount(query[i]&target[i]);
  }
  /* Should print "183 89 71 -> 0.353234" */
  printf("%d %d %d -> %f\n", A, B, c, ((double)c)/(A+B-c));
  return 0;
}

Computing the pairwise Tanimoto in C is fast. I put the Tanimoto calculation in a loop and calculated it 100,000,000 times, which took 42 seconds - about 2.4 million calculations per second. The last time I looked, PubChem contained 19,218,991 compounds. Assuming everything is available in fast memory, I should be able to search it in about 8 seconds using a C program, or about 7 minutes using Python. Which would you prefer?

This is interesting. Doing the popcount using a lookup table took 43 seconds. Using __builtin_popcount took 109 seconds. I was not expecting that.

Computing the Tanimoto in Python via a C extension

I prefer the speed of C but the general ease of programming using Python. (That's a sort of "have my cake and eat it too" answer.) Using Python for Tanimoto searches is slow in part because there is no fast way to compute the popcount. I can fix that by writing a Python extension using C. Rather than add only the popcount I'll have the extension compute the entire Tanimoto between two fingerprints.

The easiest way to pass fingerprints around in Python is as a string, and more specifically a byte string (as compared to a Unicode string). I want my Python code to work like:

import tanimoto
query = "... 128 bytes ..."
target = "... 128 bytes ..."
print "Tanimoto score is", tanimoto.similarity(query, target)

There are many ways to do this. Some ways I'm not going to use now are: writing a C extension directly, or using Boost.Python, SWIG, or Pyrex and its offshoot Cython. Instead, I'll use ctypes.

Ctypes is a "foreign function interface" ("FFI") package for Python. It lets Python load a shared library and call C functions directly. You still have to write some interface code to tell ctypes how to call the function, and perhaps write some extra error checking code, but not much.

I find ctypes to be the easiest way to get Python to talk to C code, but there are a few downsides. Calling a function through ctypes has a bit more overhead (it's a bit slower) than using a special-purpose compiled interface like the other solutions use. It's also very easy to crash Python using ctypes.

But the advantage is that I can write C code without worrying about writing the Python/C interface in C. I do have to worry, but I'll worry about it from Python. Here's a simple C library to compute the Tanimoto similarity between two fingerprints as stored in unsigned byte strings.

/* libtanimoto.c */
static const int _popcount_counts[] = {
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};

double similarity(int num_bytes,
                  unsigned char *query, unsigned char *target) {
  int a=0, b=0, c=0, i;
  for (i=0; i<num_bytes; i++) {
    a += _popcount_counts[query[i]];
    b += _popcount_counts[target[i]];
    c += _popcount_counts[query[i]&target[i]];
  }
  return ((double)c)/(a+b-c);
}

Very simple, yes? Next, I need to compile it. Again there are many possible solutions. I'll use SCons, which is one of several replacements for make. Instead of using a "Makefile" it uses a configuration file named "SConstruct". The configuration language is Python, extended with functions that know how to build binaries, shared libraries, and other targets. This is very nice because building shared libraries across multiple platforms is a nasty mess and I don't want to do it myself.

Here's my 'SConstruct' file:

# SConstruct
SharedLibrary("libtanimoto", ["libtanimoto.c"],
              CFLAGS="-O3")

I compiled it using the "scons" command:
% scons
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
gcc -o libtanimoto.os -c -O3 -fPIC libtanimoto.c
gcc -o libtanimoto.dylib -dynamiclib libtanimoto.os
scons: done building targets.

The result is a "libtanimoto.dylib", which is what Macs use as a run-time loadable shared library. Linux and other Unix OSes use the extension ".so" and MS Windows uses ".dll".

Using ctypes to access the shared library

To load the function in ctypes I first load the shared library (remember, the extension depends on which OS you are using - 'find_library' handles that detail) then get the function from the module. I then have to annotate the function so ctypes knows how to convert the Python call arguments to the right C types and how to convert the C result into Python. Here's an example of me doing that by hand, with a fingerprint of 4 bytes:

>>> import ctypes, ctypes.util
>>> libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))
>>> libtanimoto.similarity
<_FuncPtr object at 0x88c00>
>>> libtanimoto.similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p] 
>>> libtanimoto.similarity.restype = ctypes.c_double
>>> libtanimoto.similarity(4, "\0\0\0\0", "\0\0\0\0")
nan
>>> libtanimoto.similarity(4, "\0\0\0\0", "\0\0\0\1")
0.0
>>> libtanimoto.similarity(4, "\0\0\0\1", "\0\0\0\1")
1.0
>>> libtanimoto.similarity(4, "\0\0\0\1", "\0\0\0\3")
0.5
>>> 
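
By the way, this shows why I said it's easy to crash Python with ctypes. Nothing checks that the length argument matches the actual string lengths, so a call like the following reads far past the ends of the 4-byte strings; it may return garbage or take down the interpreter:

# Don't do this: the length is a lie, so the C code reads
# roughly 100 MB past the ends of the 4-byte strings.
libtanimoto.similarity(10**8, "\0\0\0\0", "\0\0\0\0")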
If I know the number of bytes in the fingerprint then I can call this interface directly. Most people would prefer a simpler solution and not have to worry about passing in the length every time. After all, Python strings already know their length.

Making a simplified interface

What I'll do is create a new Python module in the file "tanimoto.py". This will load the shared library and do the ctypes annotation. Once done I'll define a "high-level" function also called "similarity" which gets the fingerprint lengths, checks that they are the same, and calls the underlying C similarity implementation.

# tanimoto.py - wrapper to the libtanimoto shared library

import ctypes, ctypes.util

# Load the shared library (the extension depends on the OS type)
libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))

# Annotate the function call argument and result types
_similarity = libtanimoto.similarity
_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
_similarity.restype = ctypes.c_double

# Define the high-level public interface as a Python function
def similarity(query, target):
    n = len(query)
    m = len(target)
    if n != m:
        raise AssertionError("fingerprints must be the same length")
    return _similarity(n, query, target)

In action:

>>> import tanimoto
>>> tanimoto.similarity("Andrew", "andrew")
0.95999997854232788
>>> tanimoto.similarity("Andrew", "ANDREW")
0.79166668653488159
>>> tanimoto.similarity("Andrew", "123456")
0.40625
>>> tanimoto.similarity("Andrew", "13456")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "tanimoto.py", line 23, in similarity
    m = len(target)
AssertionError: fingerprints must be the same length
>>> 

Looks good. But does it give the right answer? I'll give it a test case where I know the right answer and see what it says:
>>> s1 = "".join("""
... 00054062 81009600 00100102 81010700 200c0000 00202850 
... 03102000 28000404 10200882 001849e4 483c0024 50039002 
... 1402801a 01b90000 00020540 010c100a 22a4c000 82000300 
... 2ac02018 00002201 60102120 183c9630 2100a080 81500019 
... 00041401 80008c00 00484810 90001080 040c8202 0006081b 
... 24020080 a2042610 
... """.split()).decode("hex")
>>> s2 = "".join("""
... 00010000 01000800 00100100 00010700 000c0050 00204010 
... 00000000 20000000 00000100 010008c0 00200000 50019000 
... 1c00801a 01890000 00000000 02081008 00a20000 02000000 
... 01c02010 00000201 00000020 18020a30 01000080 00500014 
... 00001401 80002000 00088000 10001000 00000200 0006000e 
... c0020000 02002200 
... """.split()).decode("hex")
>>> 
>>> import tanimoto
>>> tanimoto.similarity(s1, s2)
0.35323384404182434
>>> 

Sweet! The right answer! Though we're displaying it to double precision while OpenBabel only reports single precision. That's fine - there are only 1024 bits so the resolution of the answer can't be better than 1/1024.

The high-level Python wrapper interface is easier to use than calling the C interface exposed by ctypes. This will usually be the case. C and Python have different world views. C libraries tend to optimize for performance and require that programmers think more about the details, while Python libraries tend to be safer and easier to use, but slower. I'm hoping to show that it's possible to get both by mixing the two together.

Performance timings for the C extension

I also did some performance timings. The public high-level interface can compute about 340,000 Tanimoto scores (on 1024-bit fingerprints) per second. That's about 15% of the speed of the native C code, but about 7 times faster than my pure Python solution. Remember, the high-level interface does extra work and then calls the low-level ctypes function. If I remove that extra work then I can do about 440,000 calculations per second.

Here's how I got those numbers:

# Time the tanimoto.similarity function calls
import tanimoto

import time
def go(n):
  range_n = range(n)
  s = "\1"*128
  
  similarity = tanimoto.similarity
  t1 = time.time()
  for i in range_n:
    similarity(s, s)
  t2 = time.time()
  print "Using public interface", float(n)/(t2-t1), "calls/sec"
  
  _similarity = tanimoto._similarity
  t1 = time.time()
  for i in range_n:
    _similarity(128, s, s)
  t2 = time.time()
  print "Using ctypes interface", float(n)/(t2-t1), "calls/sec"

>>> go(100000)
Using public interface 341912.926942 calls/sec
Using ctypes interface 446837.633061 calls/sec
>>> go(1000000)
Using public interface 340170.861425 calls/sec
Using ctypes interface 443564.430149 calls/sec
>>> 

One nice thing about having a benchmark in place is that I can test alternate implementations and see if I get better performance. My Tanimoto calculation uses fingerprint bit counts 'a', 'b', and 'c', based on the Daylight usage. This has general applicability because other similarity measures can be described as a function of those variables. But the Tanimoto has a simpler form: the number of on bits in the bitwise-and of the two fingerprints, divided by the number of on bits in their bitwise-or. It's succinctly written as:
double similarity(int num_bytes,
                  unsigned char *query, unsigned char *target) {
  int and_count=0, or_count=0, i;
  for (i=0; i<num_bytes; i++) {
    or_count  += _popcount_counts[query[i] | target[i]];
    and_count += _popcount_counts[query[i] & target[i]];
  }
  return ((double)and_count)/or_count;
}

Doing one bitwise-or is cheaper than doing a third popcount lookup. My benchmark became about 1% to 2% faster with this variation. That's almost in the noise, but useful to note if you want to eke out a bit more performance.
