Diary RSS | All of Andrew's writings | Diary archive
Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.Read 0 bytes, run out of memory #
I've been working on a file format for chemical fingerprints, influenced by the PNG file format. To make sure I'm doing it right, I wrote a program to dump blocks from PNG files. I made a mistake and my program gave a MemoryError. How did that happen when my test file is only a few K long?
I tracked it down. I don't know if it's a bug. Here's something for you all to ponder over:
BLOCKSIZE = 10*1024*1024
f=open("empty.txt", "w")
f.close()
f=open("empty.txt")
data = []
for i in range(10000):
s = f.read(BLOCKSIZE)
assert len(s) == 0
data.append(s)
That's an empty file, so the read() must return empty strings. The
assert statement verifies that that's the case. But when I run it I get:
Python(18996) malloc: *** vm_allocate(size=10489856) failed (error code=3)
Python(18996) malloc: *** error: can't allocate region
Python(18996) malloc: *** set a breakpoint in szone_error to debug
Traceback (most recent call last):
File "mem_fill.py", line 9, in <module>
s = f.read(BLOCKSIZE)
MemoryError
The reason why is in the C implementation of read
(Objects/fileobject.c). The relevant line i:
v = PyString_FromStringAndSize((char *)NULL, buffersize);
That preallocates space assuming the requested read size will be
correct. In my example code it preallocates 10MB of space even though
the result is 0 bytes long. Since I keep the result around, all of
the preallocated space is also kept around. Repeat that 10,000 times
and my machine quickly runs out of memory. So will yours.
Bug in Python? Correct behavior? You decide. Feel free to make comments if you wish.
Update 9 August: I submitted this as issue3531 in the Python bug tracker. Antoine Pitrou pointed out that there's a string resize at the end of the function, so my assumption on the source of the problem was wrong. I dug into it some more and tracked the problem down to obmalloc.c inside of PyObject_Realloc. The resize ends up calling the C function 'realloc', and it didn't seem to be reallocing.
With those clues I did some searching and found Bob Ippolito's blog post realloc.. doesn't. "Apparently, Darwin's implementation of realloc never frees memory if you ask for a smaller size." There's a thread about this from January 2005 on python-dev titled: "Darwin's realloc(...) implementation never shrinks allocations." There's additional discussion with issue1092502.
I've been looking through various libraries to see if this causes a problem in real life. A denial of service attack is possible but only for formats where the data describes how many bytes to read. The PNG format is one of many. The reader code has to trust that number and use it blindly, but any hardened reader knows those values are under suspicion. For example, the PNG reader in PIL uses ImageFile.py:_safe_read:
# Reads large blocks in a safe way. Unlike fp.read(n), this function
# doesn't trust the user. If the requested size is larger than
# SAFEBLOCK, the file is read block by block.
#
# @param fp File handle. Must implement a read method.
# @param size Number of bytes to read.
# @return A string containing up to size bytes of data.
def _safe_read(fp, size):
if size <= 0:
return ""
if size <= SAFEBLOCK:
return fp.read(size)
data = []
while size > 0:
block = fp.read(min(size, SAFEBLOCK))
if not block:
break
data.append(block)
size = size - len(block)
return string.join(data, "")
which does not end up with un-realloc'ed blocks.
If you aren't using a hardened reader then it's easy to do a standard denial-of-service attack.
I could come up with hypothetical examples, but they are pretty contrived. I tried to write one which sounded at least somewhat reasonable, but I couldn't. Therefore, I don't think this realloc(3C) behavior can cause a serious denial-of-service attack. At worst it seems to make already bad code somewhat more suspectable to attack.
Teaching Python programming for cheminformatics #
I'm announcing two short training classes in Python programming for chemical informatics. I will teach the first in Leipzig, Germany on 6-7 October. It will be hosted by the Python Academy. I'm still working on the details of the second. It will be in the San Francisco Bay Area in early December.
For details about the price, topics and other information, see http://dalkescientific.com/training/ or send me an email.
I work with and develop software tools for computational chemists, mostly in small-molecule chemistry. These are scientists, not programmers, but they use computers every day as part of their research. Most need to do some programming, pehaps to implement a new algorithm, or more likely to handle something that's too tedious or error prone to do manually, like automating analysis of 10,000 compounds.
Most chemists have some training in programming, usually a semester or two of introductory programming classes in college and a lot of training from the school of experience. The latter usually means informal training from labmates, who are also not programming experts.
I've seen many of my chemists friends work hard at getting software to do what they want it to do. Chemists, like most other people who had to spend years of mostly isolated work to get a PhD, know how to perservere. But they would rather do science, not spend time figuring out how to use software.
My training classes are meant for them. I'll cover different aspects of how to be better at Python programming using examples that are directly relevant to doing small molecule in silico research.Testing the bitslice algorithm for popcount #
This is part 7 of an on-going series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
I want to compute the popcount of molecular fingerprints containing 1024 bits. My algorithm has been to compute the popcount 8, 16, 32, or 64 bits at a time and repeat that for all of the words in the fingerprint. There are a couple of algorithms which compute the popcount of multiple words in parallel by doing a partial popcount of each word and combining the intermediate results to get the result over all words.
Lauradoux Cédric wrote a nice overview on Hamming weight, including explanations of the tree merging and bit-slicing algorithms. In personal email he pointed out this page from comp.arch.arithmetic where someone attributes the algorithm to Robert Harley.
It took me a while to understand it. It's been rather a while since I had to do much thinking about half-adders. About 17 years. But it's not hard. I'm not going to explain it though. It's best in pictures and that will take me too long to do. You can follow those links though.
While he hasn't finished the overview explanation on how to combine the two algorithms, his Hamming weight page includes source.
I've added two bit-slicing implementations to my popcnt.c benchmark. The algorithm from comp.arch.arithmetic processes 7 words at a time and the one from Cédric processes 24 words at a time. Both fall back to other methods for the last few words if the input isn't an exact multiple of 24 or 7.
Cédric implemented several algorithms in his code distributions and included programs to compare their performance. The results suggested that the 24 word bitslice algorithm would be twice as fast as as the 16 bit looup table. I implemented it and found that while it was fast, and in my laptop even faster than the 16 bit lookup table, it was only about 10% faster and not a factor of 2.
% sysctl machdep.cpu.brand_string machdep.cpu.brand_string: Intel(R) Core(TM)2 CPU T7600 @ 2.33GHz % gcc --version i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5367) ... deleted ... % cc -O3 popcnt.c % ./a.out FreeBSD version 1 : 3854904 us; cnt=32511665 FreeBSD version 2 : 3517953 us; cnt=32511665 16-bit LUT : 2127718 us; cnt=32511665 was the fastest 8-bit LUT : 5317244 us; cnt=32511665 8-bit LUT v2 : 3762467 us; cnt=32511665 Wikipedia #2 : 5493756 us; cnt=32511665 Wikipedia #3 : 5295541 us; cnt=32511665 gcc popcount : 6242596 us; cnt=32511665 gcc popcountll : 10402364 us; cnt=32511665 HAKMEM 169/X11 : 4305536 us; cnt=32511665 Bitslice(7) : 2077473 us; cnt=32511665 Bitslice(24) : 1843217 us; cnt=32511665 fastest naive : 23460323 us; cnt=32511665 Wegner/Kernigan : 14976014 us; cnt=32511665 Anderson : 63971600 us; cnt=32511665 8x shift and add : 21728414 us; cnt=32511665 32x shift and add : 19747808 us; cnt=32511665
I dug around for a while and used Toby's comment to try Shark as a profiler. Things made a bit more sense after I remembered to add the '-g' option to gcc so I could get Shark to line up the code and assembly side-by-side. My conclusion is that Cédric's code has too much function call overhead because the 16-bit LUT is being called for every 32 bit number, while the 24 word bitslice code is only called every 24 words. My code doesn't have that overhead, which is why the times are closer to each other.
As before, I wondered how much "-m64" affects things. I'll declare it a tie between the new bitslice code and the old 16-bit lookup table. (While though the numbers aren't the same, the difference is about the same as when I run the same algorithm again.)
% cc -O3 popcnt.c -m64 % ./a.out FreeBSD version 1 : 3902385 us; cnt=32511665 FreeBSD version 2 : 3422655 us; cnt=32511665 16-bit LUT : 1621557 us; cnt=32511665 fastest 8-bit LUT : 2070320 us; cnt=32511665 8-bit LUT v2 : 2094362 us; cnt=32511665 Wikipedia #2 : 2135130 us; cnt=32511665 Wikipedia #3 : 2142576 us; cnt=32511665 gcc popcount : 8128024 us; cnt=32511665 gcc popcountll : 4087390 us; cnt=32511665 HAKMEM 169/X11 : 5805272 us; cnt=32511665 Bitslice(7) : 1728262 us; cnt=32511665 Bitslice(24) : 1698257 us; cnt=32511665 fastest naive : 24990997 us; cnt=32511665 Wegner/Kernigan : 16864887 us; cnt=32511665 Anderson : 12463648 us; cnt=32511665 8x shift and add : 21715482 us; cnt=32511665 32x shift and add : 20003160 us; cnt=32511665
All of this was on one processor and compiler. I decided to try another machine with a slightly different processor and an earlier version of gcc.
%sysctl -w hw.model
hw.model: Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz
%gcc --version
gcc (GCC) 3.4.6 [FreeBSD] 20060305
...
%cc popcnt.c -O3
%./a.out
FreeBSD version 1 : 4270677 us; cnt=32511665
FreeBSD version 2 : 3785047 us; cnt=32511665
16-bit LUT : 2034836 us; cnt=32511665
8-bit LUT : 1935156 us; cnt=32511665 fastest
8-bit LUT v2 : 1987370 us; cnt=32511665
Wikipedia #2 : 5010639 us; cnt=32511665
Wikipedia #3 : 4931266 us; cnt=32511665
gcc popcount : 5793299 us; cnt=32511665
gcc popcountll : 6815575 us; cnt=32511665
HAKMEM 169/X11 : 10788366 us; cnt=32511665
Bitslice(7) : 2467150 us; cnt=32511665
Bitslice(24) : 2469399 us; cnt=32511665
naive : 27722786 us; cnt=32511665
... too slow to worry about ...
In this case I would use the 8-bit lookup table, which is about 20%
faster than the the bitslice code.
Here's the results on an older Pentium 4 (Prescott) CPU 2.80GHz with gcc 3.4.6:
FreeBSD version 1 : 3745509 us; cnt=32500610 FreeBSD version 2 : 3774150 us; cnt=32500610 16-bit LUT : 2762583 us; cnt=32500610 8-bit LUT : 2430362 us; cnt=32500610 fastest 8-bit LUT v2 : 2433684 us; cnt=32500610 Wikipedia #2 : 11509758 us; cnt=32500610 Wikipedia #3 : 8540921 us; cnt=32500610 gcc popcount : 9100055 us; cnt=32500610 gcc popcountll : 10568664 us; cnt=32500610 HAKMEM 169/X11 : 10970609 us; cnt=32500610 Bitslice(7) : 4226871 us; cnt=32500610 Bitslice(24) : 3700020 us; cnt=32500610 naive : 36117662 us; cnt=32500610 Wegner/Kernigan : 46479326 us; cnt=32500610 Anderson : 174285325 us; cnt=32500610 8x shift and add : 90028247 us; cnt=32500610 32x shift and add : 86390410 us; cnt=32500610Bitslice(24) is still the fastest bit-twiddler solution, but quite a bit slower than using an 8-bit lookup table. There is no single best solution. It seems the code is also compiler dependent so I'm installing GCC 4.2 to see how that affects things. That'll probably take a few hours so it's time to sleep.
Okay, fink installed gcc 4.2.2 (the most recent is 4.3.1 but I decided it wasn't worth the time to compile it myself). Here's the numbers on my laptop for 32-bit:
% gcc-4 -O3 popcnt.c % ./a.out FreeBSD version 1 : 4161621 us; cnt=32511665 FreeBSD version 2 : 2930977 us; cnt=32511665 was 3517953 16-bit LUT : 2534380 us; cnt=32511665 was 2127718 8-bit LUT : 4472733 us; cnt=32511665 8-bit LUT v2 : 4499285 us; cnt=32511665 was 3762467 Wikipedia #2 : 5031855 us; cnt=32511665 Wikipedia #3 : 3874931 us; cnt=32511665 was 5295541 gcc popcount : 4808307 us; cnt=32511665 gcc popcountll : 5978393 us; cnt=32511665 HAKMEM 169/X11 : 4315751 us; cnt=32511665 Bitslice(7) : 2111625 us; cnt=32511665 Bitslice(24) : 1443337 us; cnt=32511665 fastest; was 1843217 naive : 51557971 us; cnt=32511665 Wegner/Kernigan : 15616245 us; cnt=32511665 Anderson : 63424235 us; cnt=32511665 8x shift and add : 19179445 us; cnt=32511665 32x shift and add : 19502222 us; cnt=32511665 %and for 64 bit
% gcc-4 -O3 popcnt.c -m64 % ./a.out FreeBSD version 1 : 4192391 us; cnt=32511665 FreeBSD version 2 : 2812570 us; cnt=32511665 was 3422655 16-bit LUT : 1494747 us; cnt=32511665 was 1621557 8-bit LUT : 2182569 us; cnt=32511665 8-bit LUT v2 : 2170191 us; cnt=32511665 was 2094362 Wikipedia #2 : 2001489 us; cnt=32511665 Wikipedia #3 : 1361488 us; cnt=32511665 was 2142576 gcc popcount : 7350367 us; cnt=32511665 gcc popcountll : 3718116 us; cnt=32511665 HAKMEM 169/X11 : 4339839 us; cnt=32511665 Bitslice(7) : 1693849 us; cnt=32511665 Bitslice(24) : 1206318 us; cnt=32511665 fastest; was 1698257 naive : 24914902 us; cnt=32511665 Wegner/Kernigan : 15557531 us; cnt=32511665 Anderson : 8764008 us; cnt=32511665 8x shift and add : 21710211 us; cnt=32511665 32x shift and add : 19583803 us; cnt=32511665 %It looks like things are nearly always faster for 64 bit code, and sometimes faster for 32 bit code.
HAKMEM 169 and other popcount implementations #
This is part 6 of an on-going series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
HAKMEM
Many other people have worked on doing high speed population counts. In 1972 the AI Laboratory at MIT published "HAKMEM", or "Artificial Intelligence Memo No. 239". It is a collection of algorithms, tricks, observations, and unsolved but interesting problems. One of which is to "Solve Go." Item 169 is a PDP-6/10 assembly way to count the number of ones in a word. It was called "count ones" and is now called popcount.
I first came across HAKMEM Item 169 in 2005 when I last looked at doing fast Tanimoto calculations. I tried it and a few other solutions but concluded that using special Altivec instructions were the fastest solution for my PowerPC-based PowerBook. I don't have a PowerPC any more. I bought a MacBook Pro last year with a 2.33 GHz Intel Core 2 Duo. I've been looking at the MMX instructions quizzically trying to figure out if there's an equivalent solution. I am not an assembly programmer.
The best I could find was an AMD "x86 Code Optimization Guide". It gives a 32-bit integer version using normal Intel assembly and a 64-bit version using MMX which "can do popcounts about twice as fast as the integer version (for an identical number of bits)."
The newer guide (I misplaced the URL) says to use the popcount instruction in SSE4a.
Other popcount references
Computing popcount is apparently very widely used. You can find many people have talked about it:
- Bart
Massey:
popcount_naive: 1.6666e+07 iters in 688 msecs for 41.28 nsecs/iter popcount_8: 1e+08 iters in 995 msecs for 9.95 nsecs/iter popcount_6: 2e+08 iters in 1725 msecs for 8.625 nsecs/iter popcount_hakmem: 2e+08 iters in 1411 msecs for 7.055 nsecs/iter popcount_keane: 1e+09 iters in 6566 msecs for 6.566 nsecs/iter popcount_3: 1e+09 iters in 6270 msecs for 6.27 nsecs/iter popcount_2: 1e+09 iters in 5658 msecs for 5.658 nsecs/iter popcount_mult: 1e+09 iters in 4169 msecs for 4.169 nsecs/iter ... At the suggestion of Bill Trost, I added 8 and 16-bit table-driven popcount implementations. These perform the fastest in the benchmark test-stand, at about 3ns/iteration. They're about the same speed, so one would obviously use the 8-bit version.
- Majek has pointers to code. That's where I found the AMD optimization guide link.
- Lauradoux Cédric's Hamming weight
page has an
overview of the topic. He reports:
and nicely summarizes why few people use the fastest solution:Algorithms Throughput Naive 29.97 MB/s Kernighan 33.34 MB/s Tree 1 122.97 MB/s Tree 2 130.60 MB/s Tree 3 125.57 MB/s Table 8 bits 200.94 MB/s Table 16 bits 244.64 MB/s The fastest solutions use pre-computation. However, you will never find this two implementations in current software product (X11, linux kernel. . . ). The reason of that is simple: the programmers are reluctant to use solutions implying a big amount of memory since it can affect the portability of the applications and their performance. As a consequence, the algorithms based on tree are the most popular.
- There was a reddit thread in 2007 titled: Popcount
- the "NSA instruction" missing from modern CPUs (or at least their
datasheets.). In it, a1k0n posted a link to a popcount benchmark
he wrote. After tweaking the 8-bit version slightly for better
performance I got:
FreeBSD version 1 : 3784823 us; cnt=32511665 FreeBSD version 2 : 3421292 us; cnt=32511665 16-bit LUT : 2068947 us; cnt=32511665 8-bit LUT : 3632031 us; cnt=32511665 8x shift and add : 21401951 us; cnt=32511665 32x shift and add : 19419464 us; cnt=32511665
and he also found that lookups were the fastest solution. - There's a gcc bug
report from April 2008 showing that that __builtin_popcount isn't
even as fast as other bit twiddling solutions. Joseph Zbiciak attached
a benchmark. On my machine it shows that gcc's builtin solution
is about 1/3rd the speed of a better one:
popcount64_1 = 4114 clocks gcc's __builtin_popcount popcount64_2 = 1881 clocks popcount64_3 = 1627 clocks popcount64_4 = 1381 clocks
- There's also a number of popcount algorithms documented by Sean Eron Anderson including descriptions and further links.
- Cédric in private mail pointed out a nice collection of popcount implementations from 1998/1999 including a cute 11 bit lookup table and an implementation of Robert Harley's bit slice solution, which is great when doing popcount across several words at a time.
- Finally, you can fetch the GMP package. It contains hand-written assembly implementations for many different architectures. The Intel ones are based on bit twiddling and do not use a lookup table.
Benchmarking popcount in GMP
Interesting. Since I have gmpy installed on this machine I can figure out just how good gmp's hand-written bit twiddling actually is. Here's what I did, with commentary inserted.
# Read the entire data file into memory
>>> s = open("nci.binary_fp").read()
>>> len(s)
155667200
# Convert each character into hex
>>> s = s.encode("hex")
# Convert the hex number into a gmp integer, using base 16
>>> import gmpy
>>> n = gmpy.mpz(s, 16)
# Do the popcount
>>> gmpy.popcount(n)
136963400
# How long did it take?
>>> import time
>>> t1 = time.time(); gmpy.popcount(n); t2 = time.time()
136963400
>>> t2-t1
0.36236190795898438
>>> print (155667200/256)*(t2-t1), "fingerprints per second"
220343.217182 fingerprints per second
# Verify that I got the right answer by using an 8-bit lookup table
>>> bitcounts = (
... 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,)
>>> bitcount_table = dict((chr(i), count) for (i,count) in enumerate(bitcounts))
# Double-check the table creation -- this should be 2
>>> bitcount_table["A"]
2
# Do the full popcount; this will take a while
>>> sum(bitcount_table[c] for c in s)
136963400
>>>
GMP takes about 0.36 seconds to compute one popcount. My C code
before loop unwinding using the 16-bit lookup table took 0.41 seconds
to compute two popcounts, or about 0.2 seconds for one popcount. This
is in very good agreement with the numbers from a1k0n's benchmark (3.4
seconds vs. 2.1 seconds).
Another popcount benchmark
So many different algorithms and the benchmarks aren't compatible enough that I could compare them. I decided to expand a1k0n's benchmark and include a few other implementations. The result is popcnt.c. Yes, I didn't even change the name. It implements quite a few algorithms but it's still incomplete. Anyone want to contribute assembly versions using MMX instructions?
Here's what my laptop reports. It's a 2.33 GHz Intel Core 2 Duo and I compiled with gcc 4.0.1:
% cc popcnt.c -O3 % ./a.out FreeBSD version 1 : 3828096 us; cnt=32511665 FreeBSD version 2 : 3474751 us; cnt=32511665 16-bit LUT : 2095066 us; cnt=32511665 8-bit LUT : 5329692 us; cnt=32511665 8-bit LUT v2 : 3691475 us; cnt=32511665 8x shift and add : 21675609 us; cnt=32511665 32x shift and add : 19687827 us; cnt=32511665 Wikipedia #2 : 5264710 us; cnt=32511665 Wikipedia #3 : 5037636 us; cnt=32511665 gcc popcount : 6088138 us; cnt=32511665 gcc popcountll : 7548551 us; cnt=32511665 naive : 23647023 us; cnt=32511665 Wegner/Kernigan : 15026099 us; cnt=32511665 Anderson : 63285941 us; cnt=32511665 HAKMEM 169/X11 : 4205008 us; cnt=32511655The fastest implementation is still the 16-bit lookup table, at 40% faster than that fastest bit-twiddling code. The "8-bit LUT" implementation is by a1k0n. I noticed it was a lot slower than my code so I investigated. The only difference was the order of calculations for a simple addition. The "8-bit LUT v2" is my version. It's interesting to see that that minor change causes a major timing difference. And HAKMEM is holding its own; it's even better than the Wikipedia ones, which I didn't expect.
A Core 2 Duo is a 64-bit processor. I can ask gcc to compile for 64-bits and see if that makes a difference.
% cc popcnt.c -O3 -m64 % ./a.out FreeBSD version 1 : 3882201 us; cnt=32511665 FreeBSD version 2 : 3382603 us; cnt=32511665 16-bit LUT : 1562167 us; cnt=32511665 8-bit LUT : 2013767 us; cnt=32511665 8-bit LUT v2 : 2054639 us; cnt=32511665 8x shift and add : 21679291 us; cnt=32511665 32x shift and add : 20050205 us; cnt=32511665 Wikipedia #2 : 2113876 us; cnt=32511665 Wikipedia #3 : 2102645 us; cnt=32511665 gcc popcount : 8081774 us; cnt=32511665 gcc popcountll : 4217908 us; cnt=32511665 naive : 27717913 us; cnt=32511665 Wegner/Kernigan : 16618378 us; cnt=32511665 Anderson : 12300547 us; cnt=32511665 HAKMEM 169/X11 : 5643968 us; cnt=32511655It does, for some things. The lookup tables are faster (and the 8-bit counts are now the same), and the 64-bit algorithms (Wikipedia, gcc popcountll and Anderson) are much happier. The parallel algorithm used by the Wikipedia implementations are almost as fast as the 8-bit lookup table and only 33% slower than the 16-bit lookup table. While HAKMEM got 25% slower.
Is there any way I can ask gcc to use the 64-bit instructions but still talk to a Python which was compiled for 32 bits?
Profile-directed optimizations with gcc
I recently learned how to do profile-directed optimizations with gcc using the -fprofile-arcs and -fbranch-probabilities options. Sometimes the compiler needs to make a guess on which assembly code is best for a given circumstance. For instance, if a branch for an 'if' statement is usually taken then it might provide hints to the processor pipline to optimistically assume the branch will be taken. It can't guess correctly all the time. Profile-directed optimizations works by asking gcc during compilation to instrument the program, running the program to get timings on how the code is actually used, saving the results to a file, then recompiling while asking gcc to use the instrumentation data.
Is it useful for my case?
compile with instrumentation - these will be slower than usual!
% cc popcnt.c -O3 -fprofile-arcs
% ./a.out
FreeBSD version 1 : 4304773 us; cnt=32511665
FreeBSD version 2 : 4062710 us; cnt=32511665
16-bit LUT : 2835959 us; cnt=32511665
8-bit LUT : 5344514 us; cnt=32511665
8-bit LUT v2 : 4535290 us; cnt=32511665
8x shift and add : 22246178 us; cnt=32511665
32x shift and add : 19915850 us; cnt=32511665
Wikipedia #2 : 5626697 us; cnt=32511665
Wikipedia #3 : 5318770 us; cnt=32511665
gcc popcount : 7005410 us; cnt=32511665
gcc popcountll : 8173944 us; cnt=32511665
naive : 62215630 us; cnt=32511665
Wegner/Kernigan : 34541338 us; cnt=32511665
Anderson : 65256755 us; cnt=32511665
HAKMEM 169/X11 : 4727297 us; cnt=32511665
See the profile data?
% ls -l popcnt.*
-rw-r--r-- 1 dalke staff 9177 Jul 3 05:01 popcnt.c
-rw-r--r-- 1 dalke staff 1816 Jul 3 04:43 popcnt.gcda
-rw-r--r-- 1 dalke staff 7872 Jul 3 04:43 popcnt.gcno
recompile using the profile data
% cc popcnt.c -O3 -fbranch-probabilities
% ./a.out
FreeBSD version 1 : 3815036 us; cnt=32511665
FreeBSD version 2 : 3477915 us; cnt=32511665
16-bit LUT : 2251514 us; cnt=32511665
8-bit LUT : 5291010 us; cnt=32511665
8-bit LUT v2 : 3936142 us; cnt=32511665
8x shift and add : 21715704 us; cnt=32511665
32x shift and add : 19708580 us; cnt=32511665
Wikipedia #2 : 5273262 us; cnt=32511665
Wikipedia #3 : 5074212 us; cnt=32511665
gcc popcount : 6431689 us; cnt=32511665
gcc popcountll : 7562124 us; cnt=32511665
naive : 28044028 us; cnt=32511665
Wegner/Kernigan : 15280455 us; cnt=32511665
Anderson : 63431368 us; cnt=32511665
HAKMEM 169/X11 : 4209093 us; cnt=32511665
No significant differences over the normal 32-bit code. Nor did I see
any real differences if I compiled with the architecture-specific
flags:
cc popcnt.c -O3 -march=nocona
Domain applicability
I expect people will come across this page when looking for popcount implementations, see my numbers, and end up using the 16-bit lookup table. That might be the right solution but bear in mind domain applicability. I plan to compute popcount over megabytes. Most people, like the proverbial implementation in X11, work on only a few words.
It takes time to bring the lookup table into memory. On modern machines it's something like tens of instructions. If you're only doing a few popcounts, or a lot of popcounts but not all the time, then it's probably best to use bit-twiddling. It's easier to implement and for that case likely faster.
Faster block Tanimoto calculations #
This is part 5 of an on-going series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
After much analysis and digging in part 4 I managed to get my C extension to Python to compute fingerprints pretty quickly. I estimated that a fingerprint search of PubChem (19,218,991 1024 bit fingerprints) would take about 13.5 seconds, but that's still almost a factor of two worse than my estimate of 8 seconds. What's wrong?
To start with, the data bandwidth from the disk is not the limiting factor. Processing each byte of my test set using file reads takes 0.26 seconds
% time wc -l nci.binary_fp 1098650 nci.binary_fp 0.165u 0.089s 0:00.26 92.3% 0+0k 0+0io 0pf+0wwhile the code takes 0.84 seconds to run, and 0.50 seconds if I leave out the actual popcount lookups. Did the code for the popcount lookups change?
Actually, yes, it did. My proof-of-concept code used 32 bit integers and bit shifting:
int popcount(unsigned int x) {
return (_popcount_counts[x & 0xff] +
_popcount_counts[x >> 8 & 0xff] +
_popcount_counts[x >> 16 & 0xff] +
_popcount_counts[x >> 24 & 0xff]);
}
while my more recent code uses a byte lookup:
for (i=0; i<num_bytes; i++) {
b += _popcount_counts[targets[i]];
c += _popcount_counts[query[i]&targets[i]];
}
The most obvious difference is that I'm working with 32 bits at a
time. This won't work if the fingerprint has, say, 40 bits in it. On
the plus side, I don't need to check for the end of the loop after
every 8 bits, so that reduces the branch tests by 75%. (Branches in
modern computers cause performance problems. Some details at the
Wikipedia page on Branch
predictor.
Out of curiousity I generated the assembly code for the corresponding popcount functions (using "gcc -S"). Here's the relevant source:
int byte_lookup(int num_bytes, char *query) {
int a=0, i;
for (i=0; i<num_bytes; i++) {
a += _popcount_counts[query[i]];
}
return a;
}
int int_lookup(int num_bytes, int *query) {
int a=0, q, i;
for (i=0; i<num_bytes/4; i+=4) {
q = query[i];
a += (_popcount_counts[q & 0xff] +
_popcount_counts[q >> 8 & 0xff] +
_popcount_counts[q >> 16 & 0xff] +
_popcount_counts[q >> 24 & 0xff]);
}
return a;
}
This is the inner loop of "byte_lookup"
L5:
movl 12(%ebp), %esi
movsbl (%esi,%edx),%eax
movl -16(%ebp), %esi
addl (%esi,%eax,4), %ecx
addl $1, %edx
cmpl %edx, %edi
jne L5
and this is the inner loop of "int_lookup".
L13:
movl -16(%ebp), %eax
movl 12(%ebp), %edx
movl (%edx,%eax,4), %ecx
movzbl %cl,%eax
movzbl %ch, %esi
movl (%edi,%eax,4), %edx
addl (%edi,%esi,4), %edx
movl %ecx, %eax
sarl $16, %eax
andl $255, %eax
addl (%edi,%eax,4), %edx
shrl $24, %ecx
addl (%edi,%ecx,4), %edx
addl %edx, -20(%ebp)
addl $4, -16(%ebp)
movl -24(%ebp), %eax
cmpl %eax, -16(%ebp)
jl L13
The first has 7 instrutions and one branch test but it's called 4
times more often than the second, which has 18 instructions and 1
branch test. 4*7=28, which is more than 18. I don't know how long it
takes the CPU to run those instructions, so I'll just assume that all
instructions take the same time, or about 33% faster. This isn't
always true because some instructions are faster than others, and
pipelining and other CPU tricks make things even more complicated.
My Python extension using bytes for the lookups is about 40% slower than my high-speed C test case using ints. These are commensurate numbers, or at least suggestive. It looks like I should rework my code to use integers. But there's something I need to worry about first.
__builtin_popcount and data alignment
Earlier I talked about using the GNU C/C++ "__builtin_popcount" extension to compute the popcount of an integer but I've been using my own implementation based on the _popcount_counts lookup table using bytes. I did this for a few reasons. You might not be using the GNU compiler, though that's a low probability. You might want to support fingerprints which are a multiple of 8 but not of 32, though all my tests are for 1024 bit fingerprints. The main reason was I was worried about memory alignment concerns.
Some CPUs require non-character arrays to be memory aligned. Even for CPUs which don't have that requirement (like Intel machines), non-aligned access is slower than aligned access. I'm using Python to pass a character pointer to a C function, but I don't know if that pointer is correctly aligned to be used directly as an integer pointer. I decided it was best to ignore the problem until I got the code working.
The Python string object is implemented in C, and my C extension gets a pointer to the underlying string data. Is it already aligned such that I can cast the string directly into a "int *"? I could figure it in a couple of ways. I could look at the source code and working out what's going on, or I can run the code and asking it. I'll do both, starting with the first one, which is the hard one, first.
The layout for the string class is in "Include/stringobject.h" and is:
typedef struct {
PyObject_VAR_HEAD
long ob_shash;
int ob_sstate;
char ob_sval[1];
/* Invariants:
* ob_sval contains space for 'ob_size+1' elements.
* ob_sval[ob_size] == 0.
* ob_shash is the hash of the string or -1 if not computed yet.
* ob_sstate != 0 iff the string object is in stringobject.c's
* 'interned' dictionary; in this case the two references
* from 'interned' to this object are *not counted* in ob_refcnt.
*/
} PyStringObject;
The "PyObject_VAR_HEAD" is used ... well, if you're really interested
then read the documentation in "Include/object.h". The summary is
that it's a macro which in non-debug builds will expand to
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
Py_ssize_t ob_size; /* Number of items in variable part */
Count up the space needed for the variables and you'll see that the
"ob_sval[1]" occurs on a multiple of 4 bytes. It is aligned such that
it can be cast into an integer pointer without problems.
The easier way is to add a printf statement in the block_similarity function call, like this:
printf("At %p %d %d\n", query, ((int)query) % 4, ((int)query) % 8);
This prints the pointer location and the location modulo 4 (for 32 bit
'int' alignment) and module 8 (for 64 bit 'long long' alignment). If
it's integer aligned then the modulo result should be 0. When I ran
this I got:
At 0x13f1884 0 4This means the array is integer aligned (because of the '0') but not long aligned (because of the '4'). This agrees with what I figured out from looking at the typedef layout. I can therefore cast the "char *" pointer to an "int *" then use __builtin_popcount on the values, and not worry about alignment problems. On the other hand, I can't use __builtin_popcountll (the difference is the 'll' a the end; this one is for "long long"s).
My benchmark
I'm going to focus on making the Tanimoto calculations faster. As I showed in earlier essays, my program's total run-time was only a rough estimate for the calculation time. It included Python's start-up cost (importing numpy took 0.25 seconds!) and the implementation choice of reading a file vs. using a memory mapped file.
I created a new benchmark program called "time_block_similarity.py" which does exactly what it says. I'll continue to use the "nci.binary_fp" file as my reference data file, and to avoid disk issues I'll read the entire file into memory before doing processing. It's only 150 MB and I have more than that amount of free memory so there shouldn't be any swapping.
# time_block_similarity.py
import time
import tanimoto
# This is 155,667,200 bytes; 1,216,150 1024-bit fingerprints
targets = open("nci.binary_fp", "rb").read()
# Use the first 1024 bits as the query
query = targets[:1024//8]
t1 = time.time()
results = tanimoto.block_similarity(query, targets)
t2 = time.time()
N = len(results)
print "%s calculations in %.3f seconds; %.0f per second" % (N, t2-t1, N/(t2-t1))
print "Estimated PubChem time: %.1f seconds" % (19218991 * (t2-t1)/N,)
print "First few are:"
for x in results[:5]:
print x
Using my existing byte lookup implementation my fastest-of-three time was:
1216150 calculations in 0.750 seconds; 1620963 per second Estimated PubChem time: 11.9 seconds First few are: 1.0 0.492537313433 0.450980392157 0.546511627907 0.505319148936That's the time I want to beat.
Faster block_similarity using integers
Changing libtanimoto.c:block_similarity to use integers instead of characters is not hard. The following assumes that integers are 32 bits, the 'query' and 'targets' are integer aligned, and that 'num_bytes' is a multiple of 4.
/* This is added to libtanimoto.c */
void block_similarity(int num_bytes, unsigned char *query,
int num_targets, unsigned char *targets,
double *results) {
int a=0, b=0, c=0, i;
int *query_as_int = (int *) query;
int *targets_as_int = (int *) targets;
int num_ints_in_fp = num_bytes / sizeof(int);
unsigned int fp_word;
/* precompute popcount(query) */
for (i=0; i<num_ints_in_fp; i++) {
fp_word = query_as_int[i];
a += (_popcount_counts[fp_word & 0xff] +
_popcount_counts[fp_word >> 8 & 0xff] +
_popcount_counts[fp_word >> 16 & 0xff] +
_popcount_counts[fp_word >> 24 & 0xff]);
}
/* For each fingerprint in the block */
while (num_targets-- > 0) {
/* compute popcount(target) and popcount(target&query) */
b=0; c=0;
for (i=0; i<num_ints_in_fp; i++) {
fp_word = targets_as_int[i];
b += (_popcount_counts[fp_word & 0xff] +
_popcount_counts[fp_word >> 8 & 0xff] +
_popcount_counts[fp_word >> 16 & 0xff] +
_popcount_counts[fp_word >> 24 & 0xff]);
fp_word &= query_as_int[i];
c += (_popcount_counts[fp_word & 0xff] +
_popcount_counts[fp_word >> 8 & 0xff] +
_popcount_counts[fp_word >> 16 & 0xff] +
_popcount_counts[fp_word >> 24 & 0xff]);
}
/* Append the result and go on to the next fingerprint */
*results = ((double)c)/(a+b-c);
results++;
targets_as_int += num_ints_in_fp;
}
}
and my best-of-three output from this is:
1216150 calculations in 0.411 seconds; 2957097 per second Estimated PubChem time: 6.5 seconds First few are: 1.0 0.492537313433 0.450980392157 0.546511627907 0.505319148936That's almost half the time! (45% faster if you want to be picky.) I'm also at about the time I predicted from my prototype. But I can do better.
Loop unrolling
One reason the integer based code became faster was it's now doing one memory fetch instead of 4. Another reason is that it does about 1/4th the number of branches. Branches cause CPUs to cry. I now assume the fingerprint length is a multiple of 4 bytes, so I can do four calculations before checking if I've reached the end.
This technique is called loop unrolling. (Despite what Wikipedia claims, most people use the term 'loop unrolling' and do not use 'loop unwinding'.) As a simple example, here's a snippet of code which would compute the bitcount for a fingerprint of length 256 bits, which is 8 32-bit integers.
count = 0;
for (i=0; i<8; i++) {
count = __builtin_popcount(fp[i]);
}
The snippet uses the loop to make the code shorter. I can unwind it
completely, like this:
count = (__builtin_popcount(fp[0]) +
__builtin_popcount(fp[1]) +
__builtin_popcount(fp[2]) +
__builtin_popcount(fp[3]) +
__builtin_popcount(fp[4]) +
__builtin_popcount(fp[5]) +
__builtin_popcount(fp[6]) +
__builtin_popcount(fp[7]));
This new version contains no tests and no branches. CPUs like that,
and so do compilers.
In my benchmark I know the test fingerprints contain 1024 bits. I want to see if this idea is useful so I'll write a version which completely unrolls the loop in the block_similarity function which computes 'b' and 'c'. The downside with unrolling is there's a lot more code. I'll work around that a bit by using a macro, but the result is still ugly.
void block_similarity(int num_bytes, unsigned char *query,
int num_targets, unsigned char *targets,
double *results) {
int a=0, b=0, c=0, i;
int *query_as_int = (int *) query;
int *targets_as_int = (int *) targets;
int num_ints_in_fp = num_bytes / sizeof(int);
unsigned int fp_word;
/* precompute popcount(query) */
for (i=0; i<num_ints_in_fp; i++) {
fp_word = query_as_int[i];
a += (_popcount_counts[fp_word & 0xff] +
_popcount_counts[fp_word >> 8 & 0xff] +
_popcount_counts[fp_word >> 16 & 0xff] +
_popcount_counts[fp_word >> 24 & 0xff]);
}
/* For each fingerprint in the block */
while (num_targets-- > 0) {
/* compute popcount(target) and popcount(target&query) */
b=0; c=0;
i=0;
#define _B_AND_C_POPCOUNT \
fp_word = targets_as_int[i]; \
b += (_popcount_counts[fp_word & 0xff] + \
_popcount_counts[fp_word >> 8 & 0xff] + \
_popcount_counts[fp_word >> 16 & 0xff] + \
_popcount_counts[fp_word >> 24 & 0xff]); \
fp_word &= query_as_int[i++]; \
c += (_popcount_counts[fp_word & 0xff] + \
_popcount_counts[fp_word >> 8 & 0xff] + \
_popcount_counts[fp_word >> 16 & 0xff] + \
_popcount_counts[fp_word >> 24 & 0xff]);
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
/* Append the result and go on to the next fingerprint */
*results = ((double)c)/(a+b-c);
results++;
targets_as_int += num_ints_in_fp;
}
}
Does it make any difference?
1216150 calculations in 0.306 seconds; 3968043 per second Estimated PubChem time: 4.8 seconds First few are: 1.0 0.492537313433 0.450980392157 0.546511627907 0.505319148936Nice! This is about 25% faster than the previous best time.
16 bit lookup table
There are still more tricks up my sleve. I wondered how much time was spent in computing the popcount for a fingerprint so I removed the calculation of 'b'
b += (_popcount_counts[fp_word & 0xff] + \
_popcount_counts[fp_word >> 8 & 0xff] + \
_popcount_counts[fp_word >> 16 & 0xff] + \
_popcount_counts[fp_word >> 24 & 0xff]); \
from my unwound macro definition. This means 'b' will always be 0,
and the scores will be completely wrong. The resulting report was:
1216150 calculations in 0.185 seconds; 6562436 per second Estimated PubChem time: 2.9 seconds First few are: inf 2.41463414634 1.91666666667 2.04347826087 2.11111111111which means about 40% of the time is spent computing one of the fingerprints. Which makes sense as the entire point is to compute two fingerprints and I'm trying to avoid other overhead.
Can I make those faster? I've been using an 8 bit lookup table. I could cut the number of operations in half by using a 16 bit lookup. That table only takes about 64K of memory, which fits into cache these days. Would that be faster?
The 8 bit lookup table was small enough that I could write it in 10 lines of code. The 16 bit table I did for my own proof-of-concept testing took 2050 lines. I don't want to list that here. Instead, I'll define space for the entire array and an 'init' function in libtanimoto.c. After the Python module, via ctypes, loads the module it will call the initialization function.
It's been a long time since I've shown the full "libtanimoto.c" and "tanimoto.py" files. I've made lots of small changes over that time so it's best to resynchronize. Here are complete copies of each file, with the important new parts in bold:
/* libtanimoto.c */
/* Define space for the 16-bit lookup table */
/* Bootstrap using values for the 8-bit table */
static char _popcount_counts[65536] = {
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};
/* Bootstrap the table. This will be called by 'tanimoto.py' */
void init_tanimoto(void) {
int i;
/* If I know you are using GCC then you could call __builtin_popcount */
for (i=256; i<65536; i++) {
_popcount_counts[i] = (_popcount_counts[i & 0xff] +
_popcount_counts[i >> 8]);
}
}
/* Compute the Tanimoto similarity between two fingerprints */
double similarity(int num_bytes,
unsigned char *query, unsigned char *target) {
int a=0, b=0, c=0, i;
for (i=0; i<num_bytes; i++) {
a += _popcount_counts[query[i]];
b += _popcount_counts[target[i]];
c += _popcount_counts[query[i]&target[i]];
}
return ((double)c)/(a+b-c);
}
/* Compute the similarity beteen a query fingerprint and a block
of target fingerprints. Result values go into 'results'. */
/* This assumes num_bytes == 32 and that query and targets are int aligned */
void block_similarity(int num_bytes, unsigned char *query,
int num_targets, unsigned char *targets,
double *results) {
int a=0, b=0, c=0, i;
int *query_as_int = (int *) query;
int *targets_as_int = (int *) targets;
int num_ints_in_fp = num_bytes / sizeof(int);
unsigned int fp_word;
/* precompute popcount(query) */
for (i=0; i<num_ints_in_fp; i++) {
fp_word = query_as_int[i];
a += (_popcount_counts[fp_word & 0xffff] +
_popcount_counts[fp_word >> 16]);
}
/* For each fingerprint in the block */
while (num_targets-- > 0) {
/* compute popcount(target) and popcount(target&query) */
b=0; c=0;
i=0;
#define _B_AND_C_POPCOUNT \
fp_word = targets_as_int[i]; \
b += (_popcount_counts[fp_word & 0xffff] + \
_popcount_counts[fp_word >> 16]); \
fp_word &= query_as_int[i++]; \
c += (_popcount_counts[fp_word & 0xffff] + \
_popcount_counts[fp_word >> 16]);
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
_B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
/* Append the result and go on to the next fingerprint */
*results = ((double)c)/(a+b-c);
results++;
targets_as_int += num_ints_in_fp;
}
}
and the high-level Python interface to it:
# tanimoto.py - wrapper to the libtanimoto shared library
import ctypes, ctypes.util
import numpy
# Load the shared library (the extension depends on the OS type)
libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))
# Call its initialization function
_init_tanimoto = libtanimoto.init_tanimoto
_init_tanimoto.argtypes = []
_init_tanimoto.restype = None
_init_tanimoto()
# Annotate the function call argument and result types
_similarity = libtanimoto.similarity
_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
_similarity.restype = ctypes.c_double
# Define the high-level public interface as a Python function
def similarity(query, target):
n = len(query)
m = len(target)
if n != m:
raise AssertionError("fingerprints must be the same length")
return _similarity(n, query, target)
_block_similarity = libtanimoto.block_similarity
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
ctypes.c_int, ctypes.c_char_p,
ctypes.POINTER(ctypes.c_double)]
_block_similarity.restype = None
# The high-level public interface
def block_similarity(query, targets):
n = len(query)
m = len(targets)
if m % n != 0:
raise AssertionError("length mismatch")
num_targets = m//n
result = numpy.zeros(num_targets, numpy.double)
result_ctypes = ctypes.cast(result.ctypes.data,
ctypes.POINTER(ctypes.c_double))
_block_similarity(n, query, num_targets, targets, result_ctypes)
return result
When I ran it I found:
1216150 calculations in 0.231 seconds; 5269985 per second Estimated PubChem time: 3.6 seconds First few are: 1.0 0.492537313433 0.450980392157 0.546511627907 0.505319148936The code is about 25% faster than the previous fastest version. The original byte lookup implementation at the start of this essay took 0.750 seconds, so the code is now about 3 times faster. Although it's not good enough for general use because it's hard-coded to 1024 bit, integer aligned fingerprints.
That's okay. I can add a test for that and if the input meets all the right criteria then call the specialized function, otherwise use the more general purpose but slower function. Not something I'm going to worry about now, but feel free to do it yourself. It might start something like:
void block_similarity(int num_bytes, unsigned char *query,
int num_targets, unsigned char *targets,
double *results) {
/* ensure the data is all integer aligned */
if (query % 4 == 0 && targets % 4 == 0) {
switch(num_bytes) {
case 128: .. specialized code for 1024 bits ..
case 64: .. specialized code for 512 bits ..
case 40: .. specialized code for the MDL MACCS 320 keys ..
case 32: .. specialized code for 256 bits ..
.. more special cases ..
default: break;
}
}
... general purpose code goes here ..
Block Tanimoto calculations #
This is part 4 of an on-going series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
One query, many targets
In part 3 I wrote a C function to compute the Tanimoto similarity score between two fingerprints. The Tanimoto is most often used when searching a large set of target compounds. Some questions are "what's the Tanimoto similarity for every compound in the data set", "what are the 10 target compounds that are most similar to the query, sorted by similarity, and what are their similarity values?" and "how many compounds are at least some specified similarity threshold to the query compound?"
With the Tanimo.similarity function extension I defined earlier, it's simple to implement these algorithms. Here's one which prints the Tanimoto score for every compound in a given ".binary_fp" file using the first compound as the query.
# print_similarity.py
import sys
import tanimoto
filename = "drugs.binary_fp"
if len(sys.argv) > 1:
filename = sys.argv[1]
# Get the query fingerprint
f = open(filename, "rb")
query_fp = f.read(128) # 1024 bits
f.seek(0)
while 1:
target_fp = f.read(128) # 1024 bits
if not target_fp:
# reached the end of file
break
print tanimoto.similarity(query_fp, target_fp)
Here's some output to show you what it writes:
% python print_similarity.py nci.binary_fp | head 1.0 0.492537313433 0.450980392157 0.546511627907 0.505319148936 0.686813186813 0.132701421801 0.132701421801 0.602564102564 0.459330143541 Traceback (most recent call last): File "print_similarity.py", line 19, inHow fast is this code? I took the PubChem SD file for compounds "09425001 - 09450000" and converted it into a Babel fingerprint file, and thence into a ".binary_fp" file.print tanimoto.similarity(query_fp, target_fp) IOError: [Errno 32] Broken pipe
% babel Compound_09425001_09450000.sdf.gz nci_09425001_09450000.fpt -xh 24323 molecules converted 364134 audit log messages % python fpt_to_binary.py nci_09425001_09450000.fpt % ls -l nci_09425001_09450000.* -rw-r--r-- 1 dalke staff 3113344 Jun 9 16:30 nci_09425001_09450000.binary_fp -rw-r--r-- 1 dalke staff 8215091 Jun 9 16:20 nci_09425001_09450000.fpt % python -c 'print 3113344/24323.' 128.0I concatenated 50 copies of the binary file together to get a single large file, called "nci.binary_fp":
% ls -l nci.binary_fp -rw-r--r-- 1 dalke staff 155667200 Jun 9 16:46 nci.binary_fp %I had to make it this large because I wanted it to be somewhat realistic. This corresponds to a fingerprint database of 1,216,150 compounds.
How long does it take to print the Tanimoto scores for the entire data set?
% time python print_similarity.py nci.binary_fp > /dev/null 9.419u 0.442s 0:09.87 99.7% 0+0k 0+0io 0pf+0w %Decent. It means that PubChem (19,218,991 compounds) would take about 2.5 minutes, which is several times faster than the 7 minutes I estimated in pure Python, but still quite a bit slower than the 20 seconds I estimated was the limit using pure C.
Push blocks of targets to the C code
Where is the extra time being spent? Probably because I'm doing a lot of work in Python in order to do a very little bit of work in C. Every Tanimoto calculation requires a Python function, plus another to read 1024 bits, plus a test to see if it's the end of file. Python function calls take a lot of time compared to C function calls.
I could do some profiling to figure out if this guess is correct but it's about as easy to implement another solution and compare the timings. Rather than computing a single Tanimoto at a time I'm going to pass many fingerprints to the C function, and get a list of Tanimoto values back.
One of the points I've been bring up in these writings is that there are many solutions to a problem. In this case I'm going to pass a set of fingerprints to a C function, but I get to decide how I want to pass the data in. I could use a Python list of strings, or a Numpy array of unsigned integers, or many more possibilities. The solutions I'm exploring are based on experience but it doesn't mean I'm right, nor does it mean I can get to a good solution without some exploration.
tanimoto.block_similarity - my desired API
For my solution now I'm going to pass the set of query fingerprints as a single string, with the first fingerprint as the first 128 bytes, the second as the next 128 bytes, and so on. I'll call this single string which contains many fingerprints a "block" so my new function will be called "block_similarity." I want the result to be something like a Python list, where I can iterate over the results for each fingerprint, in order. In code, I want the main loop of the calcultion to act like the following interface:
while 1:
target_fps = f.read(128*1024) # 1024 fingerprints of 1024 bits each
if not target_fps:
# reached the end of file
break
for score in tanimoto.block_similarity(query_fp, target_fps):
print score
I want the results returned as a list but C functions can only return a simple data type like an integer or a pointer. Again, there are several possible solutions. I'll have my Python wrapper module allocate space for the results and pass a pointer to that space to the C function, which assigns the correct values.
libtanimoto.c:block_similarity
With most of the hard work of allocating results handled by some yet-to-be-written Python code, the C code isn't that hard:
/* This is in libtanimoto.c */
void block_similarity(int num_bytes, unsigned char *query,
int num_targets, unsigned char *targets,
double *results) {
int a=0, b=0, c=0, i;
/* precompute popcount(query) */
for (i=0; i<num_bytes; i++) {
a += _popcount_counts[query[i]];
}
/* For each fingerprint in the block */
while (num_targets-- > 0) {
/* compute popcount(target) and popcount(target&query) */
b=0; c=0;
for (i=0; i<num_bytes; i++) {
b += _popcount_counts[targets[i]];
c += _popcount_counts[query[i]&targets[i]];
}
/* Append the result and go on to the next fingerprint */
*results = ((double)c)/(a+b-c);
results++;
targets += num_bytes;
}
}
Because I expect to do a Tanimoto calculation using many targets I
could do a bit of optimization. The Tanimoto function is "c/(a+b-c)"
where "a" is the number of on-bits in the query fingerprint. That's a
constant so I compute it once.
tanimoto.py:block_similarity
The hard work of allocating memory goes into the 'tanimoto.py' file. I decided to use a numpy array to store the results because they are designed to work with ctypes. I can pass the underlying data storage area to a C function with only a 'cast'. Numpy also supports functions like 'numpy.sort' so working with numpy arrays can be more convenient than working with a memory block allocated directly by ctypes, which is my other likely possibility. On the other hand it does put a third-party dependency in the library.
Here's the code:
# This is added to 'tanimoto.py'
_block_similarity = libtanimoto.block_similarity
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
ctypes.c_int, ctypes.c_char_p,
ctypes.POINTER(ctypes.c_double)]
_block_similarity.restype = None
import numpy
def block_similarity(query, targets):
n = len(query)
m = len(targets)
if m % n != 0:
raise AssertionError("length mismatch")
num_targets = m//n
result = numpy.zeros(num_targets, numpy.double)
result_ctypes = ctypes.cast(result.ctypes.data,
ctypes.POINTER(ctypes.c_double))
_block_similarity(n, query, num_targets, targets, result_ctypes)
return result
Looks easy, right? You'll be surprised at how long it took to get
that working. I figured that because the numpy array is a
'numpy.double' then ctypes would see it as a ctypes.c_double array,
without needing the cast. Instead, without the case ctypes complains:
ctypes.ArgumentError: argument 5: <type 'exceptions.TypeError'>: expected LP_c_double instance instead of numpy.ndarrayStrange. But what I did seems to work, so hopefully others who are confused will search on the error message and find this essay useful.
After I compiled the modified libtanimoto shared library (using 'scons') and updated the 'tanimoto.py' wrapper library I did a quick test:
>>> tanimoto.block_similarity("A", "AaB")
array([ 1. , 0.66666667, 0.33333333])
>>>
>>> [hex(ord(c)) for c in "AaB"]
['0x41', '0x61', '0x42']
>>>
In this case the query fingerprint is 8 bits long (the character 'A",
which is the bit pattern 01000001) and the target fingerprint block
contains three 8 bit fingerprints (01000001, 11000001 and 01000010).
The result should be three Tanimoto scores, of values 1.0, 2/3 and 1/3
each. Which I got. It seems to work.
Not shown here is the step where I compared the results of the "block_similarity" function to one based on the older "similaity" function, to make sure they gave the same results.
Computing all Tanimoto scores using block_similarity
Once I was convinced I got the right answers, I wrote a "print_similarity2.py" function which works the same as "print_similarity.py" (it calculates and prints all of the Tanimoto scores for the given fingerprint file) except it uses the block-based similarity calculations.
# print_similarity2.py
import sys
import tanimoto
filename = "drugs.binary_fp"
if len(sys.argv) > 1:
filename = sys.argv[1]
# Get the query fingerprint
f = open(filename, "rb")
query_fp = f.read(128) # 1024 bits
f.seek(0)
while 1:
target_fps = f.read(128*1024) # 1024 fingerprints of 1024 bits each
if not target_fps:
# reached the end of file
break
for score in tanimoto.block_similarity(query_fp, target_fps):
print score
This code takes about 4.1 seconds to run on my 'nci.binary_fp' data set:
% time python print_similarity2.py nci.binary_fp > /dev/null 3.571u 0.546s 0:04.12 99.7% 0+0k 0+33io 0pf+0wwhich is quite a bit better than the 9.9 seconds of "print_similarity.py". Searching PubChem this way should take about 65 seconds, which is about 1/3rd of the speed I expect from C code.
The effects of block size
The code uses a tunable parameter, which is the number of fingerprints to include in a block. I timed a variety of block sizes to figure out what is optimal for my machine:
number
of fps time
in block (in s)
1 42.06
2 23.41
5 11.85
10 8.31
25 5.68
50 4.82
100 4.46
1000 4.17
5000 4.17
10000 4.22
100000 5.15
It looks like my chosen size of 1024 is a good one. It gives good
performance while minimizing memory use. Your machine might be
different. (Note that I used my computer to listen to music during
this, which is why the numbers are slightly slower than 4.12 seconds.)
My data set is "only" 155,667,200 bytes long (149 MB). It easily fits into memory. I was curious on how much time could be saved if I didn't do any I/O, that is, if I read the entire data file into a string and only computed the time needed to run "block_similarity" on that string. Result: 3.30s, or an estimated PubChem search time of 52 seconds. About 0.9 seconds is being spent in Python startup time and reading the data file.
Why's my code so slow?
Almost all of my time is in C code, yet I'm still at 1/3rd the time I expect from the fastest possible C code. What's going on? There's still a few differences between my code and the "fast" code I wrote.
- The fast code repeated the same calculations many times, which meant it didn't need to fetch data from disk and memory to the CPU.
- My block_similarity wrapper code allocated memory for the results
- I iterate over the results, and print them to /dev/null
- My popcount algorithm uses a lookup table, not __builtin_popcount()
for score in tanimoto.block_similarity(query_fp, target_fp):
#print score
pass
I no longer print out the score. When I run the code I get:
% time python print_similarity2.py nci.binary_fp 1.301u 0.538s 0:01.84 99.4% 0+0k 0+0io 0pf+0w %4.12 to 1.84 seconds is quite a difference! The estimated time to compute the Tanimoto for all of PubChem (but not print out the results) is now 30 seconds, which is only 4 times slower than my expected fastest possible performance (using this approach) of 8 seconds.
Of course, if I'm not going to print the values then there's no need to iterate over the return values. This iteration is slower than usual in Python because the data is stored in Numpy array, which hold C doubles, not Python floats. The iteration has to convert each C double into a Python float, which also takes time.
I removed iteration from my code so the main loop becomes:
while 1:
target_fp = f.read(128*4096) # 1024 of 1024 bits
if not target_fp:
# reached the end of file
break
tanimoto.block_similarity(query_fp, target_fp)
and the total time for my test set is 1.6 seconds,
% time python print_similarity2.py nci.binary_fp 1.033u 0.536s 0:01.57 99.3% 0+0k 0+11io 0pf+0wor an estimated 26 seconds for all of PubChem. I feel better now, as I wasn't sure why my code was so much slower than it should be. I could go back and rewrite my text so you would never see my confusion and wonderings about this, but where's the fun in that? In this way you can see some of how I got to where I did.
mmap - memory-mapped files
Earlier I mentioned that the block size was a tuneable parameter. I don't like tuneable parameters. I prefer self-tuning systems or using a better algorithm. It turns out there is another solution - a memory-mapped file. This is an operating system mechanism to treat a a file as a string in memory instead of using file operations like 'read()'. In theory I would mmap the file to a block of memory then pass the block directly to to the "block_similarity" function.
Sadly, it doesn't work as easily as I thought it would. Python has a mmap module which handles making a memory-mapped file, but the mmap object isn't compatible with the 'ctypes' interface. It seems that perhaps Python 2.6 makes this easier, but I'm not sure.
Instead I bypassed Python's mmap module and called the mmap function directly from libc, using ctypes. I ran into some difficulties, which ended up being because the "off_t" C type is a 64 bit integer, and not a 32 bit one like I expected. Getting the data types correct is one of the hard parts of using ctypes.
Here's the code for "print_similarity3.py". It's using the same libtanimoto:block_similarity function as "print_similarity2.py" but because the entire file is mapped to a string, I don't have to worry about reading a block at a time - I can pass the entire mmap'ed file to the function. (This will create in the PubChem case a 152 MByte Numpy array of doubles, which may be unexpected.)
# print_similarity3.py
import sys
import os
import mmap
import ctypes, ctypes.util
from ctypes import c_void_p, c_int, c_char, c_int64
import tanimoto
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.mmap.restype = c_void_p
libc.mmap.argtypes = (c_int, c_int, c_int, c_int, c_int, c_int64)
filename = "drugs.binary_fp"
if len(sys.argv) > 1:
filename = sys.argv[1]
# Get the query fingerprint and reset
f = open(filename, "rb")
query_fp = f.read(128) # 1024 bits
f.seek(0)
size = os.path.getsize(filename)
# mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);
ptr = libc.mmap(0, size, mmap.PROT_READ, mmap.MAP_SHARED, f.fileno(), 0)
targets = ctypes.cast(ptr, ctypes.POINTER(c_char*size))[0]
# Don't iterate and print the scores; that takes a long time
# and I want to focus on the computation part.
#for score in tanimoto.block_similarity(query_fp, targets):
# print score
tanimoto.block_similarity(query_fp, targets)
Getting timing numbers for this has been a bit tricky. The first time I ran this took a few tens of seconds as my Mac swapped pages around. I had a lot of apps using virtual memory. The second time it went faster. I've found in general that using mmap'ed files gives performance profiles that are harder to predict, so I tend to shy away from them. I also have read strange things about using them with big files (>1GB, or ast least > main memory) but I don't have the experience to judge.
On the plus side, here's the timings:
% time python print_similarity3.py nci.binary_fp 0.950u 0.318s 0:01.27 99.2% 0+0k 0+0io 0pf+0wPubChem's fingerprint file would be 2.3 GB. If the mmap access scales to that file size then I can compute all its Tanimoto values in about 20 seconds, which is is impressively fast.
Python and NumPy startup costs
My reference C code could in theory calculate Tanimoto values for PubChem in 8 seconds so my Python program timings, even with mmap, seems rather slow. Then again, these aren't computing the same things. For one thing, by simply scaling the time I'm exaggerating the overhead of starting Python. That should not be affected by the number of compounds to compute.
I changed the last few lines of the "print_similarity3.py" file so I could see how much time was spent in running "block_similarity"
... import time t1 = time.time() tanimoto.block_similarity(query_fp, targets) t2 = time.time() print >>sys.stderr, t2-t1It was 0.90 seconds out of a total time of 1.28 seconds. That means 0.4 seconds is being spend on startup costs? That's unexpected. Of that, Python's startup time is 0.05 seconds
% time python -c 'a=1' 0.014u 0.040s 0:00.05 100.0% 0+0k 0+0io 0pf+0wI added some print statements and calls to time.time() and found the time was spent in importing 'numpy'! You can see that in the following:
% time python -c 'import numpy' 0.141u 0.221s 0:00.37 97.2% 0+0k 0+0io 0pf+0wFor whatever reason, 'import numpy' also imports a huge number of modules, about half of which are submodules.
>>> import sys >>> len(sys.modules) 28 >>> import numpy >>> len(sys.modules) 225 >>> len([s for s in sorted(sys.modules) if 'numpy' in s]) 127 >>>Timing just 'libtanimoto.block_similarity' gave me 0.9 seconds. Of that, the difference between
for (i=0; i<num_bytes; i++) {
b += _popcount_counts[targets[i]];
c += _popcount_counts[query[i]&targets[i]];
}
and the code without the table lookup
for (i=0; i<num_bytes; i++) {
b += targets[i];
c += query[i]&targets[i];
}
is 0.88 seconds vs. 0.50 seconds (or estimated 14s for PubChem
vs. 7.9s). Okay, I'm convinced now that a lot time is now being spent
on the Tanimoto calculation.
I also tried replacing the "i<num_bytes" with "i<128" and it turns out that reduces the time from 0.88 seconds to 0.84 seconds (13 seconds for PubChem). I'll work on more performance optimizations in the next article.
Computing Tanimoto scores, quickly #
This is part 3 of an on-going series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
For now my driving example will be to print the Tanimoto values for each compound in a data set, in order. I get to control where the data comes from and the format that it's in, and I can precompute whatever I want about the target data set. I deliberately don't care about the identifiers, only the Tanimoto scores, so I can focus on the search algorithm. In this essay I'll focus on computing the Tanimoto between two fingerprints. The next one will compute the Tanimoto between a single query fingerprint and a large number of targets.
Converting Babel's "fpt" format to "binary_fp" format
I'll start by converting the hex fingerprints from the "fpt" file into a new binary file format, and I'll assume that there are 1024 bits in the fingerprint. This new format will be very simple. Each fingerprint takes up 1024/8 = 128 bytes, and in the same order as the input file. This is not a good format for general use. I'll talk about why later. Because it's a bad format, I'm using an ugly and long extension to the filename. Don't use this format for real work!
I decided to use Python for this conversion rather than C because while my eventual goal is to do fast Tanimoto searches, this is a preprocessing step where the performance is not critical. Plus, I usually don't like parsing in C.
Parsing the "fpt" format turns out to be a bit of a pain, in almost the same way that parsing a FASTA file is a pain. There only way to know you are done with one record is to read the first line of the next record, or get to the end of the file. I read the header line (which is different for the first record vs. the other records in the file) and merge the hex lines into a single hex fingerprint string.
(Earlier I made it easier for myself by assuming I could read the entire file into memory, but because I'm going to parse a large fingerprint file soon, I can't make that assumption now.)
After I have the fingerprint as a hex digits string I need to convert it into the raw byte values. I thought this would be tedious until I remembered there's a "hex" encoding in Python which does that for me, and which makes this step trivial.
>>> "Hello!".encode("hex")
'48656c6c6f21'
>>> "48656c6c6f21".decode("hex")
'Hello!'
>>>
Here's the conversion code, which I named "fpt_to_binary.py" for boringly obvious reasons.
# fpt_to_binary_fp.py
# Convert from an OpenBabel "fpt" format with hex fingerprints (use
# -xf to generate the fingerprints) into a binary fingerprint file.
# Fingerprints are written consecutively, with no title information or
# separator between the fingerprints. This is a very raw format and
# should not be used for anything other then benchmarking and
# bootstrapping into a better format.
# Parsing formats with no explict end-of-record marker is a nuisance.
def read_records(infile):
# Parse the header line for the first record, which looks like:
# >Strychnine 183 bits set
# I'm not splitting on the " " because the title might have a space.
# Using a regular expression proved a bit too cumbersome.
line = infile.readline()
assert line[0] == ">"
# Turn ">Strychnine 183 bits set " into
# [">Strychnine", "183", "bits", "set"]
words = line.rsplit(None, 3)
assert words[-2:] == ["bits", "set"]
title = words[0][1:]
while 1:
hex_blocks = []
for line in infile:
if line.startswith("Possible"): # "Possible superstructure"
continue
if line[:1] == ">":
break
hex_blocks.append("".join(line.split()))
else:
# End of input
yield title, "".join(hex_blocks)
break
# "line" contains a title line for the next record
yield title, "".join(hex_blocks)
# Parse the header line for the other records, which look like:
# >cocaine Tanimoto from Strychnine = 0.353234
assert line[0] == ">"
i = line.rindex("Tanimoto from")
title = line[1:i].strip()
import sys, os
input_filename = sys.argv[1]
output_filename = os.path.splitext(input_filename)[0] + ".binary_fp"
infile = open(input_filename)
outfile = open(output_filename, "wb")
for title, hexdata in read_records(infile):
#print repr(title)
outfile.write(hexdata.decode("hex"))
Does it work? I'll convert the "drugs.fpt" file and compare its contents with the newly created "drugs.binary_fp" file.
% python fpt_to_binary_fp.py drugs.fpt % head drugs.fpt >Strychnine 183 bits set 00054062 81009600 00100102 81010700 200c0000 00202850 03102000 28000404 10200882 001849e4 483c0024 50039002 1402801a 01b90000 00020540 010c100a 22a4c000 82000300 2ac02018 00002201 60102120 183c9630 2100a080 81500019 00041401 80008c00 00484810 90001080 040c8202 0006081b 24020080 a2042610 >cocaine Tanimoto from Strychnine = 0.353234 00010000 01000800 00100100 00010700 000c0050 00204010 00000000 20000000 00000100 010008c0 00200000 50019000 % od -t xC drugs.binary_fp | head 0000000 00 05 40 62 81 00 96 00 00 10 01 02 81 01 07 00 0000020 20 0c 00 00 00 20 28 50 03 10 20 00 28 00 04 04 0000040 10 20 08 82 00 18 49 e4 48 3c 00 24 50 03 90 02 0000060 14 02 80 1a 01 b9 00 00 00 02 05 40 01 0c 10 0a 0000100 22 a4 c0 00 82 00 03 00 2a c0 20 18 00 00 22 01 0000120 60 10 21 20 18 3c 96 30 21 00 a0 80 81 50 00 19 0000140 00 04 14 01 80 00 8c 00 00 48 48 10 90 00 10 80 0000160 04 0c 82 02 00 06 08 1b 24 02 00 80 a2 04 26 10 0000200 00 01 00 00 01 00 08 00 00 10 01 00 00 01 07 00 0000220 00 0c 00 50 00 20 40 10 00 00 00 00 20 00 00 00 %The "od -t cH" displays the hex values for the bytes of the "drugs.binary_fp" file. You can see that the output file is simply the input hex digits, expressed directly in binary. (You might have to stare at it for a bit.) So yes, it does work.
Computing the popcount using only Python
Onwards to computing the Tanimoto scores! This will require computing the popcount of the intersection of two fingerprints. Normal Python is really quite bad at that. I could convert the 1024 bit/128 byte fingerprint into a Python integer (like I did earlier). Computing the bit intersection between two integers is easy with the '&' operator, but finding the popcount is a bit complicated. (See below for elaboration.)
I could keep everything as a byte string. Finding the popcount of the fingerprint is the same as the sum of the popcount of each byte. There are only 256 possible bytes, so that's a small lookup table. The problem is computing the bits in common to both fingerprints. Python doesn't support bit operations on characters, so I would have to convert the bytes to a number before doing the &. This is completely doable, but there's a somewhat better solution.
I can use the 'array' module and create a byte array. This is a seldom used Python module but it seems just the thing for this case.
>>>import array
>>> array.array("b", "This is a string")
array('b', [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103])
>>>
This converted the input string into an array of bytes, stored as integers. To compute the total fingerprint popcount I'll sum the popcount for each byte in the fingerprint. I can get that by looking it up in a table of precomputed values for each possible byte. To compute the number of bits in common between the two fingerprints I'll compute the sum of the number of bits in common between the corresponding bytes in the fingerprints.
import array
# Create a lookup table from byte value to 'on' bit count
# chr(0) has 0 'on' bits
# chr(1) has 1 'on' bit
# chr(2) has 1 'on' bit
# chr(3) has 2 'on' bits
# ...
# chr(255) has 8 'on' bits
popcount_in_byte = (
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
)
fp = array.array('b', [84, 104, 105, 115, 32, 105, 115, 32,
97, 32, 115, 116, 114, 105, 110, 103])
print "popcount =", sum(popcount_in_byte[byte] for byte in fp)
If you copy and past that into a Python terminal (or save it and run
it), you'll get:
popcount = 57which is the number of 'on' bits in the string "This is a string". To verify that, here's a completely different (and slower) method based on simple bit counting.
>>> "This is a string".encode("hex")
'54686973206973206120737472696e67'
>>> int("This is a string".encode("hex"), 16)
112197289293498629157862941796076187239L
>>> x = int("This is a string".encode("hex"), 16)
>>> popcount = 0
>>> while x:
... if x & 1:
... popcount += 1
... x >>= 1
...
>>> popcount
57
>>>
The two methods agree, so it's very likely that the code works.
Although there could still be mistakes in the lookup table.
Computing the Tanimoto using only Python
Now that I have a popcount it's easy to compute the Tanimoto. In the following program I'll print the Tanimoto for every fingerprint in the specified binary_fp file, using the first fingerprint as the query fingerprint. I'll include computing the Tanimoto of the query fingerprint against itself.
# compute_tanimoto.py
import sys
import array
# Create a lookup table from byte value to 'on' bit count
# chr(0) has 0 'on' bits
# chr(1) has 1 'on' bit
# chr(2) has 1 'on' bit
# chr(3) has 2 'on' bits
# ...
# chr(255) has 8 'on' bits
popcount_in_byte = (
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
)
def tanimoto(query, target):
a = b = c = 0
for query_byte, target_byte in zip(query, target):
a += popcount_in_byte[query_byte]
b += popcount_in_byte[target_byte]
c += popcount_in_byte[query_byte & target_byte]
# I'll define that if neither fingerprint has any
# set bits (so a == b == 0) then the Tanimoto is 0.
# Otherwise the equation will try to compute 0/0
# and Python will throw an exception.
if not c:
return 0
return float(c)/(a+b-c)
if len(sys.argv) > 1:
filename = sys.argv[1]
else:
filename = "drugs.binary_fp"
# Use the first fingerprint as the query
infile = open(filename, "rb")
s = infile.read(1024//8) # 128 bytes in a fingerprint
query = array.array("b", s)
# Reopen and compute the Tanimoto against all fingerprints
# including itself.
infile = open(filename, "rb")
while 1:
s = infile.read(1024//8)
if not s:
# End of file
break
target = array.array("b", s)
print tanimoto(query, target)
The output from this program on the previously created "drugs.binary_fp" file is:
% python compute_tanimoto.py drugs.binary_fp 1.0 0.353233830846 0.330316742081 0.401486988848 0.332046332046 0.303571428571 0.317406143345 0.22009569378 0.124444444444 0.1 %How do I verify those are correct? OpenBabel also computed the Tanimoto scores when it generated the ".fpt" file. You can go back and compare (as I did) and see that they agree. The only difference is that I'm printing the results to double precision while OpenBabel prints them to single precision. Since the denominator is at best 1024, single precision is more than accurate enough. In fact, double implies an accuracy that isn't warranted.
Can I do this faster without writing C code?
There are ways to make the popcount code go faster using standard Python or available 3rd party extensions. I'll review some of them but won't use them outside this section.
The standard Python distribution is not a good platform for doing fast Tanimoto calculations. There is no builtin function for it and while I can build them they are all relatively slow. That's why I'll shortly develop a C extension for doing this calculation. Before doing that, just what other options exist?
I could convert the 1024 bit/128 byte fingerprint into a Python integer. The first problem is converting the raw fingerprint bytes into an integer. I don't know a direct way, but int(s.encode("hex"), 16) works. Computing the bit intersection is simple - just use "&". The problem is computing the popcount.
The simple popcount algorithm is to see if the number has the 1 bit set, then the 2 bit, then the 4 bit, ...
# This computes about 14,000 popcounts per second
all_bits = [1<<bitno for bitno in range(1024)]
def popcount_1024(x):
n = 0
for bit in all_bits:
if x & bit:
n += 1
return n
That's really quite slow. I could use a clever bit-manipulation
algorithm to calculate it, which I extended from the Wikipedia article
on Hamming
weight. Here it is:
# Compute the popcount of a Python long, up to 1024 bits.
# My MacBook Pro only does about 47,000 per second.
m1 = int("0x" + "5" * (1024//4), 16)
m2 = int("0x" + "3" * (1024//4), 16)
m3 = int("0x" + "0f" * (1024//8), 16)
m4 = int("0x" + "00ff" * (1024//16), 16)
m5 = int("0x" + "0000ffff" * (1024//32), 16)
m6 = int("0x" + ("0"*8+"f"*8) * (1024//64), 16)
m7 = int("0x" + ("0"*16+"f"*16) * (1024//128), 16)
m8 = int("0x" + ("0"*32+"f"*32) * (1024//256), 16)
m9 = int("0x" + ("0"*64+"f"*64) * (1024//512), 16)
m10 = int("0x" + ("0"*128+"f"*128), 16)
def popcount_1024(x):
x = (x & m1 ) + ((x >> 1) & m1 )
x = (x & m2 ) + ((x >> 2) & m2 )
x = (x & m3 ) + ((x >> 4) & m3 )
x = (x & m4 ) + ((x >> 8) & m4 )
x = (x & m5 ) + ((x >> 16) & m5 )
x = (x & m6 ) + ((x >> 32) & m6 )
x = (x & m7 ) + ((x >> 64) & m7 )
x = (x & m8 ) + ((x >> 128) & m8 )
x = (x & m9 ) + ((x >> 256) & m9 )
x = (x & m10) + ((x >> 512) & m10)
return x
By being more clever I could make this a bit faster, but I'm not even
going to try.
During proofreading I remembered another way to make the popcount of the and-ed fingerprints be faster - I could use 16 bit integers for everything and look up the popcounts in a 16 bit lookup table. But this doesn't work that well because Python doesn't have a 16 bit array type. You would have to use one of the Numpy/numeric/numarray arrays.
Another way to compute the popcount of an integer is to use Scott David Daniels' 'bits' module. It's a C extension and is quite fast. I used it a couple of years ago when I was doing some experiments with fingerprint searches.
If you're going to use a third-party C extension then you might instead consider using GMPY. It's based on the GNU Multiple Precision Library and includes a popcount function. Digging in the source code I see that the popcount function is implemented in hand-crafted assembly for different platforms.
Yet another possibility for computing the popcount of a fingerprint is to convert the integer into hex string, or generate a binary pickle string, and work with the resulting string using the same lookup table approach I did earlier. This has the advantage of working with any length Python integer, but has a lot of overhead.
# Compute the popcount of a Python long
# For 1024 bit fingerprints my MacBook Pro does about
# 24,000 per second.
popcount_lookup_table = {}
for i, n in enumerate( (
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
) ):
popcount_lookup_table[chr(i)] = n
# "pickle" does 15,000/second
# "cPickle" does 24,000/second
import cPickle as pickle
def popcount(x):
s = pickle.dumps(x, 2)[4:-1]
L = popcount_lookup_table
return sum(L[c] for c in s)
Or I could stick with the raw byte string. The hard part there was computing the number of bits in common between two characters. The easiest solution is to convert the characters into integers, do the bitwise-and, and convert back to a character, but that's a lot of Python function calls for something that's a single assembly instruction. I could perhaps lookup table, where the key in the table is (query_byte, target_byte) and the value is the number of bits in common. I don't like that solution. For one, it requires 256*256=65,536 entries.
Computing the Tanimoto in C, using GCC extensions
Computing the Tanimoto score is much easier in C, and even easier with gcc because the latter includes a built-in function for computing the popcount of an integer, a long, and a long long. I'll use gcc for now so I don't need to define a popcount just yet. (I'll do that in the next section.)
In the following I assume that integers are 32 bits in length, so there are 32 integer parts to a 1024 bit fingerprint. I'll write each unsigned integer in hex, so the number starts with "0x" (to mean "hex") and ends with "u" (to mean "unsigned").
/* simple_tanimoto.c */
#include <stdio.h>
/* This uses a non-portable builtin function for gcc */
#define popcount(x) __builtin_popcount(x)
unsigned int query[] = {
/* >Strychnine 183 bits set */
0x00054062u, 0x81009600u, 0x00100102u, 0x81010700u, 0x200c0000u, 0x00202850u,
0x03102000u, 0x28000404u, 0x10200882u, 0x001849e4u, 0x483c0024u, 0x50039002u,
0x1402801au, 0x01b90000u, 0x00020540u, 0x010c100au, 0x22a4c000u, 0x82000300u,
0x2ac02018u, 0x00002201u, 0x60102120u, 0x183c9630u, 0x2100a080u, 0x81500019u,
0x00041401u, 0x80008c00u, 0x00484810u, 0x90001080u, 0x040c8202u, 0x0006081bu,
0x24020080u, 0xa2042610u};
unsigned int target[] = {
/* >cocaine Tanimoto from Strychnine = 0.353234 */
0x00010000u, 0x01000800u, 0x00100100u, 0x00010700u, 0x000c0050u, 0x00204010u,
0x00000000u, 0x20000000u, 0x00000100u, 0x010008c0u, 0x00200000u, 0x50019000u,
0x1c00801au, 0x01890000u, 0x00000000u, 0x02081008u, 0x00a20000u, 0x02000000u,
0x01c02010u, 0x00000201u, 0x00000020u, 0x18020a30u, 0x01000080u, 0x00500014u,
0x00001401u, 0x80002000u, 0x00088000u, 0x10001000u, 0x00000200u, 0x0006000eu,
0xc0020000u, 0x02002200u};
int main(void) {
int A=0, B=0, c=0;
int i;
for (i=0; i<32; i++) {
/*printf(" %lx %lx\n", fp1[i], fp2[i]);*/
A += popcount(query[i]);
B += popcount(target[i]);
c += popcount(query[i]&target[i]);
}
/* Should print "183 89 71 -> 0.353234" */
printf("%d %d %d -> %f\n", A, B, c, ((double)c)/(A+B-c));
}
I'll compile and run it:
% cc simple_tanimoto.c -O3 -o simple_tanimoto % ./simple_tanimoto 183 89 71 -> 0.353234 %so you can see that it gave the same Tanimoto score that OpenBabel computed.
To me this bit of C is much easier to understand than the same Python code. Part of the reason is that I'm using the __builtin_popcount of gcc but the real reason is that in C things like byte arrays, integer arrays, and the bytes and bits in an integer are very fluid. It's very easy to go from one form to another. Even if I have to implement the bitcount myself, with the following bit of code, I can compute over 1 million Tanimoto similarities per second. which is more than 20 times faster than my fastest Python code. Note though that this computing score between the same pair. Computing Tanimoto scores for a large number of targets will be slower because bringing that data to the CPU takes additional time, but then there are other way to make the performance be better. I'll talk about this in a bit.
Computing the Tanimoto in C, without GCC extensions
Here's the equivalent C code using an 8-bit lookup table to compute the popcount for a 32 bit integer.
/* simple_tanimoto2.c */
#include <stdio.h>
/* Implement popcount(unsigned int) myself. */
/* Source based on http://popcnt.org/2007/09/magic-popcount-popcnt-command.html */
const int _popcount_counts[] = {
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};
/* This assumes that 'unsigned int' is 32 bits */
int popcount(unsigned int x) {
return (_popcount_counts[x & 0xff] +
_popcount_counts[x >> 8 & 0xff] +
_popcount_counts[x >> 16 & 0xff] +
_popcount_counts[x >> 24 & 0xff]);
}
unsigned int query[] = {
/* Strychnine 183 bits set */
0x00054062u, 0x81009600u, 0x00100102u, 0x81010700u, 0x200c0000u, 0x00202850u,
0x03102000u, 0x28000404u, 0x10200882u, 0x001849e4u, 0x483c0024u, 0x50039002u,
0x1402801au, 0x01b90000u, 0x00020540u, 0x010c100au, 0x22a4c000u, 0x82000300u,
0x2ac02018u, 0x00002201u, 0x60102120u, 0x183c9630u, 0x2100a080u, 0x81500019u,
0x00041401u, 0x80008c00u, 0x00484810u, 0x90001080u, 0x040c8202u, 0x0006081bu,
0x24020080u, 0xa2042610u};
unsigned int target[] = {
/* cocaine Tanimoto from Strychnine = 0.353234 */
0x00010000u, 0x01000800u, 0x00100100u, 0x00010700u, 0x000c0050u, 0x00204010u,
0x00000000u, 0x20000000u, 0x00000100u, 0x010008c0u, 0x00200000u, 0x50019000u,
0x1c00801au, 0x01890000u, 0x00000000u, 0x02081008u, 0x00a20000u, 0x02000000u,
0x01c02010u, 0x00000201u, 0x00000020u, 0x18020a30u, 0x01000080u, 0x00500014u,
0x00001401u, 0x80002000u, 0x00088000u, 0x10001000u, 0x00000200u, 0x0006000eu,
0xc0020000u, 0x02002200u};
int main(void) {
int A=0, B=0, c=0;
int i;
for (i=0; i<sizeof(query)/sizeof(unsigned int); i++) {
/*printf(" %lx %lx\n", fp1[i], fp2[i]);*/
A += popcount(query[i]);
B += popcount(target[i]);
c += popcount(query[i]&target[i]);
}
/* Should print "183 89 71 -> 0.353234" */
printf("%d %d %d -> %f\n", A, B, c, ((double)c)/(A+B-c));
}
Computing the pairwise Tanimoto in C is fast. I put the Tanimoto calculation in a loop and calculated it 100,000,000 times, which took 42 seconds. The last time I looked, PubChem contained 19,218,991 compounds. Assuming everything was available in fast memory then I should be able to search it in 8 seconds using a C program, or about 7 minutes using Python. Which would you prefer?
This is interesting. Doing the popcount using a lookup table took 43 seconds. Using __builtin_popcount took 109 seconds. I was not expecting that.
Computing the Tanimoto in Python via a C extension
I prefer the speed of C but the general ease of programming using Python. (That's a sort of "have my cake and eat it too" answer.) Using Python for Tanimoto searches is slow in part because there is no fast way to compute the popcount. I can fix that by writing a Python extension using C. Rather than add only the popcount I'll have the extension compute the entire Tanimoto between two fingerprints.
The easiest way to pass fingerprints around in Python is as string, and more specifically a byte string. (As compared to a Unicode string.) I want my Python code to work like:
import tanimoto query = "... 128 bytes ..." target = "... 128 bytes ..." print "Tanimoto score is", tanimoto.similarity(query, target)There are many ways to do this. Some ways I'm not going to use now are: writing C extension directly, or using Boost Python, SWIG or Pyrex and its offshoot Cython. Instead, I'll use ctypes.
Ctypes is a "foreign function interface" ("FFI") package for Python. It lets Python load a shared library and call C functions directly. You still have to write some interface code to tell ctypes how to call the function, and perhaps write some extra error checking code, but not much.
I find ctypes to be the easiest way to get Python to talk to C code, but there are a few downsides. Calling a function through ctypes has a bit more overhead (it's a bit slower) than using a special-purpose compiled interface like the other solutions use. It's also very easy to crash Python using ctypes.
But the advantage is that I can write C code without worrying about writing the Python/C interface in C. I do have to worry, but I'll worry about it from Python. Here's a simple C library to compute the Tanimoto similarity between two fingerprints as stored in unsigned byte strings.
/* libtanimoto.c */
static const int _popcount_counts[] = {
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};
double similarity(int num_bytes,
unsigned char *query, unsigned char *target) {
int a=0, b=0, c=0, i;
for (i=0; i<num_bytes; i++) {
a += _popcount_counts[query[i]];
b += _popcount_counts[target[i]];
c += _popcount_counts[query[i]&target[i]];
}
return ((double)c)/(a+b-c);
}
Very simple, yes? Next, I need to compile it. Again there are many
possible solutions. I'll use SCons which one of several
replacements for make. Instead of using a "Makefile" it uses a
configuration file named "SConstruct". The configuration language is
Python, extended with functions that know how to build binaries,
shared libraries, and other targets. This is very nice because
building shared libraries across multiple platforms is a nasty mess
and I don't want to do it myself.
Here's my 'SConstruct' file:
# SConstruct
SharedLibrary("libtanimoto", ["libtanimoto.c"],
CFLAGS="-O3")
I compiled it using the "scons" command:
% scons scons: Reading SConscript files ... scons: done reading SConscript files. scons: Building targets ... gcc -o libtanimoto.os -c -O3 -fPIC libtanimoto.c gcc -o libtanimoto.dylib -dynamiclib libtanimoto.os scons: done building targets.The result is a "libtanimoto.dylib", which is what Macs use as a run-time loadable shared library. Linux and other Unix OSes use the extension ".so" and MS Windows uses ".dll".
Using ctypes to access the shared library
To load the function in ctypes I first load the shared library (remember, the extension depends on which OS you are using - 'find_library' handles that detail) then get the function from the module. I then have to annotate the function so ctypes knows how to convert the Python call arguments to the right C types and how to convert the C result into Python. Here's an example of me doing that by hand, with a fingerprint of 4 bytes:
>>> import ctypes, ctypes.util
>>> libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))
>>> libtanimoto.similarity
<_FuncPtr object at 0x88c00>
>>> libtanimoto.similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
>>> libtanimoto.similarity.restype = ctypes.c_double
>>> libtanimoto.similarity(4, "\0\0\0\0", "\0\0\0\0")
nan
>>> libtanimoto.similarity(4, "\0\0\0\0", "\0\0\0\1")
0.0
>>> libtanimoto.similarity(4, "\0\0\0\1", "\0\0\0\1")
1.0
>>> libtanimoto.similarity(4, "\0\0\0\1", "\0\0\0\3")
0.5
>>>
If I know the number of bytes in the fingerprint then I can call this
interface directly. Most people would prefer a simpler solution and
not have to worry about passing in the length every time. After all,
Python strings already know their length.
Making a simplified interface
What I'll do is create a new Python module in the file "tanimoto.py". This will load the shared library and do the ctypes annotation. Once done I'll define a "high-level" function also called "similarity" which gets the fingerprint lengths, checks that they are the same, and calls the underlying C similarity implementation.
# tanimoto.py - wrapper to the libtanimoto shared library
import ctypes, ctypes.util
# Load the shared library (the extension depends on the OS type)
libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))
# Annotate the function call argument and result types
_similarity = libtanimoto.similarity
_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
_similarity.restype = ctypes.c_double
# Define the high-level public interface as a Python function
def similarity(query, target):
n = len(query)
m = len(target)
if n != m:
raise AssertionError("fingerprints must be the same length")
return _similarity(n, query, target)
In action:
>>> import tanimoto
>>> tanimoto.similarity("Andrew", "andrew")
0.95999997854232788
>>> tanimoto.similarity("Andrew", "ANDREW")
0.79166668653488159
>>> tanimoto.similarity("Andrew", "123456")
0.40625
>>> tanimoto.similarity("Andrew", "13456")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "tanimoto.py", line 23, in similarity
m = len(target)
AssertionError: fingerprints must be the same length
>>>
Looks good. But does it give the right answer? I'll give it a test
case where we know the right answer and we'll see what it says:
>>> s1 = "".join("""
... 00054062 81009600 00100102 81010700 200c0000 00202850
... 03102000 28000404 10200882 001849e4 483c0024 50039002
... 1402801a 01b90000 00020540 010c100a 22a4c000 82000300
... 2ac02018 00002201 60102120 183c9630 2100a080 81500019
... 00041401 80008c00 00484810 90001080 040c8202 0006081b
... 24020080 a2042610
... """.split()).decode("hex")
>>> s2 = "".join("""
... 00010000 01000800 00100100 00010700 000c0050 00204010
... 00000000 20000000 00000100 010008c0 00200000 50019000
... 1c00801a 01890000 00000000 02081008 00a20000 02000000
... 01c02010 00000201 00000020 18020a30 01000080 00500014
... 00001401 80002000 00088000 10001000 00000200 0006000e
... c0020000 02002200
... """.split()).decode("hex")
>>>
>>> import tanimoto
>>> tanimoto.similarity(s1, s2)
0.35323384404182434
>>>
Sweet! The right answer! Though we're display it to double precision
while OpenBabel only reports to float precision. That's fine -
there's only 1024 bits so the resolution of the answer can't be better
than 1/1024.
The high-level Python wrapper interface is easier to use than calling the C interface exposed by ctypes. This will usually be the case. C and Python have different world views. C libaries tend to optimize towards performance and require that programmers think more about all the details, while Python libraries tend be safer and easier to use, but slower. I'm hoping to show that it's possible to get both by mixing the two together.
Performance timings for the C extension
I also did some performance timings. The public high-level interface can compute Tanimoto scores about 340,000 fingerprints (of 1024 bits each) per second. That's about 15% the speed of the native C code but about 7 times faster than my pure Python solution. Remember, the high-level interface does extra work and then calls the low-level ctypes function. If I remove that extra work then I can do about 440,000 calculations per second.
Here's how I got those numbers:
# Time the tanimoto.similarity function calls
import tanimoto
import time
def go(n):
range_n = range(n)
s = "\1"*128
similarity = tanimoto.similarity
t1 = time.time()
for i in range_n:
similarity(s, s)
t2 = time.time()
print "Using public interface", float(n)/(t2-t1), "calls/sec"
_similarity = tanimoto._similarity
t1 = time.time()
for i in range_n:
_similarity(128, s, s)
t2 = time.time()
print "Using ctypes interface", float(n)/(t2-t1), "calls/sec"
>>> go(100000) Using public interface 341912.926942 calls/sec Using ctypes interface 446837.633061 calls/sec >>> go(1000000) Using public interface 340170.861425 calls/sec Using ctypes interface 443564.430149 calls/sec >>>One nice thing about having a benchmark in-place is I can test alternate implementations and see if I get better performance. My Tanimoto calculation uses fingerprint bit counts 'a', 'b', and 'c', based on the Daylight usage. This has general applicability because other similarity measures can be described as a function of those variables. The Tanimoto calculation has a simpler form, which is the number of on bits in the bitwise-and of the two fingerprints divided by the number of on bits in their bitwise-or. It's is succienctly written as:
double similarity(int num_bytes,
unsigned char *query, unsigned char *target) {
int and_count=0, or_count=0, i;
for (i=0; i<num_bytes; i++) {
or_count += _popcount_counts[query[i] | target[i]];
and_count += _popcount_counts[query[i] & target[i]];
}
return ((double)and_count)/or_count;
}
Doing a bitwise-or is faster than doing another popcount. My
benchmark became about 1% to 2% faster with this variation. That's
almost in the noise, but useful to note if you want to eek out a bit
more performance.
Generating molecular fingerprints with OpenBabel #
This is part 2 of an on-going series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
I am not yet going to discuss implementation details of how to generate molecular fingerprints. I'll assume you have access to a program which generates them for you. This might be a commercial program like from Daylight or from Mesa Analytics. For those following along from home (or don't have license for those, or want to dig into the step-by-step details of how to handle one of those program), install OpenBabel. It implements three different fingerprint types:
- FP2 (default) - linear fragments of length 1 to 7 (with some exceptions) using a hash code generating bits 0≤bit#<1021
- FP3 - functional groups based on 55 SMARTS patterns (see "patterns.txt")
- FP4 - functional groups based on 307 SMARTS patterns (see "SMARTS_InteLigand.txt")
Why are there two different file formats?
The FP3 and FP4 fingerprints are substructure fingerprints based on SMARTS patterns listed in text files. Sadly, while I think the two files should be in the same format, they aren't. The first looks like:
[+] 1 [-] 2 [#6][CX3](=O) 3 aldehyde or ketone [CX3H1](=O)[#6] 4 aldehyde [#6][CX3](=O)[#6] 5 ketonewhile the second looks like:
# I.1.1 Alkanes: Primary_carbon: [CX4H3][#6] Secondary_carbon: [CX4H2]([#6])[#6] Tertiary_carbon: [CX4H1]([#6])([#6])[#6] Quartary_carbon: [CX4]([#6])([#6])([#6])[#6]
In looking through the source I found this comment in the method "ReadPatternFile" in the "fingerprints/finger3.cpp".
// Christian Laggner's format: SMARTS at end of lineThis is the code which parses the FP3 and FP4 formats. (Yes, even though it's named 'finger3.cpp' it also parses the FP4 format.) Specially this comment comes from the section which figures out the specific file format. My most likely conjecture is that the FP3 format existed then someone decided to add the FP4 definitions, using SMARTS patterns that someone else defined.
There's a tradeoff here: should code be changed to fit data or data changed to fit code? How malleable is code, and how much should you change someone else's data? In this case the implementer decided the easiest solution was to keep the original format unchanged; perhaps because it would be easier to import updates in the future. I would have reformated the file to fit the expected format, perhaps keeping the conversion program around to handle any updates.
Is it better to implement more features but haphazardly, or fewer features but more cleanly? I'm biased much futher towards the second than most people seem to be, and I think that the combined solution of "more features and cleanly" is generally posssible. Perhaps scientific software is worse than average because most people's explicit goal is to get some scientific project done, and not to support the needs of unknown others in the future. Or because most scientists aren't trained programmers? Or perhaps I have an unrealistic expectation of how green the grass might be on the other side of the fence.
Generating fingerprints with Babel
Babel generates two types of fingerprint formats. One is an index file used to store precomputed fingerprints when doing a substructure search. It's a binary file format that's a bit too complex to talk about here. The other is, I'm told, more of a debugging aid that saves the fingerprints as hexadecimal strings. It's much easier to show and understand so I'll use it.
I happen to have a small set of pharmacologically popular compounds that I got from JJ some years back. It was for an example data set he put together while at Daylight.
N12CCC36C1CC(C(C2)=CCOC4CC5=O)C4C3N5c7ccccc76 Strychnine c1ccccc1C(=O)OC2CC(N3C)CCC3C2C(=O)OC cocaine COc1cc2c(ccnc2cc1)C(O)C4CC(CC3)C(C=C)CN34 quinine OC(=O)C1CN(C)C2CC3=CCNc(ccc4)c3c4C2=C1 lyseric acid CCN(CC)C(=O)C1CN(C)C2CC3=CNc(ccc4)c3c4C2=C1 LSD C123C5C(O)C=CC2C(N(C)CC1)Cc(ccc4O)c3c4O5 morphine C123C5C(OC(=O)C)C=CC2C(N(C)CC1)Cc(ccc4OC(=O)C)c3c4O5 heroin c1ncccc1C1CCCN1C nicotine CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin aI'll convert that to the text fingerprint format ".fpt" file using FP2-style fingerprints. Here I'll specify "-xfFP2" but that's optional and if I left it out I would still get the FP2 fingerprints. The "-xh" option says to include the hex fingerprints in the output file:
% babel drugs.smi drugs.fpt -xh -xfFP2 10 molecules converted 192 audit log messages % cat drugs.fpt >Strychnine 183 bits set 00054062 81009600 00100102 81010700 200c0000 00202850 03102000 28000404 10200882 001849e4 483c0024 50039002 1402801a 01b90000 00020540 010c100a 22a4c000 82000300 2ac02018 00002201 60102120 183c9630 2100a080 81500019 00041401 80008c00 00484810 90001080 040c8202 0006081b 24020080 a2042610 >cocaine Tanimoto from Strychnine = 0.353234 00010000 01000800 00100100 00010700 000c0050 00204010 00000000 20000000 00000100 010008c0 00200000 50019000 1c00801a 01890000 00000000 02081008 00a20000 02000000 01c02010 00000201 00000020 18020a30 01000080 00500014 00001401 80002000 00088000 10001000 00000200 0006000e c0020000 02002200 >quinine Tanimoto from Strychnine = 0.330317 00000002 09000808 2010410a 40010600 000c1080 0000c000 020c0000 20000400 00400980 001808c8 08300004 50018000 98000806 00e00000 01080441 000c000c 10000000 02000000 00612000 00020201 40000100 18208038 23800000 81500010 00002000 80000000 00520000 90000000 04000202 00060002 64060000 00002620 >lyseric acid Tanimoto from Strychnine = 0.401487 00084212 29008208 41901b02 00010600 22010024 00002040 830800c0 a400040b 104008c0 285828d8 08300204 71c19008 08029102 08311800 00063446 420c034a 20248600 02080140 0a80201a 00800203 44900504 58200a20 2


