Testing the bitslice algorithm for popcount
This is part 7 of an ongoing series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
- 2011 update: Faster population counts
I want to compute the popcount of molecular fingerprints containing 1024 bits. My algorithm has been to compute the popcount 8, 16, 32, or 64 bits at a time and repeat that for all of the words in the fingerprint. There are a couple of algorithms which compute the popcount of multiple words in parallel by doing a partial popcount of each word and combining the intermediate results to get the result over all words.
Cédric Lauradoux wrote a nice overview of computing the Hamming weight, including explanations of the tree-merging and bit-slicing algorithms. In personal email he pointed out this page from comp.arch.arithmetic, where someone attributes the algorithm to Robert Harley.
It took me a while to understand it; it's been about 17 years since I had to think much about half-adders. It's not hard, but I'm not going to explain it here. It's best done in pictures, and that would take me too long. You can follow those links instead.
While he hasn't finished the part of the overview explaining how to combine the two algorithms, his Hamming weight page does include source code.
I've added two bit-slicing implementations to my popcnt.c benchmark. The algorithm from comp.arch.arithmetic processes 7 words at a time and the one from Cédric processes 24 words at a time. Both fall back to other methods for the last few words when the number of words isn't an exact multiple of 7 or 24.
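For reference, here is one way the 7-words-at-a-time variant can be written using that full adder, with a word-at-a-time fallback for the leftover words. This is my own sketch of the technique, not the code from either source:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise full adder: pop(a) + pop(b) + pop(c) == 2*pop(h) + pop(l). */
static void csa(uint32_t *h, uint32_t *l,
                uint32_t a, uint32_t b, uint32_t c) {
    uint32_t u = a ^ b;
    *l = u ^ c;
    *h = (a & b) | (u & c);
}

/* Simple reference popcount for the partial results and the fallback. */
static unsigned pop(uint32_t x) {
    unsigned n = 0;
    for (; x; x >>= 1) n += x & 1;
    return n;
}

/* Bit-sliced popcount, 7 words at a time. */
unsigned popcount_7words(const uint32_t *d, size_t n) {
    unsigned total = 0;
    size_t i = 0;
    for (; i + 7 <= n; i += 7) {
        uint32_t ones, twos, fours, twosA, twosB, twosC;
        csa(&twosA, &ones, d[i],  d[i+1], d[i+2]);
        csa(&twosB, &ones, ones,  d[i+3], d[i+4]);
        csa(&twosC, &ones, ones,  d[i+5], d[i+6]);
        csa(&fours, &twos, twosA, twosB,  twosC);
        /* Each set bit in 'fours' represents 4 input bits, in 'twos' 2. */
        total += 4 * pop(fours) + 2 * pop(twos) + pop(ones);
    }
    for (; i < n; i++)  /* fallback for the last n % 7 words */
        total += pop(d[i]);
    return total;
}
```

Only 3 full popcounts are needed per 7 input words; the rest is a handful of AND/XOR/OR operations, which is where the speedup comes from.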
Cédric implemented several algorithms in his code distribution and included programs to compare their performance. The results suggested that the 24-word bitslice algorithm would be twice as fast as the 16-bit lookup table. I implemented it and found that while it was fast, and on my laptop even faster than the 16-bit lookup table, it was only about 10% faster, not a factor of 2.
% sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM)2 CPU T7600 @ 2.33GHz
% gcc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5367)
 ... deleted ...
% cc -O3 popcnt.c
% ./a.out
FreeBSD version 1 :  3854904 us; cnt=32511665
FreeBSD version 2 :  3517953 us; cnt=32511665
16-bit LUT        :  2127718 us; cnt=32511665   was the fastest
8-bit LUT         :  5317244 us; cnt=32511665
8-bit LUT v2      :  3762467 us; cnt=32511665
Wikipedia #2      :  5493756 us; cnt=32511665
Wikipedia #3      :  5295541 us; cnt=32511665
gcc popcount      :  6242596 us; cnt=32511665
gcc popcountll    : 10402364 us; cnt=32511665
HAKMEM 169/X11    :  4305536 us; cnt=32511665
Bitslice(7)       :  2077473 us; cnt=32511665
Bitslice(24)      :  1843217 us; cnt=32511665   fastest
naive             : 23460323 us; cnt=32511665
Wegner/Kernigan   : 14976014 us; cnt=32511665
Anderson          : 63971600 us; cnt=32511665
8x shift and add  : 21728414 us; cnt=32511665
32x shift and add : 19747808 us; cnt=32511665
I dug around for a while and, following Toby's comment, tried Shark as a profiler. Things made more sense once I remembered to add the '-g' option to gcc so that Shark could line up the C code and the assembly side by side. My conclusion is that Cédric's code has extra function call overhead: his 16-bit LUT function is called once for every 32-bit word, while the 24-word bitslice function is only called once every 24 words. My code doesn't have that per-word overhead, which is why my times are closer to each other.
As before, I wondered how much "-m64" affects things. I'll declare it a tie between the new bitslice code and the old 16-bit lookup table. (The numbers aren't identical, but the difference is about the same as the variation I see between runs of the same algorithm.)
% cc -O3 popcnt.c -m64
% ./a.out
FreeBSD version 1 :  3902385 us; cnt=32511665
FreeBSD version 2 :  3422655 us; cnt=32511665
16-bit LUT        :  1621557 us; cnt=32511665   fastest
8-bit LUT         :  2070320 us; cnt=32511665
8-bit LUT v2      :  2094362 us; cnt=32511665
Wikipedia #2      :  2135130 us; cnt=32511665
Wikipedia #3      :  2142576 us; cnt=32511665
gcc popcount      :  8128024 us; cnt=32511665
gcc popcountll    :  4087390 us; cnt=32511665
HAKMEM 169/X11    :  5805272 us; cnt=32511665
Bitslice(7)       :  1728262 us; cnt=32511665
Bitslice(24)      :  1698257 us; cnt=32511665   fastest
naive             : 24990997 us; cnt=32511665
Wegner/Kernigan   : 16864887 us; cnt=32511665
Anderson          : 12463648 us; cnt=32511665
8x shift and add  : 21715482 us; cnt=32511665
32x shift and add : 20003160 us; cnt=32511665
All of this was on one processor and one compiler. I decided to try another machine with a slightly different processor and an earlier version of gcc.
%sysctl -w hw.model
hw.model: Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz
%gcc --version
gcc (GCC) 3.4.6 [FreeBSD] 20060305
 ...
%cc popcnt.c -O3
%./a.out
FreeBSD version 1 :  4270677 us; cnt=32511665
FreeBSD version 2 :  3785047 us; cnt=32511665
16-bit LUT        :  2034836 us; cnt=32511665
8-bit LUT         :  1935156 us; cnt=32511665   fastest
8-bit LUT v2      :  1987370 us; cnt=32511665
Wikipedia #2      :  5010639 us; cnt=32511665
Wikipedia #3      :  4931266 us; cnt=32511665
gcc popcount      :  5793299 us; cnt=32511665
gcc popcountll    :  6815575 us; cnt=32511665
HAKMEM 169/X11    : 10788366 us; cnt=32511665
Bitslice(7)       :  2467150 us; cnt=32511665
Bitslice(24)      :  2469399 us; cnt=32511665
naive             : 27722786 us; cnt=32511665
 ... too slow to worry about ...

In this case I would use the 8-bit lookup table, which is about 20% faster than the bitslice code.
Here are the results on an older 2.80GHz Pentium 4 (Prescott) CPU with gcc 3.4.6:
FreeBSD version 1 :   3745509 us; cnt=32500610
FreeBSD version 2 :   3774150 us; cnt=32500610
16-bit LUT        :   2762583 us; cnt=32500610
8-bit LUT         :   2430362 us; cnt=32500610   fastest
8-bit LUT v2      :   2433684 us; cnt=32500610
Wikipedia #2      :  11509758 us; cnt=32500610
Wikipedia #3      :   8540921 us; cnt=32500610
gcc popcount      :   9100055 us; cnt=32500610
gcc popcountll    :  10568664 us; cnt=32500610
HAKMEM 169/X11    :  10970609 us; cnt=32500610
Bitslice(7)       :   4226871 us; cnt=32500610
Bitslice(24)      :   3700020 us; cnt=32500610
naive             :  36117662 us; cnt=32500610
Wegner/Kernigan   :  46479326 us; cnt=32500610
Anderson          : 174285325 us; cnt=32500610
8x shift and add  :  90028247 us; cnt=32500610
32x shift and add :  86390410 us; cnt=32500610

Bitslice(24) is still the fastest bit-twiddling solution, but quite a bit slower than the 8-bit lookup table. There is no single best solution. The results are also compiler dependent, so I'm installing gcc 4.2 to see how that affects things. That'll probably take a few hours, so it's time to sleep.
Okay, Fink installed gcc 4.2.2 (the most recent is 4.3.1, but I decided it wasn't worth the time to compile that myself). Here are the numbers on my laptop for 32-bit:
% gcc-4 -O3 popcnt.c
% ./a.out
FreeBSD version 1 :  4161621 us; cnt=32511665
FreeBSD version 2 :  2930977 us; cnt=32511665   was 3517953
16-bit LUT        :  2534380 us; cnt=32511665   was 2127718
8-bit LUT         :  4472733 us; cnt=32511665
8-bit LUT v2      :  4499285 us; cnt=32511665   was 3762467
Wikipedia #2      :  5031855 us; cnt=32511665
Wikipedia #3      :  3874931 us; cnt=32511665   was 5295541
gcc popcount      :  4808307 us; cnt=32511665
gcc popcountll    :  5978393 us; cnt=32511665
HAKMEM 169/X11    :  4315751 us; cnt=32511665
Bitslice(7)       :  2111625 us; cnt=32511665
Bitslice(24)      :  1443337 us; cnt=32511665   fastest; was 1843217
naive             : 51557971 us; cnt=32511665
Wegner/Kernigan   : 15616245 us; cnt=32511665
Anderson          : 63424235 us; cnt=32511665
8x shift and add  : 19179445 us; cnt=32511665
32x shift and add : 19502222 us; cnt=32511665
%

and for 64-bit:
% gcc-4 -O3 popcnt.c -m64
% ./a.out
FreeBSD version 1 :  4192391 us; cnt=32511665
FreeBSD version 2 :  2812570 us; cnt=32511665   was 3422655
16-bit LUT        :  1494747 us; cnt=32511665   was 1621557
8-bit LUT         :  2182569 us; cnt=32511665
8-bit LUT v2      :  2170191 us; cnt=32511665   was 2094362
Wikipedia #2      :  2001489 us; cnt=32511665
Wikipedia #3      :  1361488 us; cnt=32511665   was 2142576
gcc popcount      :  7350367 us; cnt=32511665
gcc popcountll    :  3718116 us; cnt=32511665
HAKMEM 169/X11    :  4339839 us; cnt=32511665
Bitslice(7)       :  1693849 us; cnt=32511665
Bitslice(24)      :  1206318 us; cnt=32511665   fastest; was 1698257
naive             : 24914902 us; cnt=32511665
Wegner/Kernigan   : 15557531 us; cnt=32511665
Anderson          :  8764008 us; cnt=32511665
8x shift and add  : 21710211 us; cnt=32511665
32x shift and add : 19583803 us; cnt=32511665
%

It looks like gcc 4.2 is nearly always faster for 64-bit code, and sometimes faster for 32-bit code.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB