Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2008/07/05/bitslice_and_popcount

Testing the bitslice algorithm for popcount

This is part 7 of an on-going series dealing with molecular fingerprints:

  1. Molecular fingerprints, background
  2. Generating molecular fingerprints with OpenBabel
  3. Computing Tanimoto scores, quickly
  4. Block Tanimoto calculations
  5. Faster block Tanimoto calculations
  6. HAKMEM 169 and other popcount implementations
  7. Testing the bitslice algorithm for popcount
  8. 2011 update: Faster population counts
Comments?


I want to compute the popcount of molecular fingerprints containing 1024 bits. My algorithm has been to compute the popcount 8, 16, 32, or 64 bits at a time and repeat that for all of the words in the fingerprint. There are a couple of algorithms which compute the popcount of multiple words in parallel by doing a partial popcount of each word and combining the intermediate results to get the result over all words.

Lauradoux Cédric wrote a nice overview on Hamming weight, including explanations of the tree merging and bit-slicing algorithms. In personal email he pointed out this page from comp.arch.arithmetic where someone attributes the algorithm to Robert Harley.

It took me a while to understand it. It's been rather a while since I had to do much thinking about half-adders. About 17 years. But it's not hard. I'm not going to explain it though. It's best in pictures and that will take me too long to do. You can follow those links though.

While he hasn't finished the overview explanation on how to combine the two algorithms, his Hamming weight page includes source.

I've added two bit-slicing implementations to my popcnt.c benchmark. The algorithm from comp.arch.arithmetic processes 7 words at a time and the one from Cédric processes 24 words at a time. Both fall back to other methods for the last few words if the input isn't an exact multiple of 24 or 7.

Cédric implemented several algorithms in his code distributions and included programs to compare their performance. The results suggested that the 24 word bitslice algorithm would be twice as fast as as the 16 bit looup table. I implemented it and found that while it was fast, and in my laptop even faster than the 16 bit lookup table, it was only about 10% faster and not a factor of 2.

% sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM)2 CPU         T7600  @ 2.33GHz
% gcc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5367)
   ... deleted ...
% cc -O3 popcnt.c
% ./a.out
FreeBSD version 1   :  3854904 us; cnt=32511665
FreeBSD version 2   :  3517953 us; cnt=32511665
16-bit LUT          :  2127718 us; cnt=32511665   was the fastest
8-bit LUT           :  5317244 us; cnt=32511665
8-bit LUT v2        :  3762467 us; cnt=32511665
Wikipedia #2        :  5493756 us; cnt=32511665
Wikipedia #3        :  5295541 us; cnt=32511665
gcc popcount        :  6242596 us; cnt=32511665
gcc popcountll      : 10402364 us; cnt=32511665
HAKMEM 169/X11      :  4305536 us; cnt=32511665
Bitslice(7)         :  2077473 us; cnt=32511665
Bitslice(24)        :  1843217 us; cnt=32511665  fastest
naive               : 23460323 us; cnt=32511665
Wegner/Kernigan     : 14976014 us; cnt=32511665
Anderson            : 63971600 us; cnt=32511665
8x shift and add    : 21728414 us; cnt=32511665
32x shift and add   : 19747808 us; cnt=32511665

I dug around for a while and used Toby's comment to try Shark as a profiler. Things made a bit more sense after I remembered to add the '-g' option to gcc so I could get Shark to line up the code and assembly side-by-side. My conclusion is that Cédric's code has too much function call overhead because the 16-bit LUT is being called for every 32 bit number, while the 24 word bitslice code is only called every 24 words. My code doesn't have that overhead, which is why the times are closer to each other.

As before, I wondered how much "-m64" affects things. I'll declare it a tie between the new bitslice code and the old 16-bit lookup table. (While though the numbers aren't the same, the difference is about the same as when I run the same algorithm again.)

% cc -O3 popcnt.c -m64
% ./a.out
FreeBSD version 1   :  3902385 us; cnt=32511665
FreeBSD version 2   :  3422655 us; cnt=32511665
16-bit LUT          :  1621557 us; cnt=32511665 fastest
8-bit LUT           :  2070320 us; cnt=32511665
8-bit LUT v2        :  2094362 us; cnt=32511665
Wikipedia #2        :  2135130 us; cnt=32511665
Wikipedia #3        :  2142576 us; cnt=32511665
gcc popcount        :  8128024 us; cnt=32511665
gcc popcountll      :  4087390 us; cnt=32511665
HAKMEM 169/X11      :  5805272 us; cnt=32511665
Bitslice(7)         :  1728262 us; cnt=32511665
Bitslice(24)        :  1698257 us; cnt=32511665 fastest
naive               : 24990997 us; cnt=32511665
Wegner/Kernigan     : 16864887 us; cnt=32511665
Anderson            : 12463648 us; cnt=32511665
8x shift and add    : 21715482 us; cnt=32511665
32x shift and add   : 20003160 us; cnt=32511665

All of this was on one processor and compiler. I decided to try another machine with a slightly different processor and an earlier version of gcc.

%sysctl -w hw.model
hw.model: Intel(R) Core(TM)2 CPU          6420  @ 2.13GHz
%gcc --version
gcc (GCC) 3.4.6 [FreeBSD] 20060305
    ...
%cc popcnt.c -O3 
%./a.out 
FreeBSD version 1   :  4270677 us; cnt=32511665
FreeBSD version 2   :  3785047 us; cnt=32511665
16-bit LUT          :  2034836 us; cnt=32511665
8-bit LUT           :  1935156 us; cnt=32511665   fastest
8-bit LUT v2        :  1987370 us; cnt=32511665
Wikipedia #2        :  5010639 us; cnt=32511665
Wikipedia #3        :  4931266 us; cnt=32511665
gcc popcount        :  5793299 us; cnt=32511665
gcc popcountll      :  6815575 us; cnt=32511665
HAKMEM 169/X11      : 10788366 us; cnt=32511665
Bitslice(7)         :  2467150 us; cnt=32511665
Bitslice(24)        :  2469399 us; cnt=32511665
naive               : 27722786 us; cnt=32511665
  ... too slow to worry about ...
In this case I would use the 8-bit lookup table, which is about 20% faster than the the bitslice code.

Here's the results on an older Pentium 4 (Prescott) CPU 2.80GHz with gcc 3.4.6:

FreeBSD version 1   :  3745509 us; cnt=32500610
FreeBSD version 2   :  3774150 us; cnt=32500610
16-bit LUT          :  2762583 us; cnt=32500610
8-bit LUT           :  2430362 us; cnt=32500610    fastest
8-bit LUT v2        :  2433684 us; cnt=32500610
Wikipedia #2        : 11509758 us; cnt=32500610
Wikipedia #3        :  8540921 us; cnt=32500610
gcc popcount        :  9100055 us; cnt=32500610
gcc popcountll      : 10568664 us; cnt=32500610
HAKMEM 169/X11      : 10970609 us; cnt=32500610
Bitslice(7)         :  4226871 us; cnt=32500610
Bitslice(24)        :  3700020 us; cnt=32500610
naive               : 36117662 us; cnt=32500610
Wegner/Kernigan     : 46479326 us; cnt=32500610
Anderson            : 174285325 us; cnt=32500610
8x shift and add    : 90028247 us; cnt=32500610
32x shift and add   : 86390410 us; cnt=32500610
Bitslice(24) is still the fastest bit-twiddler solution, but quite a bit slower than using an 8-bit lookup table. There is no single best solution. It seems the code is also compiler dependent so I'm installing GCC 4.2 to see how that affects things. That'll probably take a few hours so it's time to sleep.


Okay, fink installed gcc 4.2.2 (the most recent is 4.3.1 but I decided it wasn't worth the time to compile it myself). Here's the numbers on my laptop for 32-bit:

% gcc-4 -O3 popcnt.c
% ./a.out 
FreeBSD version 1   :  4161621 us; cnt=32511665
FreeBSD version 2   :  2930977 us; cnt=32511665  was 3517953
16-bit LUT          :  2534380 us; cnt=32511665  was 2127718
8-bit LUT           :  4472733 us; cnt=32511665
8-bit LUT v2        :  4499285 us; cnt=32511665  was 3762467
Wikipedia #2        :  5031855 us; cnt=32511665
Wikipedia #3        :  3874931 us; cnt=32511665  was 5295541
gcc popcount        :  4808307 us; cnt=32511665
gcc popcountll      :  5978393 us; cnt=32511665
HAKMEM 169/X11      :  4315751 us; cnt=32511665
Bitslice(7)         :  2111625 us; cnt=32511665
Bitslice(24)        :  1443337 us; cnt=32511665   fastest; was 1843217
naive               : 51557971 us; cnt=32511665
Wegner/Kernigan     : 15616245 us; cnt=32511665
Anderson            : 63424235 us; cnt=32511665
8x shift and add    : 19179445 us; cnt=32511665
32x shift and add   : 19502222 us; cnt=32511665
%
and for 64 bit
% gcc-4 -O3 popcnt.c -m64
% ./a.out
FreeBSD version 1   :  4192391 us; cnt=32511665
FreeBSD version 2   :  2812570 us; cnt=32511665    was 3422655
16-bit LUT          :  1494747 us; cnt=32511665    was 1621557
8-bit LUT           :  2182569 us; cnt=32511665
8-bit LUT v2        :  2170191 us; cnt=32511665    was 2094362
Wikipedia #2        :  2001489 us; cnt=32511665
Wikipedia #3        :  1361488 us; cnt=32511665    was 2142576
gcc popcount        :  7350367 us; cnt=32511665
gcc popcountll      :  3718116 us; cnt=32511665
HAKMEM 169/X11      :  4339839 us; cnt=32511665
Bitslice(7)         :  1693849 us; cnt=32511665
Bitslice(24)        :  1206318 us; cnt=32511665    fastest; was 1698257
naive               : 24914902 us; cnt=32511665
Wegner/Kernigan     : 15557531 us; cnt=32511665
Anderson            :  8764008 us; cnt=32511665
8x shift and add    : 21710211 us; cnt=32511665
32x shift and add   : 19583803 us; cnt=32511665
%
It looks like things are nearly always faster for 64 bit code, and sometimes faster for 32 bit code.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB