Update: Faster population counts

[ previous | newer ] /home/writings/diary/archive/2011/11/02/faster_popcount_update

Update: Faster population counts

Short version: I've updated my popcount benchmark to include code written by Imran Haque (SSE2 and SSSE3 version) and Kim Walisch (improved platform-independent bitslicing and OpenMP support).

Molecular fingerprints, background
Generating molecular fingerprints with OpenBabel
Computing Tanimoto scores, quickly
Block Tanimoto calculations
Faster block Tanimoto calculations
HAKMEM 169 and other popcount implementations
Testing the bitslice algorithm for popcount
2011 update: Faster population counts

In 2008 I wrote a series of essays on how to do a Tanimoto comparison between two fingerprints. The most compute intensive part computes the population count of a set of bits, that is, the number of bits which are set to 1.

This has a long history in computing, with many variations of how to compute it. The earliest reference I know of is cited by Knuth as the "Gillies-Miller method for sideways addition" in The Preparation of Programs for an Electronic Digital Computer by Wilkes, Wheeler, and Gill, second edition, 1957, 191-193. This citation comes from Knuth's MMIXware: a RISC computer for the third millennium and is also in his Pre-fascicle 1A to volume 4 of his The Art of Computer Programming. In modern terms, this is a "bit-twiddler" solution.

However, many other techniques exist. I described a number of them in my essay HAKMEM 169 and other popcount implementations, and collected them into a benchmark so others could test them out. My conclusion in 2008 was that the lookup table was competitive with the best bit-twiddler solution; the fastest algorithm depended on the hardware/compiler combination.

Times have changed. Hardware has changed. Have the results changed? Get the new benchmark and judge for yourself.

My answer is that if you have a chip with the POPCNT instruction built-in then use it. I still don't have one of those chips, but I know someone who does, who has given me some numbers.

Failing that, there's a 4-bit lookup-table you can implement with processor-specific SSSE3 code.

Failing that, there's a portable bitslice implementation which is only 25% slower than the hardware POPCNT instruction!

Finally, if you have OpenMP then you're likely memory bound with any of these fast implementations.

Built-in popcount

If you use GCC (or clang or Intel's icpc) then you can use the __builtin_popcount and __builtin_popcountll functions. These are portable across every platform, but remember to tell GCC to compile for the specific architecture by using the "-mpopcnt" or "-msse4.2" flag. Otherwise you'll be stuck with a rather slow implementation.

(Why is GCC's implementation slow? Because it optimized for infrequent uses, and can't share the setup cost or cache hit cost that the fastest solutions use.)

Shuffling with SSSE3

Years ago I implemented popcount on my PowerPC Mac. The Altivec manual described a neat trick, which is also available on newer Intel chips with the "shuffle" instructions. Briefly, into one 128 bit word pack the byte values {0, 1, 1, 2, 1, 2, ..., 4}, which are the population counts for the values 0..15. With a shuffle instruction and some care you can use this as a 4-bit lookup table. The advantage of this over an 8-bit lookup table is that you can keep the table in a register.

I tend to stay away from assembly programming, or C programming which is this close to assembly. Thankfully, Imran Haque did the work for me. Well, more for a collabration he was doing with Vertex. The end result is Anatomy of High-Performance 2D Similarity Calculations [DOI: 10.1021/ci200235e | author's copy] and the SSSE3 implementation is available as part of the supplemental information.

I haven't yet had the chance to test it on a machine with the POPCNT instruction, but estimating from the data I have it's about 5% slower than using __builtin_popcountll.

Slicing with SSE2

Imran also implemented three different variations of the bitslice algorithm, using the SSE2 intrinsics. The 8-bit version is about 8% faster than the 32-bit version and about twice as fast as what I had called "Bitslice(24)."

However ... !

Improved portable bitslicing

Kim Walisch contributed a better implementation of Lauradoux Cédric's original bitslice implementation. It's portable across the GCC, Microsoft, and Intel compilers and on my laptop it's only 1% slower than the fastest SSE2 version and only 25% slower than the hardware popcount.

My Timings

My laptop is an Apple MacBook Pro with a 2.53 GHz Intel Core 2 Duo. I used gcc 4.2.1 to generate these timings:

Bitslice(7)            1473263 us; cnt = 32511665
Bitslice(24)           1135288 us; cnt = 32511665
Lauradoux               667818 us; cnt = 32511665
SSE2 8-bit              660754 us; cnt = 32511665
SSE2 16-bit             670274 us; cnt = 32511665
SSE2 32-bit             709952 us; cnt = 32511665
SSSE3                   623512 Us; cnt = 32511665
16-bit LUT             1407414 us; cnt = 32511665
8-bit LUT              1848074 us; cnt = 32511665
gcc popcount           3750751 us; cnt = 32511665
gcc popcountll         1923057 us; cnt = 32511665
FreeBSD version 1      3652432 us; cnt = 32511665
FreeBSD version 2      2403148 us; cnt = 32511665
Wikipedia #2           1922601 us; cnt = 32511665
Wikipedia #3           1326290 us; cnt = 32511665
HAKMEM 169/X11         4155523 us; cnt = 32511665
naive                 22654889 us; cnt = 32511665
Wegner/Kernigan       14077482 us; cnt = 32511665
Anderson               8826281 us; cnt = 32511665
8x shift and add      17395633 us; cnt = 32511665
32x shift and add     17009625 us; cnt = 32511665

My desktop is an Apple iMac with a 3.2 GHz Intel Core i3, also with gcc 4.2.1.

Bitslice(7)            1184333 us; cnt = 32511665
Bitslice(24)            851466 us; cnt = 32511665
Lauradoux               472621 us; cnt = 32511665
SSE2 8-bit              464605 us; cnt = 32511665
SSE2 16-bit             490889 us; cnt = 32511665
SSE2 32-bit             559717 us; cnt = 32511665
SSSE3                   381001 us; cnt = 32511665
16-bit LUT             1110590 us; cnt = 32511665
8-bit LUT              1446592 us; cnt = 32511665
gcc popcount           2799601 us; cnt = 32511665
gcc popcountll         1443689 us; cnt = 32511665
FreeBSD version 1      2633763 us; cnt = 32511665
FreeBSD version 2      1779290 us; cnt = 32511665
Wikipedia #2           1428352 us; cnt = 32511665
Wikipedia #3            954748 us; cnt = 32511665
HAKMEM 169/X11         3016396 us; cnt = 32511665
naive                 19652501 us; cnt = 32511665
Wegner/Kernigan       12188782 us; cnt = 32511665
Anderson               6490925 us; cnt = 32511665
8x shift and add      12703512 us; cnt = 32511665
32x shift and add     12917430 us; cnt = 32511665

All of these were done with 64 bit executables.

OpenMP

Kim Walisch also rewrote the benchmark to use OpenMP. Use "-fopenmp" enable OpenMP support on GCC. On my Macs that means:

g++ -O3 popcnt.cpp -o popcnt -mssse3 -fopenmp

My desktop can run four threads, and the timings improve by 17% for the SSSE3 and 30% for the Lauradoux implementations. Indeed, they are competitive timings when OMP_NUM_THREAD=4 on my desktop:

OpenMP Bitslice(7)             550534 us; cnt = 32511665
OpenMP Bitslice(24)            476916 us; cnt = 32511665
OpenMP Lauradoux               318500 us; cnt = 32511665
OpenMP SSE2 8-bit              326695 us; cnt = 32511665
OpenMP SSE2 16-bit             329297 us; cnt = 32511665
OpenMP SSE2 32-bit             337779 us; cnt = 32511665
OpenMP SSSE3                   313788 us; cnt = 32511665
OpenMP 16-bit LUT              665908 us; cnt = 32511665
OpenMP 8-bit LUT               851904 us; cnt = 32511665
OpenMP gcc popcount           1525001 us; cnt = 32511665
OpenMP gcc popcountll          789779 us; cnt = 32511665
OpenMP FreeBSD version 1      1348765 us; cnt = 32511665
OpenMP FreeBSD version 2       951160 us; cnt = 32511665
OpenMP Wikipedia #2            746069 us; cnt = 32511665
OpenMP Wikipedia #3            544974 us; cnt = 32511665
OpenMP HAKMEM 169/X11         1564750 us; cnt = 32511665
OpenMP naive                  7910515 us; cnt = 32511665
OpenMP Wegner/Kernigan        4481416 us; cnt = 32511665
OpenMP Anderson               3348583 us; cnt = 32511665
OpenMP 8x shift and add       4604222 us; cnt = 32511665
OpenMP 32x shift and add      7252250 us; cnt = 32511665

With 2 threads ("env OMP_NUM_THREADS=2 ./popcnt") the timings are, of course, not as great. For my desktop these are:

OpenMP Bitslice(7)             700565 us; cnt = 32511665
OpenMP Bitslice(24)            540767 us; cnt = 32511665
OpenMP Lauradoux               371317 us; cnt = 32511665
OpenMP SSE2 8-bit              353855 us; cnt = 32511665
OpenMP SSE2 16-bit             350098 us; cnt = 32511665
OpenMP SSE2 32-bit             375103 us; cnt = 32511665
OpenMP SSSE3                   341643 us; cnt = 32511665
OpenMP 16-bit LUT              733629 us; cnt = 32511665
OpenMP 8-bit LUT               987779 us; cnt = 32511665
OpenMP gcc popcount           1868047 us; cnt = 32511665
OpenMP gcc popcountll          908507 us; cnt = 32511665
OpenMP FreeBSD version 1      1654957 us; cnt = 32511665
OpenMP FreeBSD version 2      1120131 us; cnt = 32511665
OpenMP Wikipedia #2            882077 us; cnt = 32511665
OpenMP Wikipedia #3            600354 us; cnt = 32511665
OpenMP HAKMEM 169/X11         1935541 us; cnt = 32511665
OpenMP naive                 11598757 us; cnt = 32511665
OpenMP Wegner/Kernigan        6799242 us; cnt = 32511665
OpenMP Anderson               4225183 us; cnt = 32511665
OpenMP 8x shift and add       7053618 us; cnt = 32511665
OpenMP 32x shift and add      8877820 us; cnt = 32511665

while for my laptop (which can only run two threads) it's 19% faster for Lauradoux and 16% faster for SSSE3

OpenMP Bitslice(7)             847764 us; cnt = 32511665
OpenMP Bitslice(24)            678265 us; cnt = 32511665
OpenMP Lauradoux               543400 us; cnt = 32511665
OpenMP SSE2 8-bit              535400 us; cnt = 32511665
OpenMP SSE2 16-bit             537417 us; cnt = 32511665
OpenMP SSE2 32-bit             545232 us; cnt = 32511665
OpenMP SSSE3                   522643 us; cnt = 32511665
OpenMP 16-bit LUT             1083908 us; cnt = 32511665
OpenMP 8-bit LUT              1022720 us; cnt = 32511665
OpenMP gcc popcount           2045593 us; cnt = 32511665
OpenMP gcc popcountll         1059633 us; cnt = 32511665
OpenMP FreeBSD version 1      1983575 us; cnt = 32511665
OpenMP FreeBSD version 2      1286835 us; cnt = 32511665
OpenMP Wikipedia #2           1046206 us; cnt = 32511665
OpenMP Wikipedia #3            749595 us; cnt = 32511665
OpenMP HAKMEM 169/X11         2245144 us; cnt = 32511665
OpenMP naive                 11980561 us; cnt = 32511665
OpenMP Wegner/Kernigan        7477110 us; cnt = 32511665
OpenMP Anderson               4729793 us; cnt = 32511665
OpenMP 8x shift and add       9284746 us; cnt = 32511665
OpenMP 32x shift and add      9427039 us; cnt = 32511665

By comparison, Kim submitted timings on an earlier benchmark, for a machine and compiler with builtin popcount support.

CPU: Intel Core-i5 670 3.46GHz (Dual Core, 4 threads)
Memory: DDR3-1066
Compiler: Intel C++ 12.0
Operating system: Fedora 13 x64 Linux

I've updated his report so the benchmark names and order match the current suite, but you can see I'm missing timings for some of Imran's work. Running with OMP_NUM_THREADS=4, and updating the names and order to match the current benchmark (you can see I'm missing the SSE3 benchmark)

OpenMP Bitslice(7)             406512 us; cnt = 32500610
OpenMP Bitslice(24)            264068 us; cnt = 32500610
OpenMP Lauradoux               259235 us; cnt = 32500610
OpenMP SSE2 32-bit             260720 us; cnt = 32500610
OpenMP 16-bit LUT              477364 us; cnt = 32500610
OpenMP 8-bit LUT               621876 us; cnt = 32500610
OpenMP gcc popcount            254286 us; cnt = 32500610
OpenMP gcc popcountll          242884 us; cnt = 32500610
OpenMP FreeBSD version 1      1029701 us; cnt = 32500610
OpenMP FreeBSD version 2       719584 us; cnt = 32500610
OpenMP Wikipedia #2            543197 us; cnt = 32500610
OpenMP Wikipedia #3            381183 us; cnt = 32500610
OpenMP HAKMEM 169/X11         1134425 us; cnt = 32500610
OpenMP naive                  6033387 us; cnt = 32500610
OpenMP Wegner/Kernigan        3868972 us; cnt = 32500610
OpenMP Anderson               2128988 us; cnt = 32500610
OpenMP 8x shift and add       5983622 us; cnt = 32500610
OpenMP 32x shift and add      5709677 us; cnt = 32500610

and with OMP_NUM_THREADS=1 the first few fields are:

OpenMP Bitslice(7)             973486 us; cnt = 32500610
OpenMP Bitslice(24)            471664 us; cnt = 32500610
OpenMP Lauradoux               407135 us; cnt = 32500610
OpenMP SSE2                    485601 us; cnt = 32500610
OpenMP 16-bit LUT              930829 us; cnt = 32500610
OpenMP 8-bit LUT              1253154 us; cnt = 32500610
OpenMP gcc popcount            359063 us; cnt = 32500610
OpenMP gcc popcountll          323504 us; cnt = 32500610
  ... the rest are just too boring ...

You can quite clearly see that the popcount instruction is fastest, but once you have multiple threads running, the paper by Haque et al. points out that you are memory bandwidth bound, not CPU bound. In other words, it's better to do multiple different tasks with the same data, if you can.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me