Update: Faster population counts
Short version: I've updated my popcount benchmark to include code written by Imran Haque (SSE2 and SSSE3 version) and Kim Walisch (improved platform-independent bitslicing and OpenMP support).
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
- 2011 update: Faster population counts
In 2008 I wrote a series of essays on how to do a Tanimoto comparison between two fingerprints. The most compute intensive part computes the population count of a set of bits, that is, the number of bits which are set to 1.
This has a long history in computing, with many variations of how to compute it. The earliest reference I know of is cited by Knuth as the "Gillies-Miller method for sideways addition" in The Preparation of Programs for an Electronic Digital Computer by Wilkes, Wheeler, and Gill, second edition, 1957, 191-193. This citation comes from Knuth's MMIXware: a RISC computer for the third millennium and is also in his Pre-fascicle 1A to volume 4 of his The Art of Computer Programming. In modern terms, this is a "bit-twiddler" solution.
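The "sideways addition" idea survives essentially unchanged in the classic bit-twiddler popcount. This is a modern 64-bit rendering of that approach (the widely known form, not the 1957 original):

```c
#include <stdint.h>

/* Bit-twiddler popcount: sum adjacent 1-bit fields, then 2-bit
   fields, then 4-bit fields in parallel; the final multiply adds
   the eight byte counts into the top byte. */
static unsigned popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```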
However, many other techniques exist. I described a number of them in my essay HAKMEM 169 and other popcount implementations, and collected them into a benchmark so others could test them out. My conclusion in 2008 was that the lookup table was competitive with the best bit-twiddler solution; the fastest algorithm depended on the hardware/compiler combination.
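The lookup-table approach mentioned here, in its simplest 8-bit form, looks something like the following sketch (my own illustration, not the benchmark's exact code):

```c
#include <stdint.h>
#include <stddef.h>

/* 8-bit lookup table popcount: precompute the count for every byte
   value once, then sum table entries for each byte of the input. */
static unsigned char lut[256];

static void init_lut(void) {
    for (int i = 1; i < 256; i++)
        lut[i] = (unsigned char)((i & 1) + lut[i / 2]);
}

static unsigned popcount_lut(const uint64_t *words, size_t n) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t w = words[i];
        for (int b = 0; b < 8; b++)          /* one table probe per byte */
            total += lut[(w >> (8 * b)) & 0xff];
    }
    return total;
}
```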
Times have changed. Hardware has changed. Have the results changed? Get the new benchmark and judge for yourself.
My answer is that if you have a chip with the POPCNT instruction built-in then use it. I still don't have one of those chips, but I know someone who does, who has given me some numbers.
Failing that, there's a 4-bit lookup-table you can implement with processor-specific SSSE3 code.
Failing that, there's a portable bitslice implementation which is only 25% slower than the hardware POPCNT instruction!
Finally, if you have OpenMP then you're likely memory bound with any of these fast implementations.
Built-in popcount
If you use GCC (or clang or Intel's icpc) then you can use the __builtin_popcount and __builtin_popcountll functions. These are portable across every platform, but remember to tell GCC to compile for the specific architecture by using the "-mpopcnt" or "-msse4.2" flag. Otherwise you'll be stuck with a rather slow implementation.
(Why is GCC's fallback implementation slow? Because it is optimized for infrequent use, and can't amortize the setup and cache costs the way the fastest solutions do.)
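For block-oriented work like fingerprint comparisons, the builtin goes in a loop over 64-bit words. A minimal sketch (compile with "-mpopcnt" or "-msse4.2" to get the hardware instruction):

```c
#include <stdint.h>
#include <stddef.h>

/* Sum the popcounts of a block of 64-bit words using the GCC-family
   builtin; with -mpopcnt the compiler emits the POPCNT instruction,
   otherwise it falls back to a slower library routine. */
static unsigned long block_popcount(const uint64_t *words, size_t n) {
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += (unsigned long)__builtin_popcountll(words[i]);
    return total;
}
```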
Shuffling with SSSE3
Years ago I implemented popcount on my PowerPC Mac. The Altivec manual described a neat trick, which is also available on newer Intel chips with the "shuffle" instructions. Briefly, into one 128 bit word pack the byte values {0, 1, 1, 2, 1, 2, ..., 4}, which are the population counts for the values 0..15. With a shuffle instruction and some care you can use this as a 4-bit lookup table. The advantage of this over an 8-bit lookup table is that you can keep the table in a register.
I tend to stay away from assembly programming, or C programming that is this close to assembly. Thankfully, Imran Haque did the work for me. Well, more accurately, for a collaboration he was doing with Vertex. The end result is Anatomy of High-Performance 2D Similarity Calculations [DOI: 10.1021/ci200235e | author's copy] and the SSSE3 implementation is available as part of the supplemental information.
I haven't yet had the chance to test it on a machine with the POPCNT instruction, but estimating from the data I have it's about 5% slower than using __builtin_popcountll.
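The 4-bit shuffle trick described above can be sketched with intrinsics like this (my reconstruction from the description, not Imran's code):

```c
#include <stdint.h>
#include <immintrin.h>

/* SSSE3 popcount of one 128-bit word: PSHUFB uses each low nibble
   as an index into a 16-entry in-register table of nibble popcounts. */
__attribute__((target("ssse3")))
static unsigned popcount_ssse3(__m128i v) {
    const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3,
                                       1,2,2,3, 2,3,3,4);
    const __m128i mask = _mm_set1_epi8(0x0f);
    __m128i lo = _mm_and_si128(v, mask);                    /* low nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask); /* high nibbles */
    __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                               _mm_shuffle_epi8(lut, hi));  /* per-byte counts */
    /* horizontal sum of the 16 byte counts via sum-of-absolute-differences */
    __m128i sums = _mm_sad_epu8(cnt, _mm_setzero_si128());
    return (unsigned)(_mm_cvtsi128_si32(sums) +
                      _mm_cvtsi128_si32(_mm_srli_si128(sums, 8)));
}
```

In practice the real win comes from keeping the accumulator in a register across many 128-bit words, delaying the horizontal sum until the end.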
Slicing with SSE2
Imran also implemented three different variations of the bitslice algorithm, using the SSE2 intrinsics. The 8-bit version is about 8% faster than the 32-bit version and about twice as fast as what I had called "Bitslice(24)."
However ... !
Improved portable bitslicing
Kim Walisch contributed a better version of Cédric Lauradoux's original bitslice implementation. It's portable across the GCC, Microsoft, and Intel compilers, and on my laptop it's only 1% slower than the fastest SSE2 version and only 25% slower than the hardware popcount.
My Timings
My laptop is an Apple MacBook Pro with a 2.53 GHz Intel Core 2 Duo. I used gcc 4.2.1 to generate these timings:
Bitslice(7)         1473263 us; cnt = 32511665
Bitslice(24)        1135288 us; cnt = 32511665
Lauradoux            667818 us; cnt = 32511665
SSE2 8-bit           660754 us; cnt = 32511665
SSE2 16-bit          670274 us; cnt = 32511665
SSE2 32-bit          709952 us; cnt = 32511665
SSSE3                623512 us; cnt = 32511665
16-bit LUT          1407414 us; cnt = 32511665
8-bit LUT           1848074 us; cnt = 32511665
gcc popcount        3750751 us; cnt = 32511665
gcc popcountll      1923057 us; cnt = 32511665
FreeBSD version 1   3652432 us; cnt = 32511665
FreeBSD version 2   2403148 us; cnt = 32511665
Wikipedia #2        1922601 us; cnt = 32511665
Wikipedia #3        1326290 us; cnt = 32511665
HAKMEM 169/X11      4155523 us; cnt = 32511665
naive              22654889 us; cnt = 32511665
Wegner/Kernigan    14077482 us; cnt = 32511665
Anderson            8826281 us; cnt = 32511665
8x shift and add   17395633 us; cnt = 32511665
32x shift and add  17009625 us; cnt = 32511665
My desktop is an Apple iMac with a 3.2 GHz Intel Core i3, also with gcc 4.2.1.
Bitslice(7)         1184333 us; cnt = 32511665
Bitslice(24)         851466 us; cnt = 32511665
Lauradoux            472621 us; cnt = 32511665
SSE2 8-bit           464605 us; cnt = 32511665
SSE2 16-bit          490889 us; cnt = 32511665
SSE2 32-bit          559717 us; cnt = 32511665
SSSE3                381001 us; cnt = 32511665
16-bit LUT          1110590 us; cnt = 32511665
8-bit LUT           1446592 us; cnt = 32511665
gcc popcount        2799601 us; cnt = 32511665
gcc popcountll      1443689 us; cnt = 32511665
FreeBSD version 1   2633763 us; cnt = 32511665
FreeBSD version 2   1779290 us; cnt = 32511665
Wikipedia #2        1428352 us; cnt = 32511665
Wikipedia #3         954748 us; cnt = 32511665
HAKMEM 169/X11      3016396 us; cnt = 32511665
naive              19652501 us; cnt = 32511665
Wegner/Kernigan    12188782 us; cnt = 32511665
Anderson            6490925 us; cnt = 32511665
8x shift and add   12703512 us; cnt = 32511665
32x shift and add  12917430 us; cnt = 32511665
All of these were done with 64-bit executables.
OpenMP
Kim Walisch also rewrote the benchmark to use OpenMP. Use "-fopenmp" to enable OpenMP support in GCC. On my Macs that means:
g++ -O3 popcnt.cpp -o popcnt -mssse3 -fopenmp
My desktop can run four threads, and the timings improve by 17% for the SSSE3 and 30% for the Lauradoux implementations. Indeed, the timings are competitive when OMP_NUM_THREADS=4 on my desktop:
OpenMP Bitslice(7)         550534 us; cnt = 32511665
OpenMP Bitslice(24)        476916 us; cnt = 32511665
OpenMP Lauradoux           318500 us; cnt = 32511665
OpenMP SSE2 8-bit          326695 us; cnt = 32511665
OpenMP SSE2 16-bit         329297 us; cnt = 32511665
OpenMP SSE2 32-bit         337779 us; cnt = 32511665
OpenMP SSSE3               313788 us; cnt = 32511665
OpenMP 16-bit LUT          665908 us; cnt = 32511665
OpenMP 8-bit LUT           851904 us; cnt = 32511665
OpenMP gcc popcount       1525001 us; cnt = 32511665
OpenMP gcc popcountll      789779 us; cnt = 32511665
OpenMP FreeBSD version 1  1348765 us; cnt = 32511665
OpenMP FreeBSD version 2   951160 us; cnt = 32511665
OpenMP Wikipedia #2        746069 us; cnt = 32511665
OpenMP Wikipedia #3        544974 us; cnt = 32511665
OpenMP HAKMEM 169/X11     1564750 us; cnt = 32511665
OpenMP naive              7910515 us; cnt = 32511665
OpenMP Wegner/Kernigan    4481416 us; cnt = 32511665
OpenMP Anderson           3348583 us; cnt = 32511665
OpenMP 8x shift and add   4604222 us; cnt = 32511665
OpenMP 32x shift and add  7252250 us; cnt = 32511665
With 2 threads ("env OMP_NUM_THREADS=2 ./popcnt") the timings are, of course, not as good. For my desktop these are:
OpenMP Bitslice(7)         700565 us; cnt = 32511665
OpenMP Bitslice(24)        540767 us; cnt = 32511665
OpenMP Lauradoux           371317 us; cnt = 32511665
OpenMP SSE2 8-bit          353855 us; cnt = 32511665
OpenMP SSE2 16-bit         350098 us; cnt = 32511665
OpenMP SSE2 32-bit         375103 us; cnt = 32511665
OpenMP SSSE3               341643 us; cnt = 32511665
OpenMP 16-bit LUT          733629 us; cnt = 32511665
OpenMP 8-bit LUT           987779 us; cnt = 32511665
OpenMP gcc popcount       1868047 us; cnt = 32511665
OpenMP gcc popcountll      908507 us; cnt = 32511665
OpenMP FreeBSD version 1  1654957 us; cnt = 32511665
OpenMP FreeBSD version 2  1120131 us; cnt = 32511665
OpenMP Wikipedia #2        882077 us; cnt = 32511665
OpenMP Wikipedia #3        600354 us; cnt = 32511665
OpenMP HAKMEM 169/X11     1935541 us; cnt = 32511665
OpenMP naive             11598757 us; cnt = 32511665
OpenMP Wegner/Kernigan    6799242 us; cnt = 32511665
OpenMP Anderson           4225183 us; cnt = 32511665
OpenMP 8x shift and add   7053618 us; cnt = 32511665
OpenMP 32x shift and add  8877820 us; cnt = 32511665
while for my laptop (which can only run two threads) it's 19% faster for Lauradoux and 16% faster for SSSE3:
OpenMP Bitslice(7)         847764 us; cnt = 32511665
OpenMP Bitslice(24)        678265 us; cnt = 32511665
OpenMP Lauradoux           543400 us; cnt = 32511665
OpenMP SSE2 8-bit          535400 us; cnt = 32511665
OpenMP SSE2 16-bit         537417 us; cnt = 32511665
OpenMP SSE2 32-bit         545232 us; cnt = 32511665
OpenMP SSSE3               522643 us; cnt = 32511665
OpenMP 16-bit LUT         1083908 us; cnt = 32511665
OpenMP 8-bit LUT          1022720 us; cnt = 32511665
OpenMP gcc popcount       2045593 us; cnt = 32511665
OpenMP gcc popcountll     1059633 us; cnt = 32511665
OpenMP FreeBSD version 1  1983575 us; cnt = 32511665
OpenMP FreeBSD version 2  1286835 us; cnt = 32511665
OpenMP Wikipedia #2       1046206 us; cnt = 32511665
OpenMP Wikipedia #3        749595 us; cnt = 32511665
OpenMP HAKMEM 169/X11     2245144 us; cnt = 32511665
OpenMP naive             11980561 us; cnt = 32511665
OpenMP Wegner/Kernigan    7477110 us; cnt = 32511665
OpenMP Anderson           4729793 us; cnt = 32511665
OpenMP 8x shift and add   9284746 us; cnt = 32511665
OpenMP 32x shift and add  9427039 us; cnt = 32511665
By comparison, Kim submitted timings on an earlier benchmark, for a machine and compiler with built-in popcount support:
CPU: Intel Core-i5 670 3.46GHz (Dual Core, 4 threads)
Memory: DDR3-1066
Compiler: Intel C++ 12.0
Operating system: Fedora 13 x64 Linux
I've updated his report so the benchmark names and order match the current suite; you can see I'm missing timings for some of Imran's implementations. Running with OMP_NUM_THREADS=4:
OpenMP Bitslice(7)         406512 us; cnt = 32500610
OpenMP Bitslice(24)        264068 us; cnt = 32500610
OpenMP Lauradoux           259235 us; cnt = 32500610
OpenMP SSE2 32-bit         260720 us; cnt = 32500610
OpenMP 16-bit LUT          477364 us; cnt = 32500610
OpenMP 8-bit LUT           621876 us; cnt = 32500610
OpenMP gcc popcount        254286 us; cnt = 32500610
OpenMP gcc popcountll      242884 us; cnt = 32500610
OpenMP FreeBSD version 1  1029701 us; cnt = 32500610
OpenMP FreeBSD version 2   719584 us; cnt = 32500610
OpenMP Wikipedia #2        543197 us; cnt = 32500610
OpenMP Wikipedia #3        381183 us; cnt = 32500610
OpenMP HAKMEM 169/X11     1134425 us; cnt = 32500610
OpenMP naive              6033387 us; cnt = 32500610
OpenMP Wegner/Kernigan    3868972 us; cnt = 32500610
OpenMP Anderson           2128988 us; cnt = 32500610
OpenMP 8x shift and add   5983622 us; cnt = 32500610
OpenMP 32x shift and add  5709677 us; cnt = 32500610
and with OMP_NUM_THREADS=1 the first few fields are:
OpenMP Bitslice(7)         973486 us; cnt = 32500610
OpenMP Bitslice(24)        471664 us; cnt = 32500610
OpenMP Lauradoux           407135 us; cnt = 32500610
OpenMP SSE2                485601 us; cnt = 32500610
OpenMP 16-bit LUT          930829 us; cnt = 32500610
OpenMP 8-bit LUT          1253154 us; cnt = 32500610
OpenMP gcc popcount        359063 us; cnt = 32500610
OpenMP gcc popcountll      323504 us; cnt = 32500610
... the rest are just too boring ...
You can quite clearly see that the popcount instruction is fastest, but once you have multiple threads running, the paper by Haque et al. points out that you are memory bandwidth bound, not CPU bound. In other words, it's better to do multiple different tasks with the same data, if you can.
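The OpenMP variants all amount to splitting the block of words across threads and combining the per-thread counts with a reduction. A minimal sketch (my own illustration, not the benchmark's code; without "-fopenmp" the pragma is simply ignored and the loop runs serially):

```c
#include <stdint.h>
#include <stddef.h>

/* OpenMP popcount over a block of 64-bit words: each thread counts
   a chunk of the array, and reduction(+) sums the partial totals. */
static unsigned long popcount_omp(const uint64_t *words, size_t n) {
    unsigned long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; i++)
        total += (unsigned long)__builtin_popcountll(words[i]);
    return total;
}
```

With the counting this fast, the threads mostly wait on memory, which is why adding more of them gives diminishing returns.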
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB