HAKMEM 169 and other popcount implementations
This is part 6 of an ongoing series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
- 2011 update: Faster population counts
HAKMEM
Many other people have worked on high-speed population counts. In 1972 the AI Laboratory at MIT published "HAKMEM", or "Artificial Intelligence Memo No. 239". It is a collection of algorithms, tricks, observations, and unsolved but interesting problems, one of which is to "Solve Go." Item 169 is a PDP-6/10 assembly routine to count the number of one bits in a word. The operation was called "count ones" and is now called popcount.
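For reference, here is the widely circulated C translation of Item 169 for 32-bit words. This is my restatement of the version that ended up in X11, not the PDP-10 original, so treat it as a sketch:

#include <stdint.h>

/* HAKMEM 169 in C for a 32-bit word. After the two subtractions,
   each octal digit of tmp holds the popcount of the corresponding
   three bits of n. Adding adjacent octal digits and taking the
   result mod 63 sums the per-digit counts. */
static unsigned int popcount_hakmem(uint32_t n)
{
    uint32_t tmp = n - ((n >> 1) & 033333333333)
                     - ((n >> 2) & 011111111111);
    return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}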
I first came across HAKMEM Item 169 in 2005, the last time I looked at doing fast Tanimoto calculations. I tried it and a few other solutions but concluded that using special Altivec instructions was the fastest solution for my PowerPC-based PowerBook. I don't have a PowerPC any more. I bought a MacBook Pro last year with a 2.33 GHz Intel Core 2 Duo, and I've been looking quizzically at the MMX instructions, trying to figure out if there's an equivalent solution. I am not an assembly programmer.
The best I could find was an AMD "x86 Code Optimization Guide". It gives a 32-bit integer version using normal Intel assembly and a 64-bit version using MMX which "can do popcounts about twice as fast as the integer version (for an identical number of bits)."
The newer guide (I misplaced the URL) says to use the popcount instruction in SSE4a.
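I no longer have the guide in front of me, but as I recall the 32-bit integer version is the usual divide-and-conquer reduction, something like this sketch (reconstructed from memory, not quoted from the guide):

#include <stdint.h>

/* Parallel-sum popcount of a 32-bit word: fold the word into 2-bit,
   4-bit, then 8-bit partial counts, and use a multiply to add the
   four byte counts together. */
static unsigned int popcount_32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555);                /* 2-bit counts */
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); /* 4-bit counts */
    v = (v + (v >> 4)) & 0x0F0F0F0F;                /* 8-bit counts */
    return (v * 0x01010101) >> 24;                  /* sum the four bytes */
}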
Other popcount references
Computing popcounts is apparently a very common need. You can find many people who have talked about it:
- Bart Massey:

popcount_naive: 1.6666e+07 iters in 688 msecs for 41.28 nsecs/iter
popcount_8: 1e+08 iters in 995 msecs for 9.95 nsecs/iter
popcount_6: 2e+08 iters in 1725 msecs for 8.625 nsecs/iter
popcount_hakmem: 2e+08 iters in 1411 msecs for 7.055 nsecs/iter
popcount_keane: 1e+09 iters in 6566 msecs for 6.566 nsecs/iter
popcount_3: 1e+09 iters in 6270 msecs for 6.27 nsecs/iter
popcount_2: 1e+09 iters in 5658 msecs for 5.658 nsecs/iter
popcount_mult: 1e+09 iters in 4169 msecs for 4.169 nsecs/iter
...

At the suggestion of Bill Trost, I added 8 and 16-bit table-driven popcount implementations. These perform the fastest in the benchmark test-stand, at about 3ns/iteration. They're about the same speed, so one would obviously use the 8-bit version.
- Majek has pointers to code. That's where I found the AMD optimization guide link.
- Lauradoux Cédric's Hamming weight page has an overview of the topic. He reports:

Algorithms       Throughput
Naive            29.97 MB/s
Kernighan        33.34 MB/s
Tree 1           122.97 MB/s
Tree 2           130.60 MB/s
Tree 3           125.57 MB/s
Table 8 bits     200.94 MB/s
Table 16 bits    244.64 MB/s

and nicely summarizes why few people use the fastest solution:

The fastest solutions use pre-computation. However, you will never find this two implementations in current software product (X11, linux kernel...). The reason of that is simple: the programmers are reluctant to use solutions implying a big amount of memory since it can affect the portability of the applications and their performance. As a consequence, the algorithms based on tree are the most popular.
- There was a reddit thread in 2007 titled: Popcount - the "NSA instruction" missing from modern CPUs (or at least their datasheets). In it, a1k0n posted a link to a popcount benchmark he wrote. After tweaking the 8-bit version slightly for better performance I got:

FreeBSD version 1 : 3784823 us; cnt=32511665
FreeBSD version 2 : 3421292 us; cnt=32511665
16-bit LUT : 2068947 us; cnt=32511665
8-bit LUT : 3632031 us; cnt=32511665
8x shift and add : 21401951 us; cnt=32511665
32x shift and add : 19419464 us; cnt=32511665

and he also found that lookups were the fastest solution. (There's a sketch of the lookup-table approach just after this list.)
- There's a gcc bug report from April 2008 showing that __builtin_popcount isn't even as fast as other bit-twiddling solutions. Joseph Zbiciak attached a benchmark. On my machine it shows that gcc's builtin solution is about 1/3rd the speed of a better one:

popcount64_1 = 4114 clocks   (gcc's __builtin_popcount)
popcount64_2 = 1881 clocks
popcount64_3 = 1627 clocks
popcount64_4 = 1381 clocks
- There are also a number of popcount algorithms documented by Sean Eron Anderson, including descriptions and further links.
- Cédric in private mail pointed out a nice collection of popcount implementations from 1998/1999 including a cute 11 bit lookup table and an implementation of Robert Harley's bit slice solution, which is great when doing popcount across several words at a time.
- Finally, you can fetch the GMP package. It contains hand-written assembly implementations for many different architectures. The Intel ones are based on bit twiddling and do not use a lookup table.
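Since several of these pages independently conclude that table lookup wins, here is a minimal sketch of the 16-bit lookup-table approach. This is my own illustration, not code from any of the pages above:

#include <stdint.h>

/* Popcounts of all 65536 16-bit values, filled in once at startup.
   Each entry is its right-shifted predecessor's count plus the low bit. */
static unsigned char count16[65536];

static void init_count16(void)
{
    count16[0] = 0;
    for (unsigned int i = 1; i < 65536; i++)
        count16[i] = count16[i >> 1] + (i & 1);
}

/* Count the bits in a 32-bit word with two table lookups. */
static unsigned int popcount16_lut(uint32_t v)
{
    return count16[v & 0xFFFF] + count16[v >> 16];
}

The catch, as Cédric notes, is the 64 KB the table occupies; that's also why the 8-bit variant (a 256-byte table, four lookups per 32-bit word) is the usual compromise.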
Benchmarking popcount in GMP
Interesting. Since I have gmpy installed on this machine I can figure out just how good gmp's hand-written bit twiddling actually is. Here's what I did, with commentary inserted.
# Read the entire data file into memory
>>> s = open("nci.binary_fp").read()
>>> len(s)
155667200
# Hex-encode the data, keeping the original bytes in s for the check later
>>> h = s.encode("hex")
# Convert the hex string into a gmp integer, using base 16
>>> import gmpy
>>> n = gmpy.mpz(h, 16)
# Do the popcount
>>> gmpy.popcount(n)
136963400
# How long did it take?
>>> import time
>>> t1 = time.time(); gmpy.popcount(n); t2 = time.time()
136963400
>>> t2-t1
0.36236190795898438
>>> print (155667200/256)*(t2-t1), "fingerprints per second"
220343.217182 fingerprints per second
# Verify that I got the right answer by using an 8-bit lookup table
>>> bitcounts = (
... 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,)
>>> bitcount_table = dict((chr(i), count) for (i,count) in enumerate(bitcounts))
# Double-check the table creation -- this should be 2
>>> bitcount_table["A"]
2
# Do the full popcount; this will take a while
>>> sum(bitcount_table[c] for c in s)
136963400
>>>
GMP takes about 0.36 seconds to compute one popcount. My C code
before loop unwinding using the 16-bit lookup table took 0.41 seconds
to compute two popcounts, or about 0.2 seconds for one popcount. This
is in very good agreement with the numbers from a1k0n's benchmark (3.4
seconds vs. 2.1 seconds).
Another popcount benchmark
There are so many different algorithms, and the published benchmarks aren't compatible enough for me to compare them directly. I decided to expand a1k0n's benchmark and include a few other implementations. The result is popcnt.c. Yes, I didn't even change the name. It implements quite a few algorithms but it's still incomplete. Anyone want to contribute assembly versions using MMX instructions?
Here's what my laptop reports. It's a 2.33 GHz Intel Core 2 Duo and I compiled with gcc 4.0.1:
% cc popcnt.c -O3
% ./a.out
FreeBSD version 1 : 3828096 us; cnt=32511665
FreeBSD version 2 : 3474751 us; cnt=32511665
16-bit LUT : 2095066 us; cnt=32511665
8-bit LUT : 5329692 us; cnt=32511665
8-bit LUT v2 : 3691475 us; cnt=32511665
8x shift and add : 21675609 us; cnt=32511665
32x shift and add : 19687827 us; cnt=32511665
Wikipedia #2 : 5264710 us; cnt=32511665
Wikipedia #3 : 5037636 us; cnt=32511665
gcc popcount : 6088138 us; cnt=32511665
gcc popcountll : 7548551 us; cnt=32511665
naive : 23647023 us; cnt=32511665
Wegner/Kernigan : 15026099 us; cnt=32511665
Anderson : 63285941 us; cnt=32511665
HAKMEM 169/X11 : 4205008 us; cnt=32511655

The fastest implementation is still the 16-bit lookup table, 40% faster than the fastest bit-twiddling code. The "8-bit LUT" implementation is by a1k0n. I noticed it was a lot slower than my code so I investigated. The only difference was the order of operations in a simple addition. The "8-bit LUT v2" is my version. It's interesting to see that such a minor change causes a major timing difference. And HAKMEM is holding its own; it's even better than the Wikipedia ones, which I didn't expect.
A Core 2 Duo is a 64-bit processor. I can ask gcc to compile for 64-bits and see if that makes a difference.
% cc popcnt.c -O3 -m64
% ./a.out
FreeBSD version 1 : 3882201 us; cnt=32511665
FreeBSD version 2 : 3382603 us; cnt=32511665
16-bit LUT : 1562167 us; cnt=32511665
8-bit LUT : 2013767 us; cnt=32511665
8-bit LUT v2 : 2054639 us; cnt=32511665
8x shift and add : 21679291 us; cnt=32511665
32x shift and add : 20050205 us; cnt=32511665
Wikipedia #2 : 2113876 us; cnt=32511665
Wikipedia #3 : 2102645 us; cnt=32511665
gcc popcount : 8081774 us; cnt=32511665
gcc popcountll : 4217908 us; cnt=32511665
naive : 27717913 us; cnt=32511665
Wegner/Kernigan : 16618378 us; cnt=32511665
Anderson : 12300547 us; cnt=32511665
HAKMEM 169/X11 : 5643968 us; cnt=32511655

It does, for some things. The lookup tables are faster (and the two 8-bit versions now run at essentially the same speed), and the 64-bit algorithms (Wikipedia, gcc popcountll and Anderson) are much happier. The parallel algorithms used by the Wikipedia implementations are almost as fast as the 8-bit lookup table and only 33% slower than the 16-bit lookup table. HAKMEM, though, got 25% slower.
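For comparison, the Wikipedia-style 64-bit parallel count is the same divide-and-conquer idea as the 32-bit version above, with wider masks. Roughly like this (a sketch, not the exact code in popcnt.c):

#include <stdint.h>

/* 64-bit parallel popcount: fold into 2-, 4-, then 8-bit counts,
   then one multiply sums the eight byte counts into the top byte. */
static unsigned int popcount_64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned int)((x * 0x0101010101010101ULL) >> 56);
}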
Is there any way I can ask gcc to use the 64-bit instructions but still talk to a Python which was compiled for 32 bits?
Profile-directed optimizations with gcc
I recently learned how to do profile-directed optimization with gcc using the -fprofile-arcs and -fbranch-probabilities options. Sometimes the compiler needs to guess which assembly code is best for a given circumstance. For instance, if the branch for an 'if' statement is usually taken then it might provide hints to the processor pipeline to optimistically assume the branch will be taken. It can't guess correctly all the time. Profile-directed optimization works by asking gcc to instrument the program during compilation, running the program to record how the code is actually used, saving the results to a file, then recompiling while asking gcc to use the instrumentation data.
Is it useful for my case?
compile with instrumentation - these will be slower than usual!
% cc popcnt.c -O3 -fprofile-arcs
% ./a.out
FreeBSD version 1 : 4304773 us; cnt=32511665
FreeBSD version 2 : 4062710 us; cnt=32511665
16-bit LUT : 2835959 us; cnt=32511665
8-bit LUT : 5344514 us; cnt=32511665
8-bit LUT v2 : 4535290 us; cnt=32511665
8x shift and add : 22246178 us; cnt=32511665
32x shift and add : 19915850 us; cnt=32511665
Wikipedia #2 : 5626697 us; cnt=32511665
Wikipedia #3 : 5318770 us; cnt=32511665
gcc popcount : 7005410 us; cnt=32511665
gcc popcountll : 8173944 us; cnt=32511665
naive : 62215630 us; cnt=32511665
Wegner/Kernigan : 34541338 us; cnt=32511665
Anderson : 65256755 us; cnt=32511665
HAKMEM 169/X11 : 4727297 us; cnt=32511665
See the profile data?
% ls -l popcnt.*
-rw-r--r-- 1 dalke staff 9177 Jul 3 05:01 popcnt.c
-rw-r--r-- 1 dalke staff 1816 Jul 3 04:43 popcnt.gcda
-rw-r--r-- 1 dalke staff 7872 Jul 3 04:43 popcnt.gcno
recompile using the profile data
% cc popcnt.c -O3 -fbranch-probabilities
% ./a.out
FreeBSD version 1 : 3815036 us; cnt=32511665
FreeBSD version 2 : 3477915 us; cnt=32511665
16-bit LUT : 2251514 us; cnt=32511665
8-bit LUT : 5291010 us; cnt=32511665
8-bit LUT v2 : 3936142 us; cnt=32511665
8x shift and add : 21715704 us; cnt=32511665
32x shift and add : 19708580 us; cnt=32511665
Wikipedia #2 : 5273262 us; cnt=32511665
Wikipedia #3 : 5074212 us; cnt=32511665
gcc popcount : 6431689 us; cnt=32511665
gcc popcountll : 7562124 us; cnt=32511665
naive : 28044028 us; cnt=32511665
Wegner/Kernigan : 15280455 us; cnt=32511665
Anderson : 63431368 us; cnt=32511665
HAKMEM 169/X11 : 4209093 us; cnt=32511665
No significant differences from the normal 32-bit build. Nor did I see any real differences when I compiled with architecture-specific flags:
% cc popcnt.c -O3 -march=nocona
Domain applicability
I expect people will come across this page when looking for popcount implementations, see my numbers, and end up using the 16-bit lookup table. That might be the right solution, but bear in mind the domain it applies to. I plan to compute popcounts over megabytes of data at a time. Most people, like the proverbial implementation in X11, work on only a few words.
It takes time to bring the lookup table into cache; on modern machines that's something like the cost of tens of instructions. If you're only doing a few popcounts, or a lot of popcounts but not all the time, then it's probably best to use bit-twiddling. It's easier to implement and, for that case, likely faster.
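For example, the Wegner/Kernighan method from the benchmarks above is about as simple as bit-twiddling gets, and it is quite fast when few bits are set (my own sketch, not the benchmark's exact code):

/* Wegner/Kernighan popcount: v & (v - 1) clears the lowest set bit,
   so the loop runs once per set bit instead of once per bit. */
static unsigned int popcount_sparse(unsigned long v)
{
    unsigned int count = 0;
    while (v) {
        v &= v - 1;  /* clear the lowest set bit */
        count++;
    }
    return count;
}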