Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts

HAKMEM 169 and other popcount implementations

This is part 6 of an on-going series dealing with molecular fingerprints:

  1. Molecular fingerprints, background
  2. Generating molecular fingerprints with OpenBabel
  3. Computing Tanimoto scores, quickly
  4. Block Tanimoto calculations
  5. Faster block Tanimoto calculations
  6. HAKMEM 169 and other popcount implementations
  7. Testing the bitslice algorithm for popcount
  8. 2011 update: Faster population counts

HAKMEM

Many other people have worked on high-speed population counts. In 1972 the AI Laboratory at MIT published "HAKMEM", also known as "Artificial Intelligence Memo No. 239". It is a collection of algorithms, tricks, observations, and unsolved but interesting problems, one of which is "Solve Go." Item 169 is a PDP-6/10 assembly routine to count the number of one bits in a word. The operation was called "count ones" and is now called popcount.
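
The trick survives today in C translations; below is a Python transcription of the widely circulated 32-bit C version (the same one that ended up in the X11 sources, which shows up later in my benchmark as "HAKMEM 169/X11"). The function name is mine; the octal masks are the classic ones.

```python
def hakmem_popcount(n):
    """HAKMEM item 169 popcount for a 32-bit word.

    Each octal digit of the masks covers a 3-bit group. The two
    subtractions turn every group into the count of its set bits,
    the shift-and-add merges adjacent groups into 6-bit fields, and
    the final '% 63' sums those fields (digits base 64).
    """
    tmp = n - ((n >> 1) & 0o33333333333) - ((n >> 2) & 0o11111111111)
    return ((tmp + (tmp >> 3)) & 0o30707070707) % 63
```

Three arithmetic operations, two shifts, and a modulus; no table and no loop, which is why it was so attractive on the PDP-10.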

I first came across HAKMEM item 169 in 2005, when I last looked at doing fast Tanimoto calculations. I tried it and a few other solutions but concluded that the special AltiVec instructions were the fastest solution for my PowerPC-based PowerBook. I don't have a PowerPC any more; I bought a MacBook Pro last year with a 2.33 GHz Intel Core 2 Duo. I've been looking quizzically at the MMX instructions, trying to figure out if there's an equivalent solution. I am not an assembly programmer.

The best I could find was an AMD "x86 Code Optimization Guide". It gives a 32-bit integer version using normal Intel assembly and a 64-bit version using MMX which "can do popcounts about twice as fast as the integer version (for an identical number of bits)."
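
I no longer have the guide in front of me, but the 32-bit integer versions in the optimization guides of this era are essentially the standard parallel ("SWAR") reduction. A sketch in Python, with the arithmetic masked to 32 bits (the function name is mine):

```python
def popcount32(x):
    """Parallel popcount of a 32-bit word.

    Each step doubles the field width: 1-bit values become 2-bit
    counts, then 4-bit counts, then byte counts; the final multiply
    sums all four byte counts into the top byte.
    """
    x = x - ((x >> 1) & 0x55555555)                  # 2-bit counts
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 4-bit counts
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # byte counts
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24     # sum the bytes
```

In C the `& 0xFFFFFFFF` is free because unsigned 32-bit arithmetic wraps; in Python it has to be explicit.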

The newer guide (I misplaced the URL) says to use the popcount instruction in SSE4a.

Other popcount references

Popcount is apparently very widely used, and many people have written about computing it quickly.

Benchmarking popcount in GMP

Interesting. Since I have gmpy installed on this machine, I can figure out just how good GMP's hand-written bit twiddling actually is. Here's what I did, with commentary inserted.

# Read the entire data file into memory
>>> s = open("nci.binary_fp").read()
>>> len(s)
155667200

# Convert each character into hex

>>> s = s.encode("hex")

# Convert the hex number into a gmp integer, using base 16

>>> import gmpy
>>> n = gmpy.mpz(s, 16)

# Do the popcount

>>> gmpy.popcount(n)
136963400

# How long did it take?

>>> import time
>>> t1 = time.time(); gmpy.popcount(n); t2 = time.time()  
136963400
>>> t2-t1
0.36236190795898438
>>> print (155667200/256)/(t2-t1), "fingerprints per second"
1678087.53 fingerprints per second

# Verify that I got the right answer by using an 8-bit lookup table

>>> bitcounts = (
... 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,)
>>> bitcount_table = dict((chr(i), count) for (i,count) in enumerate(bitcounts))

# Double-check the table creation -- this should be 2

>>> bitcount_table["A"]
2

# Do the full popcount; this will take a while

>>> sum(bitcount_table[c] for c in s)
136963400
>>> 
GMP takes about 0.36 seconds to compute one popcount. My C code using the 16-bit lookup table (before loop unwinding) took 0.41 seconds to compute two popcounts, or about 0.2 seconds for one. This is in very good agreement with the numbers from a1k0n's benchmark (3.4 seconds vs. 2.1 seconds).
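
For reference, the 16-bit lookup-table approach works like this; a Python sketch of what my C code does with a static 65536-entry byte array (names are mine):

```python
# Build the 65536-entry table once: entry i is the bit count of i.
TABLE16 = [bin(i).count("1") for i in range(1 << 16)]

def popcount_lut16(x):
    """Popcount of a 32-bit word via two 16-bit table lookups."""
    return TABLE16[x & 0xFFFF] + TABLE16[(x >> 16) & 0xFFFF]
```

Two lookups and one addition per 32-bit word, at the price of a 64 KB table.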

Another popcount benchmark

There are so many different algorithms, and the published benchmarks aren't compatible enough for me to compare them directly. I decided to expand a1k0n's benchmark to include a few other implementations. The result is popcnt.c. (Yes, I didn't even change the name.) It implements quite a few algorithms, but it's still incomplete. Anyone want to contribute assembly versions using MMX instructions?

Here's what my laptop reports. It's a 2.33 GHz Intel Core 2 Duo and I compiled with gcc 4.0.1:

% cc popcnt.c -O3
% ./a.out
FreeBSD version 1   :  3828096 us; cnt=32511665
FreeBSD version 2   :  3474751 us; cnt=32511665
16-bit LUT          :  2095066 us; cnt=32511665
8-bit LUT           :  5329692 us; cnt=32511665
8-bit LUT v2        :  3691475 us; cnt=32511665
8x shift and add    : 21675609 us; cnt=32511665
32x shift and add   : 19687827 us; cnt=32511665
Wikipedia #2        :  5264710 us; cnt=32511665
Wikipedia #3        :  5037636 us; cnt=32511665
gcc popcount        :  6088138 us; cnt=32511665
gcc popcountll      :  7548551 us; cnt=32511665
naive               : 23647023 us; cnt=32511665
Wegner/Kernigan     : 15026099 us; cnt=32511665
Anderson            : 63285941 us; cnt=32511665
HAKMEM 169/X11      :  4205008 us; cnt=32511655

The fastest implementation is still the 16-bit lookup table, about 40% faster than the fastest bit-twiddling code. The "8-bit LUT" implementation is by a1k0n. I noticed it was a lot slower than my code, so I investigated. The only difference was the order of calculations for a simple addition; "8-bit LUT v2" is my version. It's interesting that such a minor change causes a major timing difference. And HAKMEM is holding its own; it's even faster than the Wikipedia versions, which I didn't expect.

A Core 2 Duo is a 64-bit processor, so I can ask gcc to compile for 64 bits and see if that makes a difference.

% cc popcnt.c -O3 -m64
% ./a.out
FreeBSD version 1   :  3882201 us; cnt=32511665
FreeBSD version 2   :  3382603 us; cnt=32511665
16-bit LUT          :  1562167 us; cnt=32511665
8-bit LUT           :  2013767 us; cnt=32511665
8-bit LUT v2        :  2054639 us; cnt=32511665
8x shift and add    : 21679291 us; cnt=32511665
32x shift and add   : 20050205 us; cnt=32511665
Wikipedia #2        :  2113876 us; cnt=32511665
Wikipedia #3        :  2102645 us; cnt=32511665
gcc popcount        :  8081774 us; cnt=32511665
gcc popcountll      :  4217908 us; cnt=32511665
naive               : 27717913 us; cnt=32511665
Wegner/Kernigan     : 16618378 us; cnt=32511665
Anderson            : 12300547 us; cnt=32511665
HAKMEM 169/X11      :  5643968 us; cnt=32511655

It does, for some things. The lookup tables are faster (and the two 8-bit versions now tie), and the 64-bit algorithms (Wikipedia, gcc popcountll, and Anderson) are much happier. The parallel algorithm used by the Wikipedia implementations is now almost as fast as the 8-bit lookup table and only about 33% slower than the 16-bit lookup table. HAKMEM, on the other hand, got about a third slower.
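
The 64-bit parallel version (the shape of the Wikipedia implementations) is the same reduction as the 32-bit one with the constants widened, which is why it benefits so much from 64-bit registers. Sketched in Python with explicit 64-bit masking (the name is mine):

```python
def popcount64(x):
    """Parallel popcount of a 64-bit word: same reduction, wider constants."""
    x = x - ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56
```

Counting 64 bits this way costs the same handful of operations as counting 32, so the per-bit cost halves on a 64-bit build.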

Is there any way I can ask gcc to use the 64-bit instructions but still talk to a Python which was compiled for 32 bits?

Profile-directed optimizations with gcc

I recently learned how to do profile-directed optimization with gcc, using the -fprofile-arcs and -fbranch-probabilities options. Sometimes the compiler has to guess which assembly code is best for a given circumstance. For instance, if the branch for an 'if' statement is usually taken, it might hint to the processor pipeline that it should optimistically assume the branch will be taken; but the compiler can't guess correctly all the time. Profile-directed optimization works like this: compile with instrumentation enabled, run the program so gcc can record how the code is actually used, then recompile and let gcc use the recorded profile data.

Is it useful for my case?

     compile with instrumentation - these will be slower than usual!
% cc popcnt.c -O3 -fprofile-arcs
% ./a.out
FreeBSD version 1   :  4304773 us; cnt=32511665
FreeBSD version 2   :  4062710 us; cnt=32511665
16-bit LUT          :  2835959 us; cnt=32511665
8-bit LUT           :  5344514 us; cnt=32511665
8-bit LUT v2        :  4535290 us; cnt=32511665
8x shift and add    : 22246178 us; cnt=32511665
32x shift and add   : 19915850 us; cnt=32511665
Wikipedia #2        :  5626697 us; cnt=32511665
Wikipedia #3        :  5318770 us; cnt=32511665
gcc popcount        :  7005410 us; cnt=32511665
gcc popcountll      :  8173944 us; cnt=32511665
naive               : 62215630 us; cnt=32511665
Wegner/Kernigan     : 34541338 us; cnt=32511665
Anderson            : 65256755 us; cnt=32511665
HAKMEM 169/X11      :  4727297 us; cnt=32511665


      See the profile data?
% ls -l popcnt.*
-rw-r--r--   1 dalke  staff  9177 Jul  3 05:01 popcnt.c
-rw-r--r--   1 dalke  staff  1816 Jul  3 04:43 popcnt.gcda
-rw-r--r--   1 dalke  staff  7872 Jul  3 04:43 popcnt.gcno

      recompile using the profile data
% cc popcnt.c -O3 -fbranch-probabilities
% ./a.out
FreeBSD version 1   :  3815036 us; cnt=32511665
FreeBSD version 2   :  3477915 us; cnt=32511665
16-bit LUT          :  2251514 us; cnt=32511665
8-bit LUT           :  5291010 us; cnt=32511665
8-bit LUT v2        :  3936142 us; cnt=32511665
8x shift and add    : 21715704 us; cnt=32511665
32x shift and add   : 19708580 us; cnt=32511665
Wikipedia #2        :  5273262 us; cnt=32511665
Wikipedia #3        :  5074212 us; cnt=32511665
gcc popcount        :  6431689 us; cnt=32511665
gcc popcountll      :  7562124 us; cnt=32511665
naive               : 28044028 us; cnt=32511665
Wegner/Kernigan     : 15280455 us; cnt=32511665
Anderson            : 63431368 us; cnt=32511665
HAKMEM 169/X11      :  4209093 us; cnt=32511665

No significant differences from the normal 32-bit build. Nor did I see any real difference when I compiled with an architecture-specific flag:

% cc popcnt.c -O3 -march=nocona

Domain applicability

I expect people will come across this page when looking for popcount implementations, see my numbers, and end up using the 16-bit lookup table. That might be the right solution, but bear domain applicability in mind. I plan to compute popcounts over megabytes of data. Most uses, like the proverbial X11 implementation, only need the counts of a few words.

It takes time to bring the lookup table into cache: on modern machines each cache miss costs the equivalent of tens of instructions, and the 16-bit table spans 64 KB. If you're only doing a few popcounts, or doing a lot of popcounts but only occasionally, then it's probably best to use bit-twiddling. It's easier to implement and, for that case, likely faster.
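
For example, the Wegner/Kernighan method from the benchmark is only a few lines, needs no table, and runs in time proportional to the number of set bits, which makes it a fine choice for occasional counts of sparse words. A Python rendition (the name is mine):

```python
def popcount_sparse(x):
    """Wegner/Kernighan popcount: x &= x - 1 clears the lowest set bit."""
    count = 0
    while x:
        x &= x - 1   # drop one set bit per iteration
        count += 1
    return count
```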


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB