Diary RSS | All of Andrew's writings | Diary archive
Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.Python 0.9.1p1 #
Someone on StackOverflow asked: Which version of python added the else clause for for loops?. While the online documentation only goes back to 1.4, I remembered that the original code was posted to the alt.sources newsgroup. A bit of a search and I found copies of the original 21 shar files. (20 shar files, because #2 was withdrawn, then 21 again because of patch #1).
The post series starts with "Python 0.9.1", for example, "Python 0.9.1 part 01/21, and the patch is titled Python 0.9 official patch #1".
I downloaded them, unshar'ed the archive, applied the patch, configured the Makefile correctly, fixed a few compilation problems, and the result was a running version of Python 0.9.1!
[python-0.9.1/src]% ./python >>> for i in range(10): ... pass ... else: ... print '"for" has an "else"!' ... "for" has an "else"! >>>
And very primitive it is too. No copyright message when it starts up. Bad error reporting. Class delaration requires a () even when there are no parents. There's no __init__ special method, so the convention is to call 'Create()' yourself. Uses 'self' and 'this' and other terms instead of only 'self'. Only understands 'single quote strings' and not "double quotes."
Still, it does work. I've put a copy of it at online for your enjoyment. The changes I made are documented in "README.reconstructed". I see that python.org only has Python releases to 1.0.1, so I'll let them know about my reconstruction.
Note: I am not taking any patches. :)
Have fun!
Python for Cheminformatics Training in Leipzig, 27-29 April #
I will be doing another training course for Python in cheminformatics in Leipzig, Germany from 27-29 April. The course is meant for computational chemists who want to be more effective at developing the software they need in their research. All of the examples will be drawn from cheminformatics and closely related fields.
This three day course will cover:
- Day 1: overview of Python and OEChem,
- Day 2: plotting with matplotlib, communicating with Excel, XML processing, calling command-line programs with subprocess, numeric computing with NumPy and R,
- Day 3: SQL databases and web development with Django.
If you're interested in the course, send me an email at dalke@dalkescientific.com. If you know of anyone who might be interested in the course, please forward this to them.
Who am I? My name is Andrew Dalke. I'm a Gothenburg-based software consultant specializing in computational chemistry and biology. I started in this area as a full-time developer in 1995. Some of the public projects I've worked on include VMD, NAMD, Biopython and PyDaylight. I am a member of the Python Software Foundation and an advocate for the use of Python by researchers in the computational life sciences. In addition to writing customized software, I teach courses in using Python for bioinformatics and cheminformatics.
Open source != peer review #
I gave two presentations at the German Conference on Chemoinformatics (GCC) in Goslar, German. One was an update of my EuroSciPy presentation, Python Tools in Computational Chemistry (and Biology). I included more on the history of Python and why I think it became widely used in cheminformatics. At the end I gave some ideas of what I want for the future. I'll elaborate more on that in another posting here.
The second presentation was about some of the difficulties I've seen in doing open source cheminformatics development. I tried a different presentation style: black background with one or only a few words on the slide in white font. It required more practice, but came up pretty nicely I think. I started by writing down everthing I wanted to say. I'll post that text here soon.
Open Source != peer review
One of the slides is titled "Open Source != peer review". I'm breaking that out because it's something I want feedback on, or at least arguments opposing. Here's the short version of that part:
Some argue that doing good computational-based science requires open source. The argument is that scientists need to review the source code in order to verify that it works correctly. How, they argue, can you review someone else's paper if you can't review the source code used to make that paper?
I like open source. (My talk goes into the philosophical differences between "open source" and "free software.") I think there should be support for peer review. But I don't understand why the ability to see the source code, in order to review it for scientific quality, requires the right to redistribute the source code to others.
CHARMm
I gave CHARMm as an example. It's a molecular dynamics program from the Karplus group at Harvard. The academic license costs $600 and all licensees get the source code. People can review the source code, and modify, and I believe even send modifications to others who have a CHARMm license. But it's not open source.
What additional peer review could be done on CHARMm if it was distributed with an open source license? Please bear in mind that the time needed to get up to speed on CHARMm is quite large, and using it for interesting science likely requires good hardware, so $600 isn't that much. The license fees go to supporting CHARMm development, so there are scientific benefits to having the fee.
Yes, I know that free software allows selling the software so it's possible to charge the $600 even for open source code. My talk goes into the difficulties of actually doing that.
My point here is to get feedback about why the right to redistribute software is a requirement for effective peer review, and to tie it down to specific examples. Mine is just one; feel free to use your own.
Clean room development
Once you've done that, please also explain how to solve this problem. Suppose I review someone else's source code, either as part of the peer-review process for a publication or because I want to verify that code I got really does work.
Several months later I write a program which has similar functionality and my implementation looks very much like the code in the first program. Perhaps I deliberately structured it after the first program, or perhaps that form just makes sense. Perhaps I forgot that I had even seen the code in the first program. Perhaps many things.
The author of the first package finds out that my code is similar. Suspiciously similar. Was there a copyright violation? A license violation? What should I do? Change my license? Rewrite the source? Apologize profusely? Claim there was no violation?
Further complications: what if the original source code was submitted in a peer review article and I was a reviewer. If that paper was rejected, so that the original source code was never actually published, then what are the license terms of the software? Eg, it might be "BSD upon publication" but it was never published.
(The peer review system has long figured out how to handle the problem of idea transfer of rejected papers to reviewer, but this essay is about copyright and licenses, not ideas.)
What if there are multiple sections of my code, each vaugely like code in other projects. If I want to be safe and generous I could change my license to match all of them, but even free software licenses can be incompatible. What if I had actually recommended that code structure to others at a conference but there's no paper trail showing the history and we forgot?
The industry solution to this is clean room design. One person or team reviews the code and describes how things work through a specification document. That specification does not contain copyrighted material. Another person or team, who never saw the original source code, takes the specification and implements it.
There's no way we can do full clean-room development in this field. That would mean some people only read code and others only write code, which rather is against the point of doing code review.
If we encourage peer review of source code, which I think we should do, then how do we deal with this issue? How can I be sure that if I review someone's program then in the future they won't accuse me of taking their code and using it in violation of its license agreement? Or what if I did directly take 20 lines of code, figuring it was too small to count? What recourse do they have?
Most of my publicly available software is under the BSD license. Hence if someone uses my software in violation of the license the fix is to add a simple copyright statement to the source code. The violator need make no other changes.
Teaching Python for cheminformatics #
I've started doing training courses for computational chemists who want to learn more about Python. All the examples I use come from cheminformatics, using OpenEye's OEChem toolkit. If you are interested, I'll be teaching a two-day course in the Bay Area on December 4-5, a three-day course in Leipzig on 2-4 March, and I've started planning a course for Boston in early April. See my web site for more details or contact me directly.
Python is quite widely used in cheminformatics, with the major competitor languages being C, C++ and Java. I say that but they aren't really competitors because Python can work together with those languages. Most of the cheminformatics toolkits, after all, are written in one of those languages, and many have Python bindings. You can use PyBel to use OpenBabel from the C implementation of Python, or use Jython to talk to the CDK. OpenBabel even has .Net bindings that you can use from Iron Python.
Last month I did a two day course in Leipzig, hosted through Mike Müller's Python Academy. All of the examples were drawn from cheminformatics, using OEChem as the main cheminformatics toolkit. It's the toolkit I know the best, and I think it's the most powerful and flexible available. I also covered using Python to talk to a database, make plots, communicate with Excel, and other special topics.
The biggest feedback I got was that I tried to cover too much during those two days, and didn't give enough hands-on time. I've did some on-site teaching for AstraZeneca yesterday and another tomorrow, bearing that in mind. In one day I covered the basics of Python and of OEChem. I deliberately did not cover exceptions, classes, or more advanced topics because I don't think they are that important for what most computational chemists do. They use classes but rarely make their own.
My students yesterday were quite happy with the class. There's a few things to tweak - list comprehensions are too complicated for an intro course. I'll see how things go tomorrow.
If you're interested in one of my classes, or want to come to your site to do training, please contact me.
EuroQSAR posters #
I'll be at EuroQSAR 2008 next week in Uppsala. It's just a couple of train trips away after all, and it should be a great way to find new clients and find people for my Python training courses for cheminformatics. If you're there, say hello. I'm curious to know who reads this blog.
I made two posters for the conference. One is about the C-Lab project at AstraZeneca. It's an intranet application for molecular descriptor and model calculations. I mostly worked on the back-end compute engine, which interfaces to a bunch of resources and manages dependencies between them.
The other is about Python for Computational Chemistry (also available in poster size). The point of the poster is that Python is "the best choice of high-level programming language for computational chemistry" and I think I made a good case.
It also has eye candy. I'm quite proud of my "timeline of cheminformatics toolkits" graphic showing how different toolkits are related, when they were in active development, and a bit about how developers have moved between projects.
Update: Here's my EuroSciPy 2008 presentation titled Python Tools in Computational Chemistry (and Biology).
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Read 0 bytes, run out of memory #
I've been working on a file format for chemical fingerprints, influenced by the PNG file format. To make sure I'm doing it right, I wrote a program to dump blocks from PNG files. I made a mistake and my program gave a MemoryError. How did that happen when my test file is only a few K long?
I tracked it down. I don't know if it's a bug. Here's something for you all to ponder over:
BLOCKSIZE = 10*1024*1024
f=open("empty.txt", "w")
f.close()
f=open("empty.txt")
data = []
for i in range(10000):
s = f.read(BLOCKSIZE)
assert len(s) == 0
data.append(s)
That's an empty file, so the read() must return empty strings. The
assert statement verifies that that's the case. But when I run it I get:
Python(18996) malloc: *** vm_allocate(size=10489856) failed (error code=3)
Python(18996) malloc: *** error: can't allocate region
Python(18996) malloc: *** set a breakpoint in szone_error to debug
Traceback (most recent call last):
File "mem_fill.py", line 9, in <module>
s = f.read(BLOCKSIZE)
MemoryError
The reason why is in the C implementation of read
(Objects/fileobject.c). The relevant line i:
v = PyString_FromStringAndSize((char *)NULL, buffersize);
That preallocates space assuming the requested read size will be
correct. In my example code it preallocates 10MB of space even though
the result is 0 bytes long. Since I keep the result around, all of
the preallocated space is also kept around. Repeat that 10,000 times
and my machine quickly runs out of memory. So will yours.
Bug in Python? Correct behavior? You decide. Feel free to make comments if you wish.
Update 9 August: I submitted this as issue3531 in the Python bug tracker. Antoine Pitrou pointed out that there's a string resize at the end of the function, so my assumption on the source of the problem was wrong. I dug into it some more and tracked the problem down to obmalloc.c inside of PyObject_Realloc. The resize ends up calling the C function 'realloc', and it didn't seem to be reallocing.
With those clues I did some searching and found Bob Ippolito's blog post realloc.. doesn't. "Apparently, Darwin's implementation of realloc never frees memory if you ask for a smaller size." There's a thread about this from January 2005 on python-dev titled: "Darwin's realloc(...) implementation never shrinks allocations." There's additional discussion with issue1092502.
I've been looking through various libraries to see if this causes a problem in real life. A denial of service attack is possible but only for formats where the data describes how many bytes to read. The PNG format is one of many. The reader code has to trust that number and use it blindly, but any hardened reader knows those values are under suspicion. For example, the PNG reader in PIL uses ImageFile.py:_safe_read:
# Reads large blocks in a safe way. Unlike fp.read(n), this function
# doesn't trust the user. If the requested size is larger than
# SAFEBLOCK, the file is read block by block.
#
# @param fp File handle. Must implement a read method.
# @param size Number of bytes to read.
# @return A string containing up to size bytes of data.
def _safe_read(fp, size):
if size <= 0:
return ""
if size <= SAFEBLOCK:
return fp.read(size)
data = []
while size > 0:
block = fp.read(min(size, SAFEBLOCK))
if not block:
break
data.append(block)
size = size - len(block)
return string.join(data, "")
which does not end up with un-realloc'ed blocks.
If you aren't using a hardened reader then it's easy to do a standard denial-of-service attack.
I could come up with hypothetical examples, but they are pretty contrived. I tried to write one which sounded at least somewhat reasonable, but I couldn't. Therefore, I don't think this realloc(3C) behavior can cause a serious denial-of-service attack. At worst it seems to make already bad code somewhat more suspectable to attack.
Teaching Python programming for cheminformatics #
I'm announcing two short training classes in Python programming for chemical informatics. I will teach the first in Leipzig, Germany on 6-7 October. It will be hosted by the Python Academy. I'm still working on the details of the second. It will be in the San Francisco Bay Area in early December.
For details about the price, topics and other information, see http://dalkescientific.com/training/ or send me an email.
I work with and develop software tools for computational chemists, mostly in small-molecule chemistry. These are scientists, not programmers, but they use computers every day as part of their research. Most need to do some programming, pehaps to implement a new algorithm, or more likely to handle something that's too tedious or error prone to do manually, like automating analysis of 10,000 compounds.
Most chemists have some training in programming, usually a semester or two of introductory programming classes in college and a lot of training from the school of experience. The latter usually means informal training from labmates, who are also not programming experts.
I've seen many of my chemists friends work hard at getting software to do what they want it to do. Chemists, like most other people who had to spend years of mostly isolated work to get a PhD, know how to perservere. But they would rather do science, not spend time figuring out how to use software.
My training classes are meant for them. I'll cover different aspects of how to be better at Python programming using examples that are directly relevant to doing small molecule in silico research.Testing the bitslice algorithm for popcount #
This is part 7 of an on-going series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
I want to compute the popcount of molecular fingerprints containing 1024 bits. My algorithm has been to compute the popcount 8, 16, 32, or 64 bits at a time and repeat that for all of the words in the fingerprint. There are a couple of algorithms which compute the popcount of multiple words in parallel by doing a partial popcount of each word and combining the intermediate results to get the result over all words.
Lauradoux Cédric wrote a nice overview on Hamming weight, including explanations of the tree merging and bit-slicing algorithms. In personal email he pointed out this page from comp.arch.arithmetic where someone attributes the algorithm to Robert Harley.
It took me a while to understand it. It's been rather a while since I had to do much thinking about half-adders. About 17 years. But it's not hard. I'm not going to explain it though. It's best in pictures and that will take me too long to do. You can follow those links though.
While he hasn't finished the overview explanation on how to combine the two algorithms, his Hamming weight page includes source.
I've added two bit-slicing implementations to my popcnt.c benchmark. The algorithm from comp.arch.arithmetic processes 7 words at a time and the one from Cédric processes 24 words at a time. Both fall back to other methods for the last few words if the input isn't an exact multiple of 24 or 7.
Cédric implemented several algorithms in his code distributions and included programs to compare their performance. The results suggested that the 24 word bitslice algorithm would be twice as fast as as the 16 bit looup table. I implemented it and found that while it was fast, and in my laptop even faster than the 16 bit lookup table, it was only about 10% faster and not a factor of 2.
% sysctl machdep.cpu.brand_string machdep.cpu.brand_string: Intel(R) Core(TM)2 CPU T7600 @ 2.33GHz % gcc --version i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5367) ... deleted ... % cc -O3 popcnt.c % ./a.out FreeBSD version 1 : 3854904 us; cnt=32511665 FreeBSD version 2 : 3517953 us; cnt=32511665 16-bit LUT : 2127718 us; cnt=32511665 was the fastest 8-bit LUT : 5317244 us; cnt=32511665 8-bit LUT v2 : 3762467 us; cnt=32511665 Wikipedia #2 : 5493756 us; cnt=32511665 Wikipedia #3 : 5295541 us; cnt=32511665 gcc popcount : 6242596 us; cnt=32511665 gcc popcountll : 10402364 us; cnt=32511665 HAKMEM 169/X11 : 4305536 us; cnt=32511665 Bitslice(7) : 2077473 us; cnt=32511665 Bitslice(24) : 1843217 us; cnt=32511665 fastest naive : 23460323 us; cnt=32511665 Wegner/Kernigan : 14976014 us; cnt=32511665 Anderson : 63971600 us; cnt=32511665 8x shift and add : 21728414 us; cnt=32511665 32x shift and add : 19747808 us; cnt=32511665
I dug around for a while and used Toby's comment to try Shark as a profiler. Things made a bit more sense after I remembered to add the '-g' option to gcc so I could get Shark to line up the code and assembly side-by-side. My conclusion is that Cédric's code has too much function call overhead because the 16-bit LUT is being called for every 32 bit number, while the 24 word bitslice code is only called every 24 words. My code doesn't have that overhead, which is why the times are closer to each other.
As before, I wondered how much "-m64" affects things. I'll declare it a tie between the new bitslice code and the old 16-bit lookup table. (While though the numbers aren't the same, the difference is about the same as when I run the same algorithm again.)
% cc -O3 popcnt.c -m64 % ./a.out FreeBSD version 1 : 3902385 us; cnt=32511665 FreeBSD version 2 : 3422655 us; cnt=32511665 16-bit LUT : 1621557 us; cnt=32511665 fastest 8-bit LUT : 2070320 us; cnt=32511665 8-bit LUT v2 : 2094362 us; cnt=32511665 Wikipedia #2 : 2135130 us; cnt=32511665 Wikipedia #3 : 2142576 us; cnt=32511665 gcc popcount : 8128024 us; cnt=32511665 gcc popcountll : 4087390 us; cnt=32511665 HAKMEM 169/X11 : 5805272 us; cnt=32511665 Bitslice(7) : 1728262 us; cnt=32511665 Bitslice(24) : 1698257 us; cnt=32511665 fastest naive : 24990997 us; cnt=32511665 Wegner/Kernigan : 16864887 us; cnt=32511665 Anderson : 12463648 us; cnt=32511665 8x shift and add : 21715482 us; cnt=32511665 32x shift and add : 20003160 us; cnt=32511665
All of this was on one processor and compiler. I decided to try another machine with a slightly different processor and an earlier version of gcc.
%sysctl -w hw.model
hw.model: Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz
%gcc --version
gcc (GCC) 3.4.6 [FreeBSD] 20060305
...
%cc popcnt.c -O3
%./a.out
FreeBSD version 1 : 4270677 us; cnt=32511665
FreeBSD version 2 : 3785047 us; cnt=32511665
16-bit LUT : 2034836 us; cnt=32511665
8-bit LUT : 1935156 us; cnt=32511665 fastest
8-bit LUT v2 : 1987370 us; cnt=32511665
Wikipedia #2 : 5010639 us; cnt=32511665
Wikipedia #3 : 4931266 us; cnt=32511665
gcc popcount : 5793299 us; cnt=32511665
gcc popcountll : 6815575 us; cnt=32511665
HAKMEM 169/X11 : 10788366 us; cnt=32511665
Bitslice(7) : 2467150 us; cnt=32511665
Bitslice(24) : 2469399 us; cnt=32511665
naive : 27722786 us; cnt=32511665
... too slow to worry about ...
In this case I would use the 8-bit lookup table, which is about 20%
faster than the the bitslice code.
Here's the results on an older Pentium 4 (Prescott) CPU 2.80GHz with gcc 3.4.6:
FreeBSD version 1 : 3745509 us; cnt=32500610 FreeBSD version 2 : 3774150 us; cnt=32500610 16-bit LUT : 2762583 us; cnt=32500610 8-bit LUT : 2430362 us; cnt=32500610 fastest 8-bit LUT v2 : 2433684 us; cnt=32500610 Wikipedia #2 : 11509758 us; cnt=32500610 Wikipedia #3 : 8540921 us; cnt=32500610 gcc popcount : 9100055 us; cnt=32500610 gcc popcountll : 10568664 us; cnt=32500610 HAKMEM 169/X11 : 10970609 us; cnt=32500610 Bitslice(7) : 4226871 us; cnt=32500610 Bitslice(24) : 3700020 us; cnt=32500610 naive : 36117662 us; cnt=32500610 Wegner/Kernigan : 46479326 us; cnt=32500610 Anderson : 174285325 us; cnt=32500610 8x shift and add : 90028247 us; cnt=32500610 32x shift and add : 86390410 us; cnt=32500610Bitslice(24) is still the fastest bit-twiddler solution, but quite a bit slower than using an 8-bit lookup table. There is no single best solution. It seems the code is also compiler dependent so I'm installing GCC 4.2 to see how that affects things. That'll probably take a few hours so it's time to sleep.
Okay, fink installed gcc 4.2.2 (the most recent is 4.3.1 but I decided it wasn't worth the time to compile it myself). Here's the numbers on my laptop for 32-bit:
% gcc-4 -O3 popcnt.c % ./a.out FreeBSD version 1 : 4161621 us; cnt=32511665 FreeBSD version 2 : 2930977 us; cnt=32511665 was 3517953 16-bit LUT : 2534380 us; cnt=32511665 was 2127718 8-bit LUT : 4472733 us; cnt=32511665 8-bit LUT v2 : 4499285 us; cnt=32511665 was 3762467 Wikipedia #2 : 5031855 us; cnt=32511665 Wikipedia #3 : 3874931 us; cnt=32511665 was 5295541 gcc popcount : 4808307 us; cnt=32511665 gcc popcountll : 5978393 us; cnt=32511665 HAKMEM 169/X11 : 4315751 us; cnt=32511665 Bitslice(7) : 2111625 us; cnt=32511665 Bitslice(24) : 1443337 us; cnt=32511665 fastest; was 1843217 naive : 51557971 us; cnt=32511665 Wegner/Kernigan : 15616245 us; cnt=32511665 Anderson : 63424235 us; cnt=32511665 8x shift and add : 19179445 us; cnt=32511665 32x shift and add : 19502222 us; cnt=32511665 %and for 64 bit
% gcc-4 -O3 popcnt.c -m64 % ./a.out FreeBSD version 1 : 4192391 us; cnt=32511665 FreeBSD version 2 : 2812570 us; cnt=32511665 was 3422655 16-bit LUT : 1494747 us; cnt=32511665 was 1621557 8-bit LUT : 2182569 us; cnt=32511665 8-bit LUT v2 : 2170191 us; cnt=32511665 was 2094362 Wikipedia #2 : 2001489 us; cnt=32511665 Wikipedia #3 : 1361488 us; cnt=32511665 was 2142576 gcc popcount : 7350367 us; cnt=32511665 gcc popcountll : 3718116 us; cnt=32511665 HAKMEM 169/X11 : 4339839 us; cnt=32511665 Bitslice(7) : 1693849 us; cnt=32511665 Bitslice(24) : 1206318 us; cnt=32511665 fastest; was 1698257 naive : 24914902 us; cnt=32511665 Wegner/Kernigan : 15557531 us; cnt=32511665 Anderson : 8764008 us; cnt=32511665 8x shift and add : 21710211 us; cnt=32511665 32x shift and add : 19583803 us; cnt=32511665 %It looks like things are nearly always faster for 64 bit code, and sometimes faster for 32 bit code.
HAKMEM 169 and other popcount implementations #
This is part 6 of an on-going series dealing with molecular fingerprints:
- Molecular fingerprints, background
- Generating molecular fingerprints with OpenBabel
- Computing Tanimoto scores, quickly
- Block Tanimoto calculations
- Faster block Tanimoto calculations
- HAKMEM 169 and other popcount implementations
- Testing the bitslice algorithm for popcount
HAKMEM
Many other people have worked on doing high speed population counts. In 1972 the AI Laboratory at MIT published "HAKMEM", or "Artificial Intelligence Memo No. 239". It is a collection of algorithms, tricks, observations, and unsolved but interesting problems. One of which is to "Solve Go." Item 169 is a PDP-6/10 assembly way to count the number of ones in a word. It was called "count ones" and is now called popcount.
I first came across HAKMEM Item 169 in 2005 when I last looked at doing fast Tanimoto calculations. I tried it and a few other solutions but concluded that using special Altivec instructions were the fastest solution for my PowerPC-based PowerBook. I don't have a PowerPC any more. I bought a MacBook Pro last year with a 2.33 GHz Intel Core 2 Duo. I've been looking at the MMX instructions quizzically trying to figure out if there's an equivalent solution. I am not an assembly programmer.
The best I could find was an AMD "x86 Code Optimization Guide". It gives a 32-bit integer version using normal Intel assembly and a 64-bit version using MMX which "can do popcounts about twice as fast as the integer version (for an identical number of bits)."
The newer guide (I misplaced the URL) says to use the popcount instruction in SSE4a.
Other popcount references
Computing popcount is apparently very widely used. You can find many people have talked about it:
- Bart
Massey:
popcount_naive: 1.6666e+07 iters in 688 msecs for 41.28 nsecs/iter popcount_8: 1e+08 iters in 995 msecs for 9.95 nsecs/iter popcount_6: 2e+08 iters in 1725 msecs for 8.625 nsecs/iter popcount_hakmem: 2e+08 iters in 1411 msecs for 7.055 nsecs/iter popcount_keane: 1e+09 iters in 6566 msecs for 6.566 nsecs/iter popcount_3: 1e+09 iters in 6270 msecs for 6.27 nsecs/iter popcount_2: 1e+09 iters in 5658 msecs for 5.658 nsecs/iter popcount_mult: 1e+09 iters in 4169 msecs for 4.169 nsecs/iter ... At the suggestion of Bill Trost, I added 8 and 16-bit table-driven popcount implementations. These perform the fastest in the benchmark test-stand, at about 3ns/iteration. They're about the same speed, so one would obviously use the 8-bit version.
- Majek has pointers to code. That's where I found the AMD optimization guide link.
- Lauradoux Cédric's Hamming weight
page has an
overview of the topic. He reports:
and nicely summarizes why few people use the fastest solution:Algorithms Throughput Naive 29.97 MB/s Kernighan 33.34 MB/s Tree 1 122.97 MB/s Tree 2 130.60 MB/s Tree 3 125.57 MB/s Table 8 bits 200.94 MB/s Table 16 bits 244.64 MB/s The fastest solutions use pre-computation. However, you will never find this two implementations in current software product (X11, linux kernel. . . ). The reason of that is simple: the programmers are reluctant to use solutions implying a big amount of memory since it can affect the portability of the applications and their performance. As a consequence, the algorithms based on tree are the most popular.
- There was a reddit thread in 2007 titled: Popcount
- the "NSA instruction" missing from modern CPUs (or at least their
datasheets.). In it, a1k0n posted a link to a popcount benchmark
he wrote. After tweaking the 8-bit version slightly for better
performance I got:
FreeBSD version 1 : 3784823 us; cnt=32511665 FreeBSD version 2 : 3421292 us; cnt=32511665 16-bit LUT : 2068947 us; cnt=32511665 8-bit LUT : 3632031 us; cnt=32511665 8x shift and add : 21401951 us; cnt=32511665 32x shift and add : 19419464 us; cnt=32511665
and he also found that lookups were the fastest solution. - There's a gcc bug
report from April 2008 showing that that __builtin_popcount isn't
even as fast as other bit twiddling solutions. Joseph Zbiciak attached
a benchmark. On my machine it shows that gcc's builtin solution
is about 1/3rd the speed of a better one:
popcount64_1 = 4114 clocks gcc's __builtin_popcount popcount64_2 = 1881 clocks popcount64_3 = 1627 clocks popcount64_4 = 1381 clocks
- There's also a number of popcount algorithms documented by Sean Eron Anderson including descriptions and further links.
- Cédric in private mail pointed out a nice collection of popcount implementations from 1998/1999 including a cute 11 bit lookup table and an implementation of Robert Harley's bit slice solution, which is great when doing popcount across several words at a time.
- Finally, you can fetch the GMP package. It contains hand-written assembly implementations for many different architectures. The Intel ones are based on bit twiddling and do not use a lookup table.
Benchmarking popcount in GMP
Interesting. Since I have gmpy installed on this machine I can figure out just how good gmp's hand-written bit twiddling actually is. Here's what I did, with commentary inserted.
# Read the entire data file into memory
>>> s = open("nci.binary_fp").read()
>>> len(s)
155667200
# Convert each character into hex
>>> s = s.encode("hex")
# Convert the hex number into a gmp integer, using base 16
>>> import gmpy
>>> n = gmpy.mpz(s, 16)
# Do the popcount
>>> gmpy.popcount(n)
136963400
# How long did it take?
>>> import time
>>> t1 = time.time(); gmpy.popcount(n); t2 = time.time()
136963400
>>> t2-t1
0.36236190795898438
>>> print (155667200/256)*(t2-t1), "fingerprints per second"
220343.217182 fingerprints per second
# Verify that I got the right answer by using an 8-bit lookup table
>>> bitcounts = (
... 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,)
>>> bitcount_table = dict((chr(i), count) for (i,count) in enumerate(bitcounts))
# Double-check the table creation -- this should be 2
>>> bitcount_table["A"]
2
# Do the full popcount; this will take a while
>>> sum(bitcount_table[c] for c in s)
136963400
>>>
GMP takes about 0.36 seconds to compute one popcount. My C code
before loop unwinding using the 16-bit lookup table took 0.41 seconds
to compute two popcounts, or about 0.2 seconds for one popcount. This
is in very good agreement with the numbers from a1k0n's benchmark (3.4
seconds vs. 2.1 seconds).
Another popcount benchmark
So many different algorithms and the benchmarks aren't compatible enough that I could compare them. I decided to expand a1k0n's benchmark and include a few other implementations. The result is popcnt.c. Yes, I didn't even change the name. It implements quite a few algorithms but it's still incomplete. Anyone want to contribute assembly versions using MMX instructions?
Here's what my laptop reports. It's a 2.33 GHz Intel Core 2 Duo and I compiled with gcc 4.0.1:
% cc popcnt.c -O3 % ./a.out FreeBSD version 1 : 3828096 us; cnt=32511665 FreeBSD version 2 : 3474751 us; cnt=32511665 16-bit LUT : 2095066 us; cnt=32511665 8-bit LUT : 5329692 us; cnt=32511665 8-bit LUT v2 : 3691475 us; cnt=32511665 8x shift and add : 21675609 us; cnt=32511665 32x shift and add : 19687827 us; cnt=32511665 Wikipedia #2 : 5264710 us; cnt=32511665 Wikipedia #3 : 5037636 us; cnt=32511665 gcc popcount : 6088138 us; cnt=32511665 gcc popcountll : 7548551 us; cnt=32511665 naive : 23647023 us; cnt=32511665 Wegner/Kernigan : 15026099 us; cnt=32511665 Anderson : 63285941 us; cnt=32511665 HAKMEM 169/X11 : 4205008 us; cnt=32511655The fastest implementation is still the 16-bit lookup table, at 40% faster than that fastest bit-twiddling code. The "8-bit LUT" implementation is by a1k0n. I noticed it was a lot slower than my code so I investigated. The only difference was the order of calculations for a simple addition. The "8-bit LUT v2" is my version. It's interesting to see that that minor change causes a major timing difference. And HAKMEM is holding its own; it's even better than the Wikipedia ones, which I didn't expect.
A Core 2 Duo is a 64-bit processor. I can ask gcc to compile for 64-bits and see if that makes a difference.
% cc popcnt.c -O3 -m64 % ./a.out FreeBSD version 1 : 3882201 us; cnt=32511665 FreeBSD version 2 : 3382603 us; cnt=32511665 16-bit LUT : 1562167 us; cnt=32511665 8-bit LUT : 2013767 us; cnt=32511665 8-bit LUT v2 : 2054639 us; cnt=32511665 8x shift and add : 21679291 us; cnt=32511665 32x shift and add : 20050205 us; cnt=32511665 Wikipedia #2 : 2113876 us; cnt=32511665 Wikipedia #3 : 2102645 us; cnt=32511665 gcc popcount : 8081774 us; cnt=32511665 gcc popcountll : 4217908 us; cnt=32511665 naive : 27717913 us; cnt=32511665 Wegner/Kernigan : 16618378 us; cnt=32511665 Anderson : 12300547 us; cnt=32511665 HAKMEM 169/X11 : 5643968 us; cnt=32511655It does, for some things. The lookup tables are faster (and the 8-bit counts are now the same), and the 64-bit algorithms (Wikipedia, gcc popcountll and Anderson) are much happier. The parallel algorithm used by the Wikipedia implementations are almost as fast as the 8-bit lookup table and only 33% slower than the 16-bit lookup table. While HAKMEM got 25% slower.
Is there any way I can ask gcc to use the 64-bit instructions but still talk to a Python which was compiled for 32 bits?
Profile-directed optimizations with gcc
I recently learned how to do profile-directed optimizations with gcc using the -fprofile-arcs and -fbranch-probabilities options. Sometimes the compiler needs to make a guess on which assembly code is best for a given circumstance. For instance, if a branch for an 'if' statement is usually taken then it might provide hints to the processor pipline to optimistically assume the branch will be taken. It can't guess correctly all the time. Profile-directed optimizations works by asking gcc during compilation to instrument the program, running the program to get timings on how the code is actually used, saving the results to a file, then recompiling while asking gcc to use the instrumentation data.
Is it useful for my case?
compile with instrumentation - these will be slower than usual!
% cc popcnt.c -O3 -fprofile-arcs
% ./a.out
FreeBSD version 1 : 4304773 us; cnt=32511665
FreeBSD version 2 : 4062710 us; cnt=32511665
16-bit LUT : 2835959 us; cnt=32511665
8-bit LUT : 5344514 us; cnt=32511665
8-bit LUT v2 : 4535290 us; cnt=32511665
8x shift and add : 22246178 us; cnt=32511665
32x shift and add : 19915850 us; cnt=32511665
Wikipedia #2 : 5626697 us; cnt=32511665
Wikipedia #3 : 5318770 us; cnt=32511665
gcc popcount : 7005410 us; cnt=32511665
gcc popcountll : 8173944 us; cnt=32511665
naive : 62215630 us; cnt=32511665
Wegner/Kernigan : 34541338 us; cnt=32511665
Anderson : 65256755 us; cnt=32511665
HAKMEM 169/X11 : 4727297 us; cnt=32511665
See the profile data?
% ls -l popcnt.*
-rw-r--r-- 1 dalke staff 9177 Jul 3 05:01 popcnt.c
-rw-r--r-- 1 dalke staff 1816 Jul 3 04:43 popcnt.gcda
-rw-r--r-- 1 dalke staff 7872 Jul 3 04:43 popcnt.gcno
recompile using the profile data
% cc popcnt.c -O3 -fbranch-probabilities
% ./a.out
FreeBSD version 1 : 3815036 us; cnt=32511665
FreeBSD version 2 : 3477915 us; cnt=32511665
16-bit LUT : 2251514 us; cnt=32511665
8-bit LUT : 5291010 us; cnt=32511665
8-bit LUT v2 : 3936142 us; cnt=32511665
8x shift and add : 21715704 us; cnt=32511665
32x shift and add : 19708580 us; cnt=32511665
Wikipedia #2 : 5273262 us; cnt=32511665
Wikipedia #3 : 5074212 us; cnt=32511665
gcc popcount : 6431689 us; cnt=32511665
gcc popcountll : 7562124 us; cnt=32511665
naive : 28044028 us; cnt=32511665
Wegner/Kernigan : 15280455 us; cnt=32511665
Anderson : 63431368 us; cnt=32511665
HAKMEM 169/X11 : 4209093 us; cnt=32511665
No significant differences over the normal 32-bit code. Nor did I see
any real differences if I compiled with the architecture-specific
flags:
cc popcnt.c -O3 -march=nocona
Domain applicability
I expect people will come across this page when looking for popcount implementations, see my numbers, and end up using the 16-bit lookup table. That might be the right solution but bear in mind domain applicability. I plan to compute popcount over megabytes. Most people, like the proverbial implementation in X11, work on only a few words.
It takes time to bring the lookup table into memory. On modern machines it's something like tens of instructions. If you're only doing a few popcounts, or a lot of popcounts but not all the time, then it's probably best to use bit-twiddling. It's easier to implement and for that case likely faster.
Copyright © 2001-2008 Dalke Scientific Software, LLC.



