OpenMP vs. POSIX threads
A few years ago I heard about OpenMP. It's a form of multi-threaded programming meant to make good use of multiprocessor and multicore hardware.
Earlier this year, I read Anatomy of High-Performance 2D Similarity Calculations, which used OpenMP as part of their Tanimoto search algorithm. This summer, Kim Walisch contributed OpenMP variations to my popcount benchmark.
The changes to the code were almost trivial, so I asked Kim if he would help me add OpenMP support to chemfp. He did, and it will be part of the upcoming 1.1 release.
OpenMP is only one of many ways to make effective use of multiple processors. Another common way is through POSIX threads, or its equivalent on Windows. A third is to spawn off a new process and use IPC to communicate with it.
How do these techniques compare to each other?
I decided to use Python 3.2's new concurrent.futures module to handle the multithreaded and multiprocess cases, or rather, the backport to 2.x. This merges the underlying "threading" and "multiprocessing" APIs into a common form, based on the "future" concept for asynchronous programming. I quickly found that the multiprocess API had too much overhead so I won't talk about it.
My test case computes and stores the NxN Tanimoto similarity matrix between a set of fingerprints. I took N=110885 compounds from the ZINC data set and generated 2048-bit fingerprints using RDKit's hash fingerprint. The chemfp fingerprint type is "RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=4 useHs=1".
I don't actually save the 12,295,483,225 values. For one, simple symmetry reduces that to 6,147,686,170 values. For another, in the problems I'm interested, I can ignore similarities below a threshold. Instead, chemfp internally uses a sparse matrix format.
For this test I used a simple parallelization. I break up the rows of the matrix into bands, and fill in the parts of the upper-right triangle which are at or above the threshold. In the OpenMP version, all the OpenMP threads work on a single band. In the concurrent.futures version, each thread processes its own band. These differences fall naturally out of the how those two APIs work.
Once I got both implementations working, debugged, and optimized, I could finally do some performance numbers. I wanted to see how OpenMP and pthreads scale over a range of processors and range of threshold values. My desktop has two dual-core CPUs, so I decided to rent time on a "High-CPU Extra Large Instance" Amazon EC2 node. It has 8 nodes of the form:
vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz stepping : 10 cpu MHz : 2333.336 cache size : 6144 KB fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni ssse3 cx16 sse4_1 hypervisor lahf_lm bogomips : 4666.67 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management:Sadly, this is not a machine which supports the POPCNT instruction, so the chemfp popcount code fell back to Imran Haque's SSSE3-based implementation.
The machine has about 6GB of free memory. I knew from testing on my desktop that I only needed 4.1 GB for the worst case, so I didn't run into memory problems.
I ran my benchmark code through set of combinations of the OpenMP vs. pthreads, with thread counts from 1 to 8, and with threshold values of 1.0, 0.99, 0.97, 0.95, 0.93, 0.9, 0.88, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, and 0.5. I also ran the values several times in order to get some idea of the timing variations. Yes, that took about 20 hours to run.
You can get the raw
data if you're really interested. I used matplotlib to make a
couple of 3D plots. First, here's the overall times:
You can see that the OpenMP code (in red) is usually faster than the pthread code (in blue). The exception is for thresholds of 0.55 and lower. BTW, a threshold of 0.5 finds 285,371,794 matches in the NxN matrix, which means this stores a few gigabytes of data.
To make more sense of this data, here's a plot of the speedup, defined
as T0/T where T0 is the fastest single-threaded time for a given
threshold and T is the fastest time for a given number of
threads. Perfect speedup would give a value of 8 for 8 processors.
The region near threshold=1.0 is so jagged because the search time is less than the variability in the system, and is close to the minimum time resolution size of 1 second.
Most people in this field use thresholds in the range 0.7-1.0. It's obvious from this graph that OpenMP is the right solution. It's almost always faster, and overall it makes much better use of a multi-core machine.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2013 Andrew Dalke Scientific AB