Diary RSS | All of Andrew's writings | Diary archive
Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.Python's concurrent.futures #
In this essay I'll describe how to use the concurrent.futures API from Python 3.2. Since I'm still using Python 2.7, I'll use Alex Grönholm's back port instead.
PEP 3148 gives the motivation for the new concurrent module:
Python currently has powerful primitives to construct multi-threaded and multi-process applications but parallelizing simple operations requires a lot of work i.e. explicitly launching processes/threads, constructing a work/results queue, and waiting for completion or some other termination condition (e.g. failure, timeout). It is also difficult to design an application with a global process/thread limit when each component invents its own parallel execution strategy.Basically, using "threading" and "multiprocessing" are harder than they should be.
The guiding problem: analyze web logs
My web site archives the daily server logs. Filenames are of the form "www.20120115.gz". Each access is a single line in "combined log format." Here's an example line:
198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] "GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"It contains the host IP address, date, URL path, referrer information, user agent, and a few more fields.
I have 169 files which I want to analyze. gzcat *.gz | wc -l says there are 1,346,595 records. I'll use this data set to show some examples of how to use concurrent.futures.
Number of accesses per day (single-threaded)
For the start, how many log events are there per day?
import glob
import gzip
for filename in glob.glob("www_logs/www.*.gz"):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
print filename.split(".")[1], num_lines
Note: gzip files didn't support context managers until Python 2.7. If
you are on Python 2.6 then you'll get the error message
AttributeError: GzipFile instance has no attribute '__exit__'When I run that I get output which looks like:
20110801 7305 20110802 7594 20110803 7470 20110804 7348 20110805 7504 20110806 4774 20110807 4870 20110808 9815 ... 20120113 18124 20120114 9245 20120115 8100 20120116 14117That's too detailed, and hard to interpret. A graph would be nicer. Here it is:
I seem to get more people during the work week than the weekend, and one of my other essays got on Hacker News in early January.
I made that plot using the matplotlib's "pylab" API:
import glob
import gzip
from pylab import *
import datetime
dates = []
counts = []
for filename in glob.glob("www_logs/www.*.gz"):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
dates.append(date)
counts.append(num_lines)
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
That code is a bit ugly, so I'll clean it up a bit and conveniently put it into a form which helps transition to the parallelization code:
import glob
import gzip
import datetime
import time
def count_lines(filename):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
return (date, num_lines)
filenames = glob.glob("www_logs/www.*.gz")
dates = []
counts = []
for filename in filenames:
date, count = count_lines(filename)
dates.append(date)
counts.append(count)
## Believe or not, but this next line does the same as the previous block(!)
# dates, counts = zip(*(count_lines(filename) for filename in filenames))
from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
It's slow. Make it faster!
That code takes 5.5 seconds to read the 1.3 million lines. I have a four core machine - surely I can make better use of my hardware!
I'll start with multiple threads. Python has supported threads since the 1990s, but as we all know, CPython has the Global Interpreter Lock which prevents multiple threads from running Python code at the same time. On the other hand, this task is doing file I/O, and gzip uncompression in code which might release the GIL. Perhaps threads will work here?
I'll use a very standard approach. I'll define a set of jobs, and pass that over to a thread pool. Each job takes a filename to process as input, calls the function "count_lines", and returns the timestamp and number of lines in the file.
Here's how you do that with the concurrent.futures API:
import glob
import gzip
import datetime
from concurrent import futures
def count_lines(filename):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
return (date, num_lines)
filenames = glob.glob("www_logs/www.*.gz")
dates = []
counts = []
with futures.ThreadPoolExecutor(max_workers=2) as executor:
for (date, count) in executor.map(count_lines, filenames):
dates.append(date)
counts.append(count)
from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
The "ThreadPoolExecutor" creates a thread pool, in this case with two
workers. You can submit as many jobs as you want to this thread pool,
but only two (in this case) will be processed at a time. The thread
pool is also a context manager, and no more jobs can be submitted once
the context is finished.
How are jobs submitted? You can either submit a job using submit() or you can submit a number of jobs using the "map() " idiom, which is what I did here. Remember, this is a Python 3.x API so map() returns an iterator, and not a list like it does in Python 2.x.
What is "map"?
The term "map" comes from functional programming, but functional programming is not emphasized in the Python language. Instead, we more often use a list or generator comprehension, or build a list manually. The following three methods are equivalent:
>>> print [ord(c) for c in "Andrew"] [65, 110, 100, 114, 101, 119]
>>> print map(ord, "Andrew") [65, 110, 100, 114, 101, 119]
>>> result = [] >>> for c in "Andrew": ... result.append(ord(c)) ... >>> print result [65, 110, 100, 114, 101, 119]So "map(count_lines, filenames)" is a roughly the same as:
for filename in filenames:
yield count_lines(filename)
and "executor.map" does the same thing, only it uses a thread in the
thread pool to evaluate the function.
Also, to switch the above code to its almost exact single-threaded version, what you can do is get the Python 2.x iterater version of "map" (in itertools.imap) and rewrite the above as:
import itertools
...
for (date, count) in itertools.imap(count_lines, filenames):
dates.append(date)
counts.append(count)
But is it faster?
No. ;)
With one thread in the thread pool, the task takes 5.5 seconds. The overall time is unchanged from the unthreaded version, as we should expect.
With two worker threads, it takes 7.0 seconds - even longer than with one thread!
Three worker threads takes 7.3 seconds, and four threads takes 7.4. This is not a trend you want to see when you need to parallelize your software.
There are two likely candidates for the slowdown. The GIL is the obvious one, but perhaps my computer doesn't handle parallel disk I/O that well.
What about multiple processes?
What I'll do is switch from the multi-threaded version to the multi-processing version. Instead of using a thread pool, I'll have a process pool, which uses interprocess communications to send the job request to each process and get the results:
import glob
import gzip
import datetime
from concurrent import futures
def count_lines(filename):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
return (date, num_lines)
filenames = glob.glob("www_logs/www.*.gz")
dates = []
counts = []
with futures.ProcessPoolExecutor(max_workers=4) as executor:
for (date, count) in executor.map(count_lines, filenames):
dates.append(date)
counts.append(count)
from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
Did you see the difference? I used a "ProcessPoolExecutor" instead of a "ThreadPoolExecutor".
With that small change, a process pool with only one worker finishes in 5.6 seconds, which is a bit slower. That's probably due to the overhead of starting a new process and sending data back and forth.
What's exciting is that two workers finishes in 3.6 seconds, three workers in 2.8 seconds, and four workers in 2.6 seconds. It's obviously not great speedup (perfect scaling would be 5.5, 2.3, 1.8, and 1.1 seconds), but I end up cutting my time in half with relatively little work.
Faster, please
At this point it's safe to assume that most of the gzip+line count code requires the GIL. A quick look at "gzip.py" tells me that, yes, that is the case.
With some non-trivial effort, I could write a specialized C extension to replace the gzip module. That's overkill for this project. Instead, my computer has the usual unix utilities so I'll rewrite the "count_lines" function and let them them do the work instead.
import subprocess
def count_lines(filename):
gzcat = subprocess.Popen(["gzcat", filename],
stdout = subprocess.PIPE)
wc = subprocess.Popen(["wc", "-l"],
stdin = gzcat.stdout,
stdout = subprocess.PIPE)
num_lines = int(wc.stdout.readline())
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
return (date, num_lines)
Using this version, my single-threaded time is 3.2 seconds, with two
threads it's 2.0 seconds, three threads is 1.8 seconds, and four
threads is 1.7 seconds.
The respective times with the process pool are 3.3 seconds, 2.1 seconds, 1.8 seconds and 1.8 seconds. This means that very little time in either of these cases is spent in the GIL, and the slightly slower multiprocess times likely reflects extra cost of starting a process and doing interprocess communications (IPC).
What are the top URLs on my site?
Okay, I admit that the previous section was overkill, but it's fun sometimes to try out and compare different alternatives.
I want to mine my logs for more information. What are the top 10 most downloaded URLs?
This is the perfect situation for Python's Counter container. This was added in Python 2.7; see that link for how to support older versions of Python.
I'll start with the simplest single-threaded version; remember that a line in the log file looks like:
198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] "GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"The following analysis code:
import glob
import gzip
from collections import Counter
counter = Counter()
for filename in glob.glob("www_logs/www.*.gz"):
with gzip.open(filename) as f:
for line in f:
# Extract the path field from the log string
request = line.split('"')[1]
path = request.split(" ")[1]
counter[path] += 1
for path, count in counter.most_common(10):
print count, path
takes 8.9 seconds to generate this listing:
170073 /favicon.ico 93354 /writings/diary/diary-rss.xml 81513 /dss.css 78961 /images/toplogo_left.gif 78655 /images/spacer.gif 78526 /images/toplogo_right.gif 74223 /images/news_title.gif 26528 / 25349 /robots.txt 16962 /writings/NBN/python_intro/standard.cssThat's really not exciting information. In a bit, I'll have it only display counts for the information.
A concurrent.futures version
We've determined that Python's gzip reader uses the GIL, so it's pointless to parallelize the above code using threads.
There's another issue. The "counter" is a global data structure, and that can't be shared across multiple Python processes. I'll have to update the algorithm somewhat. I'll let each worker function process a file and create a new counter for that file. Once it's done, I'll send the counter instance back to the main process for further processing.
Here's a worker function which does that.
def count_urls(filename):
counter = Counter()
with gzip.open(filename) as f:
for line in f:
request = line.split('"')[1]
path = request.split(" ")[1]
counter[path] += 1
return counter
The code in the main process has to kick off all of the jobs, collect the counters from each file, merge the counters into one, and report the top hits. The new(ish) Counter object helps make this easy because the "update()" method sums the values for shared keys instead of replacing like it would for a dictionary.
merged_counter = Counter()
filenames = glob.glob("www_logs/www.*.gz")
with futures.ProcessPoolExecutor(max_workers=4) as executor:
for counter in executor.map(count_urls, filenames):
merged_counter.update(counter)
for path, count in merged_counter.most_common(10):
print count, path
(You might be asking "How does it exchange Python objects?" Answer:
Through pickles.)
The above runs in 4.4 seconds, so about 1/2 the time as the single processor version. And after I fixed a bug (I used "counter" my report, not "merged_counter"), I got identical values as the single-threaded version.
4.4 seconds is pretty good. As we saw before, Python's gzip reader is not as fast as calling out to gzcat, so I decided to use a Popen call instead. Also, I changed the code slightly so it only reports paths which end with ".html".
The final code runs in 3.2 seconds, and here it is:
from collections import Counter
from concurrent import futures
import glob
import gzip
import itertools
import subprocess
def count_urls(filename):
counter = Counter()
p = subprocess.Popen(["gzcat", filename],
stdout = subprocess.PIPE)
for line in p.stdout:
request = line.split('"')[1]
path = request.split(" ")[1]
if path.endswith(".html"):
counter[path] += 1
return counter
filenames = glob.glob("www_logs/www.*.gz")
merged_counter = Counter()
with futures.ProcessPoolExecutor(max_workers=4) as executor:
for counter in executor.map(count_urls, filenames):
merged_counter.update(counter)
for path, count in merged_counter.most_common(10):
print count, path
It tells me that the 10 most popular HTML pages from my site are
15830 /Python/PyRSS2Gen.html 13722 /writings/NBN/python_intro/command_line.html 11739 /writings/NBN/threads.html 10663 /writings/NBN/validation.html 6635 /writings/diary/archive/2007/06/01/lolpython.html 4525 /writings/NBN/writing_html.html 3756 /writings/NBN/generators.html 3465 /writings/NBN/parsing_with_ply.html 2958 /writings/diary/archive/2005/04/21/screen_scraping.html 2786 /writings/NBN/blast_parsing.html
Resolving host names from IP addresses
My logs contain bare IP address. I'm curious about where they come from. I write about cheminformatics; are any computers from pharma companies reading my pages? To do that, I need a fully qualified domain name for each IP address. Moreover, I want to save the IP address to domain name mapping so I can use it in other analyses.
Here's how to get the FQDN given an IP address as a string.
>>> import socket
>>> socket.getfqdn("82.94.164.162")
'dinsdale.python.org'
DNS lookups take a surprisingly long time; 0.2 seconds on my desktop, and I understand this is typical. Since I have 117,504 addresses, that may take a few hours. On the other hand, all of that time is spent waiting for the network to respond. This is easily parallelized.
socket.getfqdn() is single-threaded on a Mac
I tried at first to use multiple threads for this, but that didn't work. No matter how many threads I used, the overall time was the same. After a wild goose chase where I suspected that my ISP throttled the number of DNS lookups, I found the problem.
The getfqdn function is a thin wrapper to socket.gethostbyaddr(), which itself is a thin layer on top of the C function "gethostbyaddr()". In most cases, the underlying API may only be called from a single thread. A common solution is to implement a reentrant version, usually named "gethostbyaddr_r", but the OS X developers decided that people should use a different API for that case. ("getaddrinfo ... is a replacement for and provides more flexibility than the gethostbyname(3) and getservbyname(3) functions".) The Python module only calls the single-threaded code, and uses a lock to ensure that only one thread calls it at a time.
The problem is easily solved by using a process pool instead of a thread pool.
Extracting IP addresses from a set of gzip compressed log files
The first step is to get the IP addresses which I want to convert. I only care about unique IP addresses, and don't want to waste time looking up duplicates. The code to extract the IP addresses is straight-forward. Reading the compressed file is not the slow part, so there's no reason to parallelize this or use an external gzip process to speed things up.
def get_unique_ip_addresses(filenames):
# Report only the unique IP addresses in the set
seen = set()
for filename in filenames:
with gzip.open(filename) as gzfile:
for line in gzfile:
# The IP address is the first word in the line
ip_addr = line.split()[0]
if ip_addr not in seen:
seen.add(ip_addr)
yield ip_addr
I don't want to process the entire data set during testing and
debugging. The above returns an iterator, so I use itertools.islice to
get a section of 100 terms, also as an iterator:
filenames = glob.glob("www_logs/www.*.gz")
ip_addresses = itertools.islice(get_unique_ip_addresses(filenames), 1800, 1900)
(I started with the range (0, 1000), but then ran into the
gethostbyaddr reentrancy problems. I didn't want my computer to do a
simple local cache lookup, so I change the range to (1000, 1100), then
(1100, 1200) and so on. This show that it took me a while to figure
out what was wrong!)
Using "executor.submit()" instead of "executor.map"
How am I going to do the parallel call? I could do a simple
with ProcessPoolExecutor(max_workers=20) as executor:
for fqdn in executor.map(socket.getfqdn, ip_addresses):
print fqdn
but then I lose track of the original IP address, and I wanted to
cache the IP address to FQDN mapping for later use. While it might be
possible to use a combination of itertools.tee and itertools.izip, I
decided that "map" wasn't the right call in the first place.
The executor's "map" function guarantees that the result order will be the same as the input order. I don't care about the order. Instead, I'll submit each job using the "submit()" method.
with futures.ProcessPoolExecutor(max_workers=10) as executor:
jobs = []
for ip_addr in ip_addresses:
job = executor.submit(resolve_fqdn, ip_addr)
jobs.append(job)
The submit function returns a "concurrent.futures.Future"
object. For now, there are two important things about it. You can ask
it for its "result()", like this:
ip_addr, fqdn = job.result()The "result()" method blocks until the Promise has a result. Blocking is bad for performance, so how do you know which job promises are actually ready? Use "concurrent.futures.as_completed()" for that:
for job in futures.as_completed(jobs):
ip_addr, fqdn = job.result()
print ip_addr, fqdn
The last part to this puzzle is to have the actual job return the two
element tuple with both the input IP address and the resulting FQDN
def resolve_fqdn(ip_addr):
fqdn = socket.getfqdn(ip_addr)
return ip_addr, fqdn
Put it all together and the code is:
from concurrent import futures
import glob
import gzip
import itertools
import socket
import time
def get_unique_ip_addresses(filenames):
# Report only the unique IP addresses in the set
seen = set()
for filename in filenames:
with gzip.open(filename) as gzfile:
for line in gzfile:
# The IP address is the first word in the line
ip_addr = line.split()[0]
if ip_addr not in seen:
seen.add(ip_addr)
yield ip_addr
def resolve_fqdn(ip_addr):
fqdn = socket.getfqdn(ip_addr)
return ip_addr, fqdn
filenames = glob.glob("www_logs/www.*.gz")
ip_addresses = itertools.islice(get_unique_ip_addresses(filenames), 1800, 1900)
with futures.ProcessPoolExecutor(max_workers=20) as executor:
jobs = []
for ip_addr in ip_addresses:
job = executor.submit(resolve_fqdn, ip_addr)
jobs.append(job)
# Get the completed jobs whenever they are done
for job in futures.as_completed(jobs):
ip_addr, fqdn = job.result()
print ip_addr, fqdn
This processes 100 IP addresses in about 2-8 seconds. The actual time
is highly dependent on DNS response times from servers around the
world. To reduce the variability, I increased the number of IP
addresses I used for my measurements. I found that with 20 processes I
could do about 50 lookups per second, and with 50 processes I could do
about 90 lookups per second. I didn't try a higher number.
Use a dictionary of futures instead of a list
What I did seems somewhat clumsy in that I send the IP address to the process, and the process sends the IP address back to me. I did that because it was easy. The module documentation shows another technique.
You can keep the jobs in a dictionary, where the key is the future object (returned by "submit()"), and its value is the information you want to save. That is, you can rewrite the above as:
with futures.ProcessPoolExecutor(max_workers=20) as executor:
jobs = {}
for ip_addr in ip_addresses:
job = executor.submit(socket.getfqdn, ip_addr)
jobs[job] = ip_addr
# Get the completed jobs whenever they are done
for job in futures.as_completed(jobs):
ip_addr = jobs[job]
fqdn = job.result()
print ip_addr, fqdn
Notice how it doesn't need the "resolve_fqdn" function; it can call
socket.getfqdn directly.
Add a callback to the job future
The conceptual model so far is "create all the jobs" followed by "do something with the results." This works well, except for latency. I only processed 100 IP addresses in my example. I removed the "islice()" call and asked it to process all 117,504 IP addresses in my data set. The code looked like it wasn't working because it wasn't giving output. As it turned out, it was still loading all of the jobs.
The concurrent module uses an asynchronous model, and just like Twisted's Deferred and jQuery's deferred.promise(), there's a way to attach a callback function to a future, which will be called once the answer is ready. Here's how it works:
with futures.ProcessPoolExecutor(max_workers=50) as executor:
for ip_addr in ip_addresses:
job = executor.submit(resolve_fqdn, ip_addr)
job.add_done_callback(print_mapping)
When each job future is ready, the concurrent library will call the
"print_mapping" callback, with the job result as its sole parameter:
def print_mapping(job):
ip_addr, fqdn = job.result()
print ip_addr, fqdn
Technical notes: The callback occurs in the same process which
submitted the job, which is exactly what's needed here. However, the
documentation doesn't say that all of the callbacks will be done from
the same thread, so if you are using a thread pool then you probably
want to use a thread lock around a shared resource. (sys.stdout is a
shared resource, so you would need one around the print statement
here. I'm using a process pool, and the concurrent process pool
implementation uses a single local worker thread, so I don't think I
have to worry about contention. You should verify that.)
Here is the final callback-based code:
from concurrent import futures
import glob
import gzip
import socket
def get_unique_ip_addresses(filenames):
# Report only the unique IP addresses in the set
seen = set()
for filename in filenames:
with gzip.open(filename) as gzfile:
for line in gzfile:
# The IP address is the first word in the line
ip_addr = line.split()[0]
if ip_addr not in seen:
seen.add(ip_addr)
yield ip_addr
def resolve_fqdn(ip_addr):
fqdn = socket.getfqdn(ip_addr)
return ip_addr, fqdn
## A multi-threaded version should use create a resource lock
# import threading
# write_lock = threading.Lock()
def print_mapping(job):
ip_addr, fqdn = job.result()
print ip_addr, fqdn
## A multi-threaded version should use the resource lock
# with write_lock:
# print ip_addr, fqdn
filenames = glob.glob("www_logs/www.*.gz")
with futures.ProcessPoolExecutor(max_workers=50) as executor:
for ip_addr in get_unique_ip_addresses(filenames):
job = executor.submit(resolve_fqdn, ip_addr)
job.add_done_callback(print_mapping)
It processed my 117,504 addresses in 1236 seconds (about 21 minutes),
which means a rate of 95 per second. That's much better than my
original rate of 5 per second!
functools.partial
By the way, just like earlier, there's no absolute need for the worker function to return the ip address. I could have written this as:
job = executor.submit(socket.getfqdn, ip_addr) job.add_done_callback(functools.partial(print_mapping, ip_addr))or even as an ugly-looking lambda function with a default value to get around scoping issues.
In this variation, print_mapping becomes:
def print_mapping(ip_addr, job):
fqdn = job.result()
print ip_addr, fqdn
where the "ip_addr" was stored by the "partial()", and where "job"
comes from the completed promise.
This approach feels more "pure", but I find that methods like this are harder for most people to understand.
Who subscribes to my blog's RSS feed?
A quick check of the list of hostnames shows that no one from AstraZeneca reads my blog from a work machine. Actually, I don't have any accesses from them at all, which is a bit surprising since I know some of them follow what I do. They might use a blog aggregator like Google Reader, or use a home account, or perhaps AZ's data goes through a proxy which doesn't have the name "az" or "astrazeneca" in it.
There are requests from Roche, and Vertex, but no blog subscribers. Who then subscribes to my blog?
Here I print the hostnames for requests which fetch my blog's RSS feed. With 'zgrep' it's fast enough that I'm not going to parallelize the code.
import subprocess
import glob
hostname_table = dict(line.split() for line in open("hostnames"))
filenames = glob.glob("www_logs/www.*.gz")
p = subprocess.Popen(["zgrep", "--no-filename", "/writings/diary/diary-rss.xml"] + filenames,
stdout = subprocess.PIPE)
for line in p.stdout:
ip_addr = line.split()[0]
hostname = hostname_table[ip_addr]
if hostname_table == ip_addr:
# Couldn't find a reverse lookup; ignore
continue
print hostname
A quick look at the output shows a lot of requests from Amazon and
Google, so I removed those, and report the results using:
% python2.7 readers.py | sort | uniq -c | sort -n | grep -v amazon | grep -v google.comSince I have 169 days of log file, I'll say that "avid readers" poll the URL at least once per day. That gives me:
176 modemcable139.154-178-173.mc.videotron.ca 184 62.197.198.100 187 v041222.dynamic.ppp.asahi-net.or.jp 195 modemcable147.252-178-173.mc.videotron.ca 200 94-226-195-151.access.telenet.be 202 217.28.199.236 223 123.124.21.91 223 65.52.56.128 241 5a-m02-d1.data-hotel.net 263 117.218.210-67.q9.net 263 173-11-122-218-sfba.hfc.comcastbusiness.net 274 71-222-225-175.albq.qwest.net 332 modemcable069.85-178-173.mc.videotron.ca 335 no-dns-yet.convergencegroup.co.uk 337 ip-81-210-146-57.unitymediagroup.de 338 embln.embl.de 353 cpe-72-183-122-94.austin.res.rr.com 365 90-227-178-245-no128.tbcn.telia.com 370 138.194.48.143 408 210.96-246-81.adsl-static.isp.belgacom.be 428 k8024-02l.mc.chalmers.se 490 cpe-70-115-243-212.satx.res.rr.com 493 www26006u.sakura.ne.jp 527 219.239.34.54 534 44.186.34.193.bridgep.com 535 5a-m02-d2.data-hotel.net 586 5a-m02-c6.data-hotel.net 666 168-103-109-30.albq.qwest.net 676 y236106.dynamic.ppp.asahi-net.or.jp 698 82-169-211-97.ip.telfort.nl 759 w-192.cust-7150.ip.static.uno.uk.net 1071 adsl-75-23-68-58.dsl.peoril.sbcglobal.net 1217 hekate.eva.mpg.de 1223 211.103.236.94 1307 li147-78.members.linode.com 1342 static24-72-40-170.r.rev.accesscomm.ca 1346 dinsdale.python.org 1398 145.253.161.126 1771 pat1.orbitz.net 2060 artima.com 3164 148.188.1.60 3767 90-224-169-87-no128.tbcn.telia.com 4518 jervis.textdrive.com 5791 cpe-70-114-252-25.austin.res.rr.com 6171 ip21.biogen.com 8264 it18689.research.novo.dk 10919 61.135.216.104I know who comes from one of the Max Planck Institute machines, and a big "hello!" to the readers from Novo Nordisk and Biogen - thanks for subscribing to my blog! "dinsdale.python.org" is the Planet Python aggregator, and artima.com is the Artima Developer Community; another aggregator.
I know more than 60 people read my blog posts within 12 hours of when they are posted, so this tells me that most people read blogs through a web-based aggregator (like Planet Python or Google Reader), and not through a program running on their desktop. I'm glad to know I'm not alone in doing that!
I parallelize an algorithm #
I, with help from Kim Walisch, have been adding OpenMP support to chemfp. This is the first time I've used OpenMP, and I'm pleased to say there are places where it works really well. However, OpenMP, like every other parallization strategy, is no global panacea. It can take a lot of effort to get good scaling, and there are cases where it doesn't feel any easier to use OpenMP than to use pthreads (POSIX threads).
In this essay I'll walk through how I converted a single-threaded algorithm to OpenMP, and compare the results to a version built on an Python async I/O library atop of pthreads.
The single-threaded algorithm
Suppose you have a list of N objects - let's call them "fingerprints" - and a function which compares two fingerprints - call it "tanimoto" - which returns a similarity score value from 0.0 to 1.0. A score of 0.0 means "not similar" and 1.0 means "very similar." The similarity of a fingerprint compared with itself is always 1.0, and the tanimoto function is symmetric, so that tanimoto(x, y) == tanimoto(y, x).
One question you can ask is "which fingerprints are within threshold similarity to the tenth fingerprint?" I'll use the term "neighbor" to include any fingerprint which at least threshold similar to a given fingerprint, so I can restate the above as "what are the threshold neighbors of the tenth fingerprint.
The code for this is not hard:
fingerprint_t fingerprints[] = {array of fingerprint objects};
int query_index = 9; // The tenth fingerprint
assert(0.0 <= threshold && threshold <= 1.0);
for (target_index=0; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
if (query_index != target_index) {
printf("%d is a neighbor\n", target_index);
}
}
}
The main subtlety is the check that I don't report that the first
fingerprint is a neighbor of itself. There are a few ways to handle
that case: here I chose one which is optimized for performance,
assuming relatively view targets are similar enough.
Fingerprint searches are in a high-dimensional space so optimizations like k-d trees, which work for lower dimensional spaces, suffer from the curse of dimensionality. For exact answers, the best you can expect is linear performance. There are clever ways to get sublinear performance, but the worst case is still linear. Still, computers are fast, and can search 100,000 fingerprints in a blink.
Another question you can ask is, what are the neighbor counts for all of the fingerprints in the data set? Here's code which computes that:
fingerprint_t fingerprints[] = {array of fingerprint objects};
int count, query_index, target_index;
int counts[N] = {}; // initialize to 0
assert(0.0 <= threshold && threshold <= 1.0);
for (query_index=0; query_index<N; query_index++) {
count = 0;
for (target_index=0; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
count++;
}
}
/* The counts are too high by one since it includes the diagonal term */
/* and tanimoto(fingerprint[i], fingerprint[i]) == 1.0 */
/* Decrement by one to get the correct answer */
counts[query_index] += count - 1;
}
What this does is go row-by-row through the NxN comparison matrix,
compute the similarities, and add up the number of times where the
similarity is high enough. Since I include the diagonal term in the
counts, and since the similarity along the diagonal is always 1.0, I
have to subtract off one after computing the total row count.
Some might consider it inelegant that I count the self-similarity in (the main loop and subtract one at the end, but it makes the code short and understandable, and while there are N extra calculations, the double loop has a total of N*N calculations, so it's only a small amount of extra work.)
Parallelizing the NxN algorithm
Parallelizing this with OpenMP is dead-simple. I ask the compiler to evalute the row loop in parallel.
fingerprint_t fingerprints[] = {array of fingerprint objects};
int count, query_index, target_index;
int counts[N] = {}; // initialize to 0
assert(0.0 <= threshold && threshold <= 1.0);
#pragma omp parallel for private(count, target_index)
for (query_index=0; query_index<N; query_index++) {
count = 0;
for (target_index=0; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
count++;
}
}
/* The counts are too high by one since it includes the diagonal term */
/* and tanimoto(fingerprint[i], fingerprint[i]) == 1.0 */
counts[query_index] += count - 1;
}
That's it! OpenMP is a great fit to this case. With one new line of
code, I have very good scaleup across many cores.
It's not perfect scaleup. For one, the tanimoto() calculation is fast; fast enough that memory bandwidth and cache performance is an issue. It might be faster to use Z-ordering or other cache-oblivious ordering. That's outside the scope of this essay. For that matter, I hadn't tested this hypothesis because I use another technique which usually gives good cache behavior for the situations I'm most concerned about.
What about symmetry?
That easy parallelization is great, right? Well, I'm missing out on a simple factor of two speedup. The tanimoto function is symmetric, so I only need to compute the upper triangle terms. Here's the single-threaded implementation:
for (query_index=0; query_index<N; query_index++) {
for (target_index=query_index+1; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
counts[query_index]++;
counts[target_index]++;
}
}
}
It looks simple. Too bad it doesn't parallelize well. Increment is not
an atomic operation. If multiple threads execute "counts[i]++" at the
same time, for the same value of i, then it might be that thread 1
reads a value of 4, thread 2 reads a value of 4, thread 1 writes the
incremented value 5, and thread 2 writes its own incremented value of
5. This is bad.
One solution is to tell OpenMP that the increment code is a "critical" section, which means only one thread can execute it at a time. The resulting code is:
#pragma omp parallel for private(target_index)
for (query_index=0; query_index<N; query_index++) {
for (target_index=query_index+1; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
#pragma omp critical (add_count)
counts[query_index]++;
#pragma omp critical (add_count)
counts[target_index]++;
}
}
}
Here, 'add_count' is the symbolic name for a global lock to a critical
section.
I wrote something like this with a very high threshold, and found and almost perfect two-fold speedup. Go OpenMP!
Amdahl's Law strikes again!
The problem is the critical sections are single-threaded. When I lower the threshold, I find more matches, and more threads try to run the single threaded code. This runs directly into Amdahl's Law. The critical section becomes the limiting factor as all the threads contend for the same lock.
I can reduce the contention a bit by keeping track of the row counts in a thread-local variable:
#pragma omp parallel for private(count, target_index)
for (query_index=0; query_index<N; query_index++) {
count = 0;
for (target_index=query_index+1; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
count++;
#pragma omp critical (add_count)
counts[target_index]++;
}
}
/* Correction on 2012-01-12; Commenter scott_s on Hacker News pointed out */
/* that only one thread will ever get here with a given query_index. */
/* There is no chance that multiple threads try to change the same value */
/* This critical section can be removed without affecting correctness. */
/* I leave this code here because it affects the timings. */
/* It does not affect the conclusion that lock contention can make things slow */
#pragma omp critical (add_count)
counts[query_index] += count;
}
This is the simplest seemingly-reasonable parallization of the
upper-triangle algorithm.
How well does this work? My desktop has four cores. I can compare the performance of the original non-symmetric code with the symmetric one:
| Algorithm | Tanimoto thresholds | ||
|---|---|---|---|
| 0.8 | 0.6 | 0.5 | |
| symmetric | 40s | 151s | 660s |
| non-symmetric | 82s | 170s | 207s |
That's terrible. I have one algorithm which is best when there are few similarities, and another which is best when there are many similarities, and because the number of similarities is highly data-dependent, I don't have an easy way to figure out which algorithm to use.
Use many critical sections instead of one
The problem is that my four cores all want to use a single critical section. When one core has the lock, the other threads have to wait. What I can do is increase the number of critical sections. For example, I can have one lock to get access to the even-numbered rows, and another lock to get access to the odd-numbered rows. Here's the corresponding code:
#pragma omp parallel for private(count, target_index)
for (query_index=0; query_index<N; query_index++) {
count = 0;
for (target_index=query_index+1; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
count++;
switch (target_index % 2) {
case 0:
#pragma omp critical (add_count0)
counts[target_index]++;
break;
case 1:
#pragma omp critical (add_count1)
counts[target_index]++;
break;
}
}
}
/* Correction on 2012-01-12; Commenter scott_s on Hacker News pointed out */
/* that only one thread will ever get here with a given query_index. */
/* There is no chance that multiple threads try to change the same value */
/* The following could be replaced with a simple */
/* counts[query_index] += count; */
/* I leave this code here because it affects the timings. */
/* Since this is only called O(N) (instead of O(N*N/2) times), it */
/* should have minimal effect on the overall time. */
switch (query_index % 2) {
case 0:
#pragma omp critical (add_count0)
counts[query_index] += count;
break;
case 1:
#pragma omp critical (add_count1)
counts[query_index] += count;
break;
}
}
}
That's clumsy, but the performance is a bit better. With two critical
sections the times are:
| Algorithm | Tanimoto thresholds | ||
|---|---|---|---|
| 0.8 | 0.6 | 0.5 | |
| symmetric | 39s | 114s | 375s |
| non-symmetric | 82s | 170s | 207s |
What about even more critical sections? I tried a range of values. Here's the table:
| number of critical sections | Tanimoto thresholds | |||||
|---|---|---|---|---|---|---|
| 0.8 | 0.6 | 0.5 | 0.4 | 0.2 | 0.01 | |
| 1 | 40 | 151 | 660 | |||
| 2 | 39 | 114 | 375 | over 37 minutes | ||
| 16 | 40 | 86 | 133 | 299 | ||
| 64 | 41 | 84 | 105 | 137 | 271 | 307 |
| 128 | 40 | 82 | 102 | 131 | 244 | 278 |
| non-symmetric | 82 | 170 | 207 | 240 | 272 | 280 |
As you can see, with 128 critical sections and in the worst case, my code which takes advantage of symmetry is the same speed as the one which doesn't. This likely means that the code acquire and release the critical section lock has about the same amount of overhead at the Tanimoto similarity calculation.
I probably should have tried 256 different locks, but I think this code is ugly enough as it is, and very few people use thresholds below 0.6, much less down to 0.2. I do wonder how the times compare if there are, say, 16 cores, but this it's time to try a different solution.
What about per-thread count arrays?
There is an alternate solution. I could sum up the counts in individual, private/per-thread arrays and merge the final counts. Here's what the code looks like:
/* Correction 2012-01-08: Originally I had omp_get_num_threads() here but */ /* as acq points out on Hacker News, this only returns the number of active */ /* threads while in a parallel section. Otherwise it returns 1. I fixed my */ /* actual code during testing, but that fix didn't make its way over here. */ /* I also tried using omp_get_max_threads() but that wasn't supported on my Mac. */ int *parallel_counts = (int *) calloc(omp_get_max_threads() * N, sizeof(int)); int *per_thread_counts; #pragma omp parallel for private(count, target_index, per_thread_counts) for (query_index=0; query_index<N; query_index++) { per_thread_counts = parallel_counts + (N * omp_get_thread_num() ); count = 0; for (target_index=query_index+1; target_index<N; target_index++) { if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) { count++; per_thread_counts[target_index]++; } } per_thread_counts[query_index] += count; } for (query_index=0; query_index<N; query_index++) { count = 0; for (thread=0; thread<omp_get_num_threads(); thread++) { count += parallel_counts[query_index+N*thread]; } counts[query_index] += count; } free(parallel_counts);This requires no locking, and only a very small bit of sequential code which is linear in the number of fingerprints. There's more code, but this algorithm should scale better than the previous algorithm.
Here are the timings:
| Tanimoto thresholds | ||||||
|---|---|---|---|---|---|---|
| method | 0.8 | 0.6 | 0.5 | 0.4 | 0.2 | 0.01 |
| 128 critical sections | 40 | 82 | 102 | 131 | 244 | 278 |
| non-symmetric | 82 | 170 | 207 | 240 | 272 | 280 |
| per-thread counts | 40 | 83 | 100. | 116 | 135 | 137 |
And now using pthreads from Python
You might conclude this still shows a win for OpenMP. The problem is that the above is essentially identical to how I implement this algorithm using pthreads. I'm rather fond of Python's new concurrent.futures module, so I tested out a pthread-only driver to a single-threaded count function implemented in C.
import ctypes
import time
import itertools
from collections import defaultdict
import threading
import chemfp
from chemfp import futures
import _chemfp
def count_tanimoto_hits(all_counts, arena, threshold, row):
thread_id = threading.current_thread().ident
# This implements essentially:
# for (i=row; i<row+1; i++) {
# for (j=row+1; j<N; j++) {
# if tanimoto(fingerprints[i], fingerprints[j]) >= threshold {
# counts[i]++;
# counts[j]++;
# }
# }
_chemfp.count_tanimoto_hits_arena_symmetric(
threshold, arena.num_bits,
arena.start_padding, arena.end_padding, arena.storage_size, arena.arena,
row, row+1, row+1, len(arena),
arena.popcount_indices, all_counts[thread_id])
def find_counts(arena, threshold, num_threads):
# Allocate per-thread storage (based on the thread-id)
def make_empty_counts():
return (ctypes.c_int*len(arena))()
all_counts = defaultdict(make_empty_counts)
# Use a thread-pool with 4 worker threads
with futures.ThreadPoolExecutor(max_workers=4) as executor:
for row in xrange(len(arena)):
executor.submit(count_tanimoto_hits, all_counts, arena, threshold, row)
# Merge the private counts back into one total list of counts
return [sum(cols) for cols in itertools.izip(*all_counts.values())]
arena = chemfp.load_fingerprints("zinc_drug_like.fps")
chemfp.set_num_threads(1) # Don't use multiple OpenMP threads
for threshold in (0.8, 0.6, 0.5, 0.4, 0.2, 0.01):
t1 = time.time()
x = find_counts(arena, threshold, 4)
t2 = time.time()
print threshold, t2-t1, " ", sum(x)
The "chemfp.set_num_threads(1)" case bypasses the OpenMP-based code
and tells "count_tanimoto_hits_arena_symmetric" to use the simple
upper-right triangle implementation. As a result ...
| Tanimoto thresholds | ||||||
|---|---|---|---|---|---|---|
| method | 0.8 | 0.6 | 0.5 | 0.4 | 0.2 | 0.01 |
| 128 critical sections | 40 | 82 | 102 | 131 | 244 | 278 |
| non-symmetric | 82 | 170 | 207 | 240 | 272 | 280 |
| per-thread counts | 40 | 83 | 100. | 116 | 135 | 137 |
| Python/pthreads | 48 | 92 | 112 | 128 | 145 | 149 |
While I did not test it out, I expect that a corresponding C/C++ implementation would have much less performance overhead. I just don't have the experience of using C pthread API, or an async I/O library for C/C++ like Grand Central Dispatch, or C++'s new promises to try to implement that code directly in C. It really is much easier to use OpenMP than to figure out those alternate solutions for C.
Possible bad benchmark comparisons
BTW, what I ended up doing in my Python code was to define a "band" of 100 rows, and let each thread process 100 rows at a time. This should cut the overhead down from 8 seconds to 0.08 seconds, making the pthread code about comparable to the OpenMP code. I didn't test it out though, because my actual code uses a more sophisticated algorithm which also have the effect of improving cache coherency, and there's evidence that banding makes the coherency worse and causes slowdowns while waiting for memory fetches.
Unfortunately, it also looks like the analysis I did the other day had a flaw which causes the pthread benchmark to have bad memory access behavior. In short, the pthreads were randomly assigned bands to process, while the OpenMP version also gets randomly assigned bands, but all of the cores work on tasks in the same band. Hence, better coherency. (It looks like the pthread performance for one test case goes from 48 seconds with randomly assigned bands to 40 seconds with sequentially assigned bands.)I consider this a win for OpenMP. I did random assignments so I could display a progress bar. Assymmetries in the data mean that the first few bands and the last few bands are much easier to process than ones in the middle. With random band assignment, I get a much better estimate of the time to completion. Using OpenMP gives me that estimate without a noticable performance hit. With pthreads, it's much hard to get both performance and a estimate.
Conclusion
Effective parallelization with good scaleup is hard, no matter which technique you use. There are a lot of subtle issues. You need to understand how the technique works and measure your results to make sure you really do understand the problem.
My experience is that OpenMP is an effective technique to help you with your task. In a few cases, a trivially simple addition of a line of code gives instant speedup. It's more likely though that you've got some work ahead of you to make this happen.
If you're already using pthreads, Grand Central Dispatch, or some other multithreading or asyncronous I/O system, then I don't think that OpenMP adds much new. Instead, I think its biggest advantage is that you can make changes to existing code without introducing a new library API, without having to set up your own event loop/reactor, and with compiler-based thread control and syncronization primitives which make a large number of potential bugs of hand-built systems disappear.
My views on OpenMP #
In private email a correspondent observed that OpenMP makes threading very easy, but "it really seems under utilized in the community." (Here, 'community' is 'scientific programming.') I was surprised to find out that I had strong views on the topic.
OpenMP sits between several other pieces of technology, being:
- GPU computing
- cloud computing
- POSIX and other common threading libraries
The new hotness is GPUs. Wes Faler gave a presentation at the recent 28th Chaos Communication Congress on Evolving Custom Communication Protocols. He mentioned they ported C++ code over to the GPU. The unoptimized version was 7 times slower on the GPU than the CPU. However, they do many evaluations using the same function, and because there are so many compute threads in the GPU, the overall time was a factor of 7 faster. Similarly, Haque et al. showed that a 4 core desktop machine, properly tuned, was "only" about 5x slower than a GPU card.
It looks like GPU computing is currently the approach to take if you do a lot of evaluation of similar tasks, assuming you have the GPUs and programming time available. That performance (and the novel way of computing) interests people who might otherwise use OpenMP.
Cloud computing is another hotness. Alex Martelli was recently interviewed by Larry Hastings in Radio Free Python episode #2. At 33:47 Larry asked about Python's global interpreter lock and Alex's reply was:
I hate threading anyway. Multiprocessing is the way to go, and message-passing, not shared memory. That just doesn't scale. I use multithreading so I can use all of my 16 cores, or whatever is the average number of cores in a machine these days. Big furry deal. I've got a few thousand servers waiting for me in the data center and how do I use those with threading?The topic comes up several times in the ensuing discussion.
What good indeed is OpenMP, which might be used for a 16 node machine, if you're working on problems which involve 10,000 distributed servers?
Even single nodes have multiple cores these days, and a good OpenMP implemenation might help make good use of the nodes in that cloud. However, you have to compare OpenMP to traditional POSIX multithreading. OpenMP works for C/C++ and Fortran, but not for Python nor (it seems) Java, nor other languages which support pthreads. You're out of luck if you want to use OpenMP with one of those other languages.
Some things scale up wonderfully well by adding one or two OpenMP directives, but parallelism is rarely as trivial as giving a few hints to the compiler. I think that the non-trivial cases of parallelizing with OpenMP are about as much work as using pthreads, or a system like Grand Central Dispatch. I'll work through an example of doing that in my next essay.
I do believe that OpenMP scales better than these alternatives for some cases, in part because the compiler is doing the work rather than using a library API. My tests so far show that pthreads and OpenMP have about the same scaling with two processors, and I need four or more cores to show a strong OpenMP advantage.
Most desktop/laptop computers just don't yet have 8+ cores. (Alex Martelli said otherwise, but perhaps he's talking about Google's data centers.) Most people develop for their own computers, which lessens the incentive to work on good multicore scaling.
I have a four-core machine, and I'm willing to write a Python extension in C which uses OpenMP. Even then I've run into some difficulties. It took a while but I figured out how to configure Python's setup.py so it includes the right "use OpenMP" flag for each compiler. It includes a hard-coded list of compilers which do and do not support OpenMP. Also, did you know that on a Mac you must run OpenMP tasks in the main thread, and not in a pthread? Otherwise your program crashes; even when you have a single OpenMP thread! I had to figure out a workaround so I could use my library unchanged inside Django.
People are interested in OpenMP development, but some who might use OpenMP are drawn to other technologies. Some tasks are very appropriate for OpenMP, but they are almost as appropriate for other, more common technologies. OpenMP scales well, but most people don't have the hardware where OpenMP shines. Even when they do, they have to work in one of a handful of languages, and in somewhat restricted circumstances.
All these contribute to diminishing OpenMP utilization in the community.
OpenMP vs. POSIX threads #
A few years ago I heard about OpenMP. It's a form of multi-threaded programming meant to make good use of multiprocessor and multicore hardware.
Earlier this year, I read Anatomy of High-Performance 2D Similarity Calculations, which used OpenMP as part of their Tanimoto search algorithm. This summer, Kim Walisch contributed OpenMP variations to my popcount benchmark.
The changes to the code were almost trivial, so I asked Kim if he would help me add OpenMP support to chemfp. He did, and it will be part of the upcoming 1.1 release.
OpenMP is only one of many ways to make effective use of multiple processors. Another common way is through POSIX threads, or its equivalent on Windows. A third is to spawn off a new process and use IPC to communicate with it.
How do these techniques compare to each other?
I decided to use Python 3.2's new concurrent.futures module to handle the multithreaded and multiprocess cases, or rather, the backport to 2.x. This merges the underlying "threading" and "multiprocessing" APIs into a common form, based on the "future" concept for asynchronous programming. I quickly found that the multiprocess API had too much overhead so I won't talk about it.
My test case computes and stores the NxN Tanimoto similarity matrix between a set of fingerprints. I took N=110885 compounds from the ZINC data set and generated 2048-bit fingerprints using RDKit's hash fingerprint. The chemfp fingerprint type is "RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=4 useHs=1".
I don't actually save the 12,295,483,225 values. For one, simple symmetry reduces that to 6,147,686,170 values. For another, in the problems I'm interested, I can ignore similarities below a threshold. Instead, chemfp internally uses a sparse matrix format.
For this test I used a simple parallelization. I break up the rows of the matrix into bands, and fill in the parts of the upper-right triangle which are at or above the threshold. In the OpenMP version, all the OpenMP threads work on a single band. In the concurrent.futures version, each thread processes its own band. These differences fall naturally out of the how those two APIs work.
Once I got both implementations working, debugged, and optimized, I could finally do some performance numbers. I wanted to see how OpenMP and pthreads scale over a range of processors and range of threshold values. My desktop has two dual-core CPUs, so I decided to rent time on a "High-CPU Extra Large Instance" Amazon EC2 node. It has 8 nodes of the form:
vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz stepping : 10 cpu MHz : 2333.336 cache size : 6144 KB fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni ssse3 cx16 sse4_1 hypervisor lahf_lm bogomips : 4666.67 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management:Sadly, this is not a machine which supports the POPCNT instruction, so the chemfp popcount code fell back to Imran Haque's SSSE3-based implementation.
The machine has about 6GB of free memory. I knew from testing on my desktop that I only needed 4.1 GB for the worst case, so I didn't run into memory problems.
I ran my benchmark code through set of combinations of the OpenMP vs. pthreads, with thread counts from 1 to 8, and with threshold values of 1.0, 0.99, 0.97, 0.95, 0.93, 0.9, 0.88, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, and 0.5. I also ran the values several times in order to get some idea of the timing variations. Yes, that took about 20 hours to run.
You can get the raw
data if you're really interested. I used matplotlib to make a
couple of 3D plots. First, here's the overall times:
You can see that the OpenMP code (in red) is usually faster than the
pthread code (in blue). The exception is for thresholds of 0.55 and
lower. BTW, a threshold of 0.5 finds 285,371,794 matches in the NxN
matrix, which means this stores a few gigabytes of data.
To make more sense of this data, here's a plot of the speedup, defined
as T0/T where T0 is the fastest single-threaded time for a given
threshold and T is the fastest time for a given number of
threads. Perfect speedup would give a value of 8 for 8 processors.
The region near threshold=1.0 is so jagged because the search time is
less than the variability in the system, and is close to the minimum
time resolution size of 1 second.
Most people in this field use thresholds in the range 0.7-1.0. It's obvious from this graph that OpenMP is the right solution. It's almost always faster, and overall it makes much better use of a multi-core machine.
ECFP-like fragments in PubChem #
Previously I posted about unique fragments in PubChem. That used my molecular subgraph enumeration algorithm. In this essay I'll report some results from looking at the unique bit counts from RDKit's MorganFingerprint algorith, which is an ECFP variant.
My first graph in the previous essay shows that there are about 2 million unique fragments of size up to 7 in PubChem, and that the second 1/2 of the data files contained few fragments which weren't in the first 1/2. This suggests that there aren't that many substructures of size 7, compared to the number of possible structures of size 7, which is quite curious.
Rather, I expect that the number of unique fragments should level off with enough molecules. In the simplest case, there are 112 elements and 5 elements which can be aromatic, for a total of 117 possible unique atom types. I found 110 of them in my PubChem subset.
Similarly, I found only 1103 unique fragments with two atoms. The breakdown as a function of fragment length is:
- Length 1: 110 unique fragments
- Length 2: 1103 unique fragments
- Length 3: 4209 unique fragments
- Length 4: 19398 unique fragments
- Length 5: 86336 unique fragments
- Length 6: 364342 unique fragments
- Length 7: 1447488 unique fragments
How many possible substructures are there of size 7? Assuming only 10 atom types and ignoring cycles and different bond types gives about 10 million. There are 1+1+2+6+21+112+853=996 connected subgraphs of up to size 7, and I'll guess that 800 of them are chemically accessible. I'll hazard 2**6 different bond types, so about 500 billion possible substructures. That's over three orders of magnitude larger than what I see.
It's curious because it means that any substructure-based fingerprint using up to 7 atoms has at most a small multiple of ~2,000,000 different values. Sure, a hash algorithm might potentially generate values in the range 0-232, but only a few million of them will be actually be generated.
(Some algorithms will generate multiple bits per feature, eg, features with 1 or 2 atoms generates a single bit while features with 3 or more atoms generates two bits. This acts as a simple weighting scheme. There's a perfect correlation between those two bits, so I'm not counting the second one as meaningful.)
Fingerprint statistical models often assume a roughly Bernoulli process, and the number of unique features is unbounded, though increasingly rare. This observation suggests that there is actually an upper limit to the size, which changes the distribution type slightly.
Is this observable in other fingerprints? I used RDKit's MorganFingerprint algorithm, which is a variation of the ECFP "extended connectivity fingerprint." I used radii values of 1, 2, 3, 4, and 5, and with the other parameters at their default. Each step includes increasingly distant information, so should be more diverse.
The following shows the number of unique bits found as a function of the number of molecules processed. The molecules are ranked by PubChem id, although it doesn't include all of the PubChem structures since RDKit couldn't process all of the structures. (It complains about a number of bad valences.) A more complete analysis would randomize the structures to remove local coherence effects.
That graph is almost impossible to understand because the dynamic
range is so large. Log and log-log scales don't help either. The
best solution is to normalize by the maximum number for each
graph. That gives:
This sort of curve is a species discovery curve. It appears to show that the MorganFingerprint(radius=1) saturates at around 100,000 different bit values, and radius=2 might saturate at around 3.5 million unique values. This makes sense, as the radius=2 fingerprint corresponds to about 6-7 atoms, and I found 2 million unique values. (An average branching factor of 2.5 gives 2.5**2 or 6.25 atoms. However, a local branching factor of 3 gives 9 atoms, which adds more unique values to the fingerprint.)
I'll guess that there's under 40 million unique bits for radius=3 but it becomes harder to estimate. As the radius increases, the trend in the diversity of new values clearly gets closer to linear, which means there's less and less saturation. I can't predict the total number of unique values for radius=5 because it's still too flat.
The species growth curve is often fit as A(1-exp(-bx)) or A(log(b*x-1)). The first has a fixed limit, the latter implies there is no upper bound. This case is somewhere in the middle: for good chemical reasons, there's a large but finite number of possible ways to arrange a fixed number of atoms. For the smallest fragment size (1 atom), we are at that limit. For larger sizes, we are nowhere near the chemical limit, and I think a log fit works best.
Equally obvious, I would need to randomize the input order in order to get a smoother curve so I could make that prediction. But the end of the holidays was a couple of days ago and I need to get back to paying work.
Unique fragments in PubChem #
For reasons I'll get into later, I wanted to get an idea of the subgraph distribution of PubChem. That is, given my method for molecular subgraph enumeration, create all subgraphs of up to size 7 atoms and get an idea of how common they are. More specifically, atom uniqueness depends only on the atomic element and aromaticity, as assigned by OEChem, and the unique bond categories are "single-or-aromatic", double, and triple.
Last month I downloaded 2,138 sdf.gz files from PubChem and did structure perception with OpenEye's OEChem. Starting a couple of weeks ago, I use my subgraph enumeration algorithm to process 1,724 of them. For some reason, it stopped at that point. Since it took 7.5 days to process those files, and the data set is already a bit ungainly, I decided to leave the full analysis for another time and to not figure out what happened with the processing.
In the 1,724 files are 21,570,907 PubChem records and my enumeration found 1,925,185 unique substructures.
I kept track of the number of unique fragments per input file and the
running total number of unique fragments over all of the files,
plotted here:

You can see that 50% of the unique fragments are in the first 25% of
the data files and essentially all are found in the first 50% of the
files. (The number does increase after the 1000th file, but it's very
slow.) It's also interesting to see the internal structural diversity
in the different files. I suspect there are some large regions made
from contributed combinitorial libraries.
The unique fragments which exist in the most number of records are:
21387437 C 20195255 O 19959057 c 19892743 cc 19755355 ccc 19457485 cccc 19270867 CC 19015890 ccccc 18599872 cccccc 18488545 c1ccccc1 18386628 N 17672171 Cc 17324074 Ccc 17109361 CN 16985355 Cccc 16533358 C=O 16522121 Ccccc 15993406 Cc(c)c 15759069 Cc(c)cc 15508521 CcccccYou shouldn't be surprised to see that carbon is found in 21,387,437 of the 21,570,907 structures.
I made a distribution plot of the fragments, where the horizontal axis
is rank order (C then O, cc, and so on). I show it at a few different
scales in order to get a better understanding of the
distribution. It's quite obviously *not* a Zipf distribution.

The vertical axis is the count in millions. You can see that the 10,000th most common substructure is in a very small percentage of the structure; it's actually 0.5%.
At the other end of the list, 478,278 fragments (24.8%) exist only once (like C#NF), 251,372 fragments (13.1%) exist twice (like B#[Cr]), and 132,574 fragments (6.89%) exist thrice. Here's the first 20 values as a table,
1 478278 # In other words, 478,278 substructures exist only once in the data set 2 251372 3 132574 4 100665 5 67536 6 57500 7 42959 8 37983 9 31750 10 28684 11 24016 12 23169 13 18695 14 17659 15 15501 16 14717 17 13452 18 12500 19 11394 20 11276and in graphical form.

Inverted index using Python sets #
I am working on a problem which is very similar to an inverted index. What's an inverted index? Suppose I want to find a book on "sharks." I could search every book in the library until I find one on sharks, but that's tedious, and every time I do a new search I would need to re-search all of the books again. (Hopefully I would remember some of what I read the first time!)
Instead, I can invert the task. For every word in every book, make a list of books which contain the word. That also takes a lot of work but I only need to do it once. To search for a book about sharks, I look up the word "shark" and get the small list of candidates books, which I need to search manually. I still need to search them because while 20,000 Leagues Under the Sea mentions sharks, it isn't about sharks, and neither is a book which mentions a "loan shark" or the "land shark" skit.
This list of word to books which contain the word is called an "inverted index."
Compound term searches
I can use it as the basis for more complex queries. I want to find books which mention "whale shark". If the inverted index contains only single words then I get the list of books containing the word "shark" and manually search those for "whale shark", but it would be better if I combined the list of books containing "whale" and the list of books containing "shark" to make a new list of those books containing both "whale" and "shark."
In other words, I find the intersection of those two lists.
An inverted index for letters in words
It's very easy to create an inverted index using Python's "set type." Instead of the usual case of searching a book (or document) for words, I'll show an example of how to search words for letters. On my computer, the file "/usr/share/dict/words" contains 234936 different English words, with one word per line, of which 233614 are unique after I convert everything to lowercase.
I'll turn that into an inverted index where each letter is mapped to the set of words which contain that letter:
import collections
inverted_index = collections.defaultdict(set)
for line in open("/usr/share/dict/words"):
word = line.strip().lower() # ignore case
for letter in word:
inverted_index[letter].add(word)
I'll check the number of inverted indices (there should be one for
each letter), and I'll show the sizes of a couple of them.
>>> len(inverted_index) 26 >>> len(inverted_index["a"]) 144086 >>> len(inverted_index["j"]) 2993This means there are 144086 unique lower-cased words with an "a" or "A" in them, but only 2993 with a "j" or "J". (From here on I'll only mention the lower-case letter even when I mean either lower or upper case.) How many words have both an "a" and a "j" in them?
>>> len(inverted_index["a"] & inverted_index["j"]) 1724
sorted() and heapq.nsmallest()
What are the first 5 of them, alphabetically?>>> sorted(inverted_index["a"] & inverted_index["j"])[:5] ['abject', 'abjectedness', 'abjection', 'abjective', 'abjectly']
I used sorted() because it's a builtin function. However, while it works, it's a bit wasteful to sort the entire list of 1724 items when I only want the first 5. If you need something faster, try heapq.nsmallest (and heapq.nlargest.). It can be faster because it only worries about sorting the needed subset
>>> import heapq >>> heapq.nsmallest(5, inverted_index["a"] & inverted_index["j"]) ['abject', 'abjectedness', 'abjection', 'abjective', 'abjectly']A quick timing test shows that heapq.nsmallest is about 40% faster than sorted()[:5].
Inverted index as a search filter
What about a harder search? Which words contain all 6 vowels (including y) in alphabetical order? The simplest solution is a linear search using a regular expression:
>>> import re
>>> sequential_vowels = re.compile("a.*e.*i.*o.*u.*y")
>>> words = [line.strip() for line in open("/usr/share/dict/words")
... if sequential_vowels.search(line)]
>>> len(words)
8
>>> words
['abstemiously', 'adventitiously', 'auteciously', 'autoeciously', 'facetiously',
'pancreaticoduodenostomy', 'paroeciously', 'sacrilegiously']
This works, but if this type of search will occur often, and if
there's enough memory, and if there's a performance need, then it's
easy to speed up using an inverted index.
Filter using the inverted index
I'll split this search into two stages. The first will filter out the obvious mismatches, and leave a smaller set of candidates for the second stage.
For the first stage, I'll use the inverted index to find the words which contains all of the vowels. The inverted index doesn't know the character order, so once I find the candidates with all of the letters then I'll use the regular expression test from before.
To find the set of words with all of the vowels, I could continue the sequence of "&" binary operators as I did earlier, but that gets to be clumsy when there are six terms. Instead, I'll call set.intersection() with the intersection sets as parameters:
>>> vowels = [inverted_index[c] for c in "aeiouy"] >>> len(set.intersection(*vowels)) 670
The variable "vowels" contains a list of sets, "*vowels" in a function call turns that list parameter into individual parameters, and "set.intersection" creates a new set which is the intersection of all of the sets in vowels.
("set.intersection" used here is actually an unbound method, and it's pretty rare to find Python code where using an unbound method makes sense. The above code is almost identical to "vowels[0].intersection(*vowels[1:])".)
I used two lines above, more for clarity reasons. For myself I would probably put it into a single line:
>>> len(set.intersection(*(inverted_index[c] for c in "aeiouy"))) 670Yes, the inverted index can be done in a single line!
Testing the candidates
The first stage reduced the search space from 233614 words to 670. In the second stage I'll use the regular expression to check which ones contain the vowels in order.
>>> candidates = set.intersection(*(inverted_index[c] for c in "aeiouy")) >>> [word for word in candidates if sequential_vowels.search(word)] ['autoeciously', 'adventitiously', 'facetiously', 'abstemiously', 'sacrilegiously', 'auteciously', 'pancreaticoduodenostomy', 'paroeciously']You can verify that it finds the same matches as before, although because sets are unordered, the result order has changed.
I've shown that the inverted index can be used to make the code more complicated. Is is worthwhile? That is, how much faster is the new code?
My timings show that the old version (which does 233614 regular expression searches) takes 0.092 seconds to run, while the new one takes 0.056 seconds. It's about 40% faster.
Use integers as set elements, not strings
It's easy to go faster still. The core of Python's set itersection works something like this:
new_set = set()
for element in set1:
if element in set2:
new_set.add(element)
This requires a hash comparison for every element in set1. If that
passes then there's an equality test in set2, and if that passes then
there's another hash and possible equality test to insert into
new_set.
The set element are currently strings. String hash and comparisons are very optimized in Python, but integers are even faster. What if I used an index into a word list rather than the word itself? The corresponding code is:
import collections
# Get all of the words into a list of words.
# Ignore words which are the same except for capitalization.
unique_words = set(line.strip().lower() for line in open("/usr/share/dict/words"))
words = sorted(unique_words)
# Map from character to the set of word indicies
inverted_index = collections.defaultdict(set)
for i, word in enumerate(words):
for c in word.lower():
inverted_index[c].add(i)
I can do the inverted index operations just like before, but the
result is a list of indices into the "words" list. For example:
>>> set.intersection(*(inverted_index[c] for c in "jkx")) set([99716, 98492]) >>> for i in set.intersection(*(inverted_index[c] for c in "jkx")): ... print words[i] ... jukebox jackbox
I'll modify the "sequential_vowels" code to use the index-based inverted index instead of the string-based version:
>>> candidates = set.intersection(*(inverted_index[c] for c in "aeiouy")) >>> [words[i] for i in candidates if sequential_vowels.search(words[i])] ['pancreaticoduodenostomy', 'adventitiously', 'abstemiously', 'auteciously', 'sacrilegiously', 'autoeciously', 'facetiously', 'paroeciously']My timings numbers give 0.029 seconds per search, with nearly all of the time spent in the set intersection. Remember that the brute-force linear search takes 0.094 seconds and the original inverted index takes 0.056 seconds, so switching to integer-based indices brings another factor of two performance gain. The overall search with an inverted index is 3x faster than the original regex-based linear search.
Order-dependent performance
Python's set.intersection() actually works more like this:
def intersection(*args):
left = args[0]
# Perform len(args)-1 pairwise-intersections
for right in args[1:]:
# Tests take O(N) time, so minimize N by choosing the smaller set
if len(left) > len(right):
left, right = right, left
# Do the pairwise intersection
result = set()
for element in left:
if element in right:
result.add(element)
left = result # Use as the start for the next intersection
return left
That is, it does a set of pair-wise reductions of list of sets. The
order of the set operations affects the performance! Recall that there
are only 1724 words with "j" in them. If I search for words with "m",
"a", and "j" in them, as
>>> len(set.intersection(*(inverted_index[c] for c in "maj"))) 286then Python computes the intersection of inverted_index["m"] and inverted_index["a"], giving an intermediate set with 41148 hits, which it then intersects with the 1724 "j" elements.
However, if the search order were "jma" then the intermediate set for the intersection of "j" and "m" give only 450 elements, which means only 450 tests against inverted_index["a"].
Both give the same answer, but one requires a lot more work than the other.
Here's evidence of just how big that impact is:
>>> def go(letters):
... t1 = time.time()
... for i in range(1000):
... x = len(set.intersection(*(inverted_index[c] for c in letters)))
... return x, (time.time()-t1)
...
>>> go("jma")
(286, 0.12344503402709961)
>>> go("jam")
(286, 0.2954399585723877)
>>> go("amj")
(286, 6.223098039627075)
>>>
Yes, it's a factor of 50 between the slowest and fastest ordering!
There are some obvious ways to make the Python code faster. The easiest is to process the sets from smallest size to largest. That was proposed in Issue3069 on 2008-06-10, but the patch was not integrated into the Python codebase.
However, that's not necessarily the best strategy. Suppose I want letters with "q", "u", and "c" in them. There are 3619, 75144, and 85679 words with those letters, respectively, so you might think the best sort order is "quc". Testing that hypothesis:
>>> go("quc")
(842, 0.40471911430358887)
>>> go("qcu")
(842, 0.2493000030517578)
shows that "qc" is the more selective pair. This is because "q" and
"u" are highly correlated; there are only
>>> len(inverted_index["q"] - inverted_index["u"]) 1414 words in the data set with a 'q' but without a 'u', while there are 2776 words with a 'q' and no 'c'.
Some years ago I came up with a dynamic algorithm which tries to prefer the set which least matches the reference set.
For the case of three sets, it simplifies to:
def set_intersection3(*input_sets):
N = len(input_sets)
assert N == 3
min_index = min(range(len(input_sets)), key=lambda x: len(input_sets[x]))
best_mismatch = (min_index+1)%N
new_set = set()
for element in input_sets[min_index]:
# This failed to match last time; perhaps it's a mismatch this time?
if element not in input_sets[best_mismatch]:
continue
j = 3-best_mismatch-min_index
# If the element isn't in the set then perhaps this
# set is a better rejection test for the next input element
if element not in input_sets[j]:
best_mismatch = j
else:
# The element is in all of the other sets
new_set.add(element)
return new_set
while the intersection version to sort by size is:
def set_intersection_sorted(*input_sets):
input_sets = sorted(input_sets, key=len)
new_set = set()
for element in input_sets[0]:
if element in input_sets[1]:
if element in input_sets[2]:
new_set.add(element)
return new_set
Here's a head-to-head comparison between the three versions
| pattern | set.intersection | set_intersection3 | set_intersection_sorted |
|---|---|---|---|
| quc | 0.462 | 0.852 | 1.032 |
| qcu | 0.312 | 0.842 | 1.032 |
| ucq | 7.152 | 0.772 | 0.962 |
Of course it's not possible to drop in one of these alternate algorithms because set.intersection can take iterables. These other algorithms would only work for the special case where all of the inputs are sets.
I've submitted this observation as Issue13653. I'm curious to see what will become of it.
Floats and doubles #
In this essay I describe how using floating-point, and especially how mixing float and double values, can cause subtle problems in doing a threshold search.
Suppose you have a function which takes two objects (call them "query" and "target") and returns a similarity score as a float or double.
score = compute_similarity(query, target)In my case, these are cheminformatics fingerprints. Given a set of targets you can ask how many targets are at least "threshold" similar to the given query:
num_targets = 0
for target in targets:
if compute_similarity(query, target) >= threshold:
num_targets += 1
# (This can be be written as:
# num_targets = sum(1 for target in targets
# if compute_similarity(query, target) > threshold)
# but unless you are used to that style it's hard to understand.)
Simple, right?
float or double?
I have code which does this. Up until a few days ago it returned a double. However, the score is the fraction N/M where 0<=N<=M<=2**20. There's no need for 8 bytes of precision when a float is clearly enough. For some of the problem I'm dealing with, that would save a few hundred MB of space.
In theory I could even switch to fixed point, but it's easier to work with a 32-bit float than a 32-bit fixed point integer (much less a 20 bit fixed point integer!).
I converted all of my doubles to float, but my unit tests failed. After I fixed the places where I messed up on the size of allocated space, I still had errors.
One failure came down to a difference in converting "0.9" to double, and "0.9f" to float. Take a look at this simple program:
#includewhich outputs (with many more decimals than needed):#include main() { printf("0.9: float %.20f double %.20f\n", 0.9f, 0.9); printf("0.3: float %.20f double %.20f\n", 0.3f, 0.3); }
0.9: float 0.89999997615814208984 double 0.90000000000000002220 0.3: float 0.30000001192092895508 double 0.29999999999999998890The test calls out to C for the score calculations, but does the threshold tests in Python. Python's floating-point type is based on a C double. When the C code returns ((float)9)/10, the Python/C interface converts that into the Python value 0.89999... . The threshold test compares it to against 0.900000... and rejects it for being too low. It's only low by a smidgeon, but still, too low.
A "smidgeon"
A quick check of all the rational values in my range shows that about 1/2 the time the float value is less than the double, and about 1/2 the time it's the other way. Only about 0.5% of the time are they identical values.
The fundamental problem is that I'm mixing float and double values. If I use only floats or only doubles then this wouldn't be a problem.
One solution I came up with is to lower the input threshold value by a little bit. That is, if the user specifies "0.9", then treat it as the value "0.899999". I know my denominator is at most 2**20 so I could do:
def adjust_input(threshold):
if threshold >= 0.0:
return threshold - 0.5/2**20
return threshold
That "0.5/2**20" is somewhat ad hoc. Even better, I could use the C99 function "nextafter" ("_nextafter" in Visual C) and lower the floating point value by a ulp, which is the smallest possible value. (Python doesn't implement nextafter in the core libraries but there are other solutions.)
Possible side effects
This solution works for the "0.9" case. What are the side effects?
For one, about 50% of the time I will lower a threshold value which is already low enough. This effect will only be visible to people who those who specify an abnormally high number of decimals in the first place.
For another, my code is available as a library, and the parameter takes a double. Do I assume that the input has already been nudged down, or do I apply the nextafter to all of the inputs? I think I need to apply it to all of the inputs.
What about the return values? I expect that users of the library will expect the following to never raise an exception:
for hit in threshold_search(query, targets, threshold=0.9):
if hit.score < threshold:
raise AssertionError
This will fail 1/2 the time because the float value I return from C is
too small, compared to the double. I could make that invariant work by
using nextafter to increase the returned values by a ulp. I don't
think that will cause any problems.
That's a lot of "I think"s. I don't yet have any library users to get feedback. The performance doesn't seem to be affected by using float or double, and I'm not running into memory limitations. I'm therefore going to revert all of my changes and keep things as doubles.
String to float conversion
The mishap occurs because the decimal values 0.9 cannot be written exactly as float or double, which both use base 2. The conversion function (strtod or equivalent) instead returns binary value which is closest to the input decimal number. Similarly, the numbers shown above are the 20-digit decimal numbers closest to the that binary number. This number can be slightly smaller, equal to, or slightly larger than the original input.
You can see this in action, by using an "abnormally high number of decimals." In the following the score between the "nine" case and the "ten" case is exactly 9/10. You can see that the treshold of 0.900000005 and still allows scores of 0.9. This is because 0.900000005 and 0.9 are converted to the same internal number, and because this is my experimental branch which uses floats instead of doubles. If I increase the threshold to 0.900000006 then they are no longer the same, and the threshold test fails.
% cat x.fps FF01 nine FF03 ten % simsearch --queries x.fps x.fps --threshold 0.900000005 #Simsearch/1 #num_bits=16 #type=Tanimoto k=all threshold=0.900000005 #software=chemfp/1.1a1 #queries=x.fps #targets=x.fps 2 nine nine 1.000 ten 0.900 2 ten nine 0.900 ten 1.000 % simsearch --queries x.fps x.fps --threshold 0.900000006 #Simsearch/1 #num_bits=16 #type=Tanimoto k=all threshold=0.900000006 #software=chemfp/1.1a1 #queries=x.fps #targets=x.fps 1 nine nine 1.000 1 ten ten 1.000
Let me say that this is not a serious problem. Other than people like me, who use values like this to probe the internal workings of software, I'm hard-pressed to think of who might be affected by this.
Nonetheless, it's still an intellectual sticking point. If for some reason you specify a threshold of 0.900000004 then I really don't want my software to return scores which I know are 9/10. How can I solve this problem in the abstract?
I've come up with two solutions. In the first, adjust the input (using decimal math) so it's more than 1 ulp away from any rational value which could exist as a score. Easy to say, but I'm not sure how to do that.
In the second, read the input into a rational (like Python's fractions module). Pass the numerator and denominator into the search functions, instead of passing in a floating point value.
My search code score computes the score based on the ratio of two integers, so instead of doing:
# old version
def count_targets(query, targets, threshold):
num_targets = 0
for target in targets:
if compute_similarity(query, target) >= threshold:
num_targets += 1
return num_targets
I can do
# new version
def count_targets(query, targets, threshold_numerator, threshold_denominator):
num_targets = 0
for target in targets:
score_numerator, score_numerator = compute_similarity(query, target)
if ( score_numerator * threshold_denominator >=
threshold_denominator_numerator * score_denominator):
num_targets += 1
return num_targets
This will work, but who wants to use an API where you pass in the
threshold as the numerator and denominator, rather than a single
threshold value?
So yes, I'm going to stay with doubles throughout my entire system. A consistent floating point representation just makes life easier.
f2pypy #
There's a bit of discussion going on about the role of PyPy in scientific computing with Python. I spent a few days of the last week to add more ... shall I say "fuel?" .. to the discussion. I wrote a new back-end to f2py called f2pypy which generates a Python module to a shared library based on using ctypes. The module works (somewhat) with CPython, and does not work with PyPy because there's no way yet to pass a pointer to the array data to a ctypes function. (That's a minor detail which isn't hard to implement.)
What it shows is a real mechanism to get PyPy to support existing Fortran libraries already supported by f2py definition files.
NumPy isn't used in all scientific software
There is definitely place for PyPy in scientific computing even now. There are entire branches of science which have little overlap with the strengths of SciPy. I've been a full-time software developer for computational chemistry for 16 years, and have only used NumPy a few times.
One time I needed to compute the generalized inverse matrix. It was in a command-line program called by another process, of all things, and to my annoyance the "import numpy" on the cluster file system was noticably long. I forgot what the numbers were then, but the current numpy import adds 145 (yes, 145!) modules to sys.modules, and 107 of them start with "numpy." Our Lustre configuration did poorly with file metadata, and I think it was over a second to do the import.
I brought this up on the numpy list. While they made some changes, it was pointed out that I am not their target user. The "import numpy" also does an "import numpy.testing" and "input numpy.ctypeslib" and the other imports so that people could use numpy.submodule without an extra explicit import line, and because most people working with numpy are in working in a long-lived interactive session or job, so the startup performance isn't a problem.
I happen to disagree with their choice. "Explicit is better than implicit" and all that. But my point is not to argue for them to change but to give a specific example of how the goals of the NumPy developers can be different than the goals of other scientific programmers.
How do I use Python in science research?
A lot of what I do involves communicating with command-line executables. These are often written by scientists, and most are designed to be run directly by people, and not be other software. Most of the time is spent in the executable, so it doesn't matter if I'm using CPython or PyPy.
There are several cheminformatics libraries for Python. OpenBabel and OEChem use SWIG bindings, RDKit uses Boost, and I don't know what Indigo and Canvas use. Migrating these to PyPy will be hard. I hope that someone is working on SWIG bindings, but it looks like the PyPy developers don't want to commit to a C ABI. (See below.)
There's also code where there are no libraries, and for those I write the code in Python, and sometimes my own extension for C. For some of these case the 3x and higher performance of PyPy would be great. I also know a lot of ways for my CPython-based code to talk to PyPy-based code.
I used to develop software for bioinformatics and structural biology, and my observations still hold for those fields. One of the Biopython developers, Peter Cock, writes:
Regarding Biopython using NumPy, we're already trying it out under PyPy. Large chunks of Biopython do not use NumPy at all, although there a few problems on PyPy 1.6 (one due to a missing XML library, bug filed), most of that seems to work." [*]He continues with a list of some of what doesn't work.
Support for existing libraries
That said, I know that a lot of people depend Python bindings to existing libraries. These use the C API directly, or through auto-generated interfaces from f2py, Cython, Boost, SWIG, and many more. There's been 10+ years to develop these tools for CPython, and still very little time to adapt them to PyPy.
Relatively few extensions use the ctypes module, which is Python's "other" mechanism for calling external functions. Unlike the C API, this one is also portable across Jython, Iron Python, and PyPy. Obviously, if everyone used ctypes then there wouldn't be a problem. Why don't they?
One is the performance. Calling math.cos() is 8 times faster than doing a LoadLibrary() of libm and calling cos() that way. This is of course the worst case. But that's a CPython limitation. Pypy's ctypes call interface is faster than CPython calling a C extension:
% cat x.py
import ctypes
m = ctypes.cdll.LoadLibrary("/usr/lib/libm.dylib")
cos = m.cos
cos.argtypes = [ctypes.c_double]
cos.restype = ctypes.c_double
% python -mtimeit -s "from x import cos" "cos(0)"
1000000 loops, best of 3: 0.676 usec per loop
% python -mtimeit -s "from math import cos" "cos(0)"
10000000 loops, best of 3: 0.0811 usec per loop
% pypy -mtimeit -s "from x import cos" "cos(0)"
10000000 loops, best of 3: 0.0332 usec per loop
% pypy -mtimeit -s "from math import cos" "cos(0)"
100000000 loops, best of 3: 0.0047 usec per loop
although you can see it's still slower than using a built-in function.
Another reason to not use ctypes is that C/C++ library authors do interesting things with the API. One library I used has public API functions like "dt_charge(atom)" to get the formal charge of an atom, but used a number of #define statements to change those names to the internal name. That example became "dt_e_charge". It also defined certain constants only in the header files. This information isn't in the shared library.
I know at least one vendor which only ships a static library, and not a shared library. Apparently bad LD_LIBRARY_PATHs was such a support headache that they decided it wasn't worth it. (I think they are right.) There's no way to get ctypes to interface to a static library.
A fourth problem is lack of support for C++ templates. That clearly needs a compiler, which ctypes doesn't do.
PyPy needs a (semi-)stable C ABI; can you help?
Based on the above, there will clearly always be a need for compiler-based Python extensions, including PyPy extensions. That means there needs to be some sort of ABI that those extensions can program against.
I don't know what that would look like, and I think the PyPy developers think it's still too early to stablize on it. It may well be; but I think it's because there's no one on the group who wants to work on the task.
They were more than happy last year to show a proof-of-concept interface from PyPy to C++ using the run-time type information added by the Reflex system. (Yeah, I had never heard of it either.) So they have nothing against working with an existing ABI. Do you want to offer one?
I wrote "semi-" in the title because it wasn't until Python 3.2 that CPython got a stable ABI. PyPy notably does have emulation support for some of the CPython 2.x ABI but there are problems. Some modules use the ABI incorrectly, and it works for implementation-specific reasons. (For example, bad reference counts.)
If you are going to work on this, I think it would make sense to target the 3.2 ABI and to include instrumentation to help identify these problems.
The best for me would be if you develop some SWIG/ABI interface. This might just be to produce a bunch of stub functions and a ctypes definition for them. (Hmm, wasn't there a C++ to C SWIG interface?)
f2pypy: Experimental Fortran bindings
The above is talk and hand-waving. Code's also good. There was a PyPy sprint this week and I decided to join in for a few days and prototype an idea I've been thinking about: f2pypy. It's a variation of f2py which generates Python ctypes bindings which PyPy could use to talk with shared libraries implemented in Fortran.
At the end of several days of work, I got f2pypy to generate a Python module based on the "fblas.pyf" code from SciPy. I could import that library in CPython and (for the few functions I tested) get answers which matched the fblas module in SciPy. I could also use pypy to call some of the functions, but PyPy's "numpy" implementation is not mature enough. Its array objects don't yet support the ctypes interface, so I was unable to call out to the shared library. I could only call the scalar-based functions.
The code is definitely incomplete. Even my CPython-based tests fail some of the the "test_blas.py" from SciPy (I don't implement "cblas" and I think one of the tests depends on Fortran order instead of C order.) It's a proof-of-concept which shows that this approach is definitely viable, and it shows some of the difficulties in the approach.
My point though is that it opens new possibilites which aren't available in NumPy. For example, suppose you want to use one of the BLAS functions in your code. Every Mac includes a copy of BLAS as a built-in library. Instead of making people install SciPy, what about shipping the ctypes module description instead, and using that interface? You can ship pure Python code and still take advantage of platform-optimized libraries!
I earlier highlighted the performance problems in CPython's ctypes interface. But this is PyPy. They already have cross-module optimizations for Python calling Python. There's no reason why those can't apply to ctypes-based functions. (Or perhaps it's already there? I've not tested that.)
How does it work?
Fortran bindings are nice because they don't have the same preprocessor tricks that I mentioned earlier. Pearu Peterson wrote the excellent f2py package starting some 10+ years ago. It has several ways to work with Fortran code. The one I used was to start with a "pyf" definition file and generate Python code using a new back-end.
I figured out how to get SciPy to generate the pyf file for the BLAS library. (The SciPy source uses a template language during the build process to generate the actual code.) I used f2py's "crackfortran" module to parse the pfy file and get the AST. It's a small tree so perhaps I should call it an abstract syntax bush.
The f2py code generate the Python/C extension code based on the AST. My f2pypy code is basically another back-end, which generates ctypes-based code in Python.
The trickiest part was support for C code. Some of the pyf definition lines contain embedded C code. Here I've gathered three examples:
integer optional, intent(in),check(incx>0||incx<0) :: incx = 1 integer optional,intent(in),depend(x,incx,offx,y,incy,offy) :: n = (len(x)-offx)/abs(incx) callstatement (*f2py_func)((trans?(trans==2?"C":"T"):"N"),&m,&n,&alpha,a,&m,x+offx,&incx,&beta,y+offy,&incy)I used Fredrick Lundh's wonderful essay on Simple Top-Down Parsing in Python to build a simple C expression parser, which builds another AST. With a bit of AST manipulation, and symbol table knowledge (I need to know which inputs are scalars and which are vectors), I could generate output strings like:
def srot(..., incx = None, ...):
...
if incx is None:
incx = _ct.c_int(1)
else:
incx = _ct.c_int(incx)
if not ((((incx.value) > 0) or ((incx.value) < 0))):
raise ValueError('(incx>0||incx<0) failed for argument incx: incx=%s' % incx.value)
and the more complicated:
_api_cgemv((("c") if (((trans.value) == 2)) else ("t")) if ((trans.value)) else
("n"), (m), (n), (alpha), a.ctypes.data_as(_ct.POINTER(_complex_float)), (m),
(x if ((offx.value)) == 0 else x[(offx.value):]).ctypes.data_as(_ct.POINTER(_complex_float)),
(incx), (beta),
(y if ((offy.value)) == 0 else y[(offy.value):]).ctypes.data_as(_ct.POINTER(_complex_float)),
(incy))
I definitely do not generate optimized code. I decided to work completely in terms of ctypes scalars and numpy arrays, even for the check() statements. PyPy doesn't optimize that yet, and I think someone else could do a better job by only doing the conversion as part of the call to the Fortran code.
Usage
To generate the new module on a Mac (I don't know the shared library name for other OS installations):
$PYTHON -m f2pypy tests/fblas.pyf -l vecLib --skip cdotu,zdotu,cdotc,zdotcThis generates "fblas.py". I have some test code for that module
% python test_fblas.py
python test_fblas.py
...........F...
======================================================================
FAIL: test_srot_overwrite (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_fblas.py", line 116, in test_srot_overwrite
assert x is x2
AssertionError
----------------------------------------------------------------------
Ran 15 tests in 0.006s
FAILED (failures=1)
This says that "numpy.array(.. copy=False)" makes a new reference,
while the internal code f2py uses passes back the same object, so a
real implementation will need to handle that detail.
Here's the same output from pypy:
% pypy test_fblas.py
.EE.EEEFEEEE.E.
======================================================================
ERROR: test_dnrm2 (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_fblas.py", line 166, in test_dnrm2
E(m.dnrm2(x), float((numpy.array([1+1+16+81], "d")**0.5)))
File "/Users/dalke/cvses/f2pypy/fblas.py", line 1192, in dnrm2
return _api_dnrm2((n), (x if ((offx.value)) == 0 else x[(offx.value):]).ctypes.data_as(_ct.POINTER(_ct.c_double)), (incx))
AttributeError: 'numarray' object has no attribute 'ctypes'
======================================================================
ERROR: test_drot (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_fblas.py", line 153, in test_drot
E(m.drot(1,2,3,4), (numpy.array(11.0, dtype="d"), numpy.array(2.0, dtype="d")))
File "/Users/dalke/cvses/f2pypy/fblas.py", line 181, in drot
y = _np.array(y, 'd', copy=overwrite_y)
TypeError: __new__() got an unexpected keyword argument 'copy'
....
See the dots up there with the "E"rrors? That 4 of the scalar-based
tests pass. What fails is the vector-based code. The "ctypes" gets the
pointer to the numpy array data, and PyPy doesn't support the "copy"
parameter of numpy.array.
Still, it does pass some tests!
Future
I don't use Fortran modules. I don't use f2py. I don't use numarray. I will not be involved in this project for the future. (I do a lot of integration work, and I do a lot of parsing and AST transformations, so that part of this effort was a very pretty good fit!)
I did this because I wanted to show that PyPy can support traditional numeric software libraries and that there is a relatively doable path for migration from existing numpy code to "numpypy" code.
I will not be maintaining the project in the future. If you want to take it on, feel free. I've contributed it to the PyPy project, and it has its own repository. Feel free also to leave a comment or ask me questions.
Copyright © 2001-2010 Dalke Scientific Software, LLC.


