<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"><channel><title>Andrew Dalke's writings</title><link>http://www.dalkescientific.com/writings/diary/index.html</link><description>Writings from the software side of bioinformatics and
  chemical informatics, with a heaping of Python thrown in for good
  measure.  Code to taste.  Best served at room temperature.</description><lastBuildDate>Fri, 20 Jan 2012 09:11:39 GMT</lastBuildDate><generator>PyRSS2Gen-1.0.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Python's concurrent.futures</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/01/19/concurrent.futures.html</link><description>&lt;P&gt;

In this essay I'll describe how to use the concurrent.futures API from
Python 3.2. Since I'm still using Python 2.7, I'll use Alex
Gr&amp;ouml;nholm's &lt;a href="http://pypi.python.org/pypi/futures"&gt;back
port&lt;/a&gt; instead.

&lt;/P&gt;&lt;P&gt;

&lt;a href="http://www.python.org/dev/peps/pep-3148/"&gt;PEP 3148&lt;/a&gt; gives
the motivation for the new concurrent module:

&lt;blockquote&gt;
Python currently has powerful primitives to construct multi-threaded
and multi-process applications but parallelizing simple operations
requires a lot of work i.e. explicitly launching processes/threads,
constructing a work/results queue, and waiting for completion or some
other termination condition (e.g. failure, timeout). It is also
difficult to design an application with a global process/thread limit
when each component invents its own parallel execution strategy.
&lt;/blockquote&gt;

Basically, using "threading" and "multiprocessing" is harder than
it should be.


&lt;/P&gt;
&lt;h2&gt;The guiding problem: analyze web logs&lt;/h2&gt;
&lt;P&gt;

My web site archives the daily server logs. Filenames are of the form
"www.20120115.gz". Each access is a single line in "&lt;a
href="http://httpd.apache.org/docs/2.2/logs.html"&gt;combined log
format&lt;/a&gt;." Here's an example line:

&lt;pre class="code"&gt;
198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] "GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"
&lt;/pre&gt;

It contains the host IP address, date, URL path, referrer information,
user agent, and a few more fields.
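
As a rough illustration of where those fields sit, here is a minimal sketch that pulls them out of the sample line with plain string splitting. The "parse_line" helper is hypothetical (it is not part of the analysis code in this essay), and it assumes that no quoted field contains an embedded double quote. It is written so that it also runs on Python 3.

```python
# Sample line copied from the essay; field positions follow the combined log format.
line = ('198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] '
        '"GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" '
        '"Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"')

def parse_line(line):
    # Hypothetical helper: splits on double quotes, so it assumes that no
    # quoted field contains an embedded double quote.
    quoted = line.split('"')
    host = line.split()[0]
    timestamp = line.split("[")[1].split("]")[0]
    method, path, protocol = quoted[1].split(" ")
    status, size = quoted[2].split()
    return {"host": host, "timestamp": timestamp, "path": path,
            "status": int(status), "size": size,
            "referrer": quoted[3], "user_agent": quoted[5]}

fields = parse_line(line)
print(fields["host"], fields["path"], fields["status"])
```

The size is left as a string because a server may log "-" when no bytes were sent.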

&lt;/P&gt;&lt;P&gt;

I have 169 files which I want to analyze. &lt;tt&gt;gzcat *.gz | wc -l&lt;/tt&gt;
says there are 1,346,595 records. I'll use this data set to show some
examples of how to use concurrent.futures.

&lt;/P&gt;
&lt;h2&gt;Number of accesses per day (single-threaded)&lt;/h2&gt;
&lt;P&gt;

To start, how many log events are there per day?

&lt;pre class="code"&gt;
import glob
import gzip

for filename in glob.glob("www_logs/www.*.gz"):
    with gzip.open(filename) as f:
        num_lines = sum(1 for line in f)
    print filename.split(".")[1], num_lines
&lt;/pre&gt;

Note: gzip files didn't support context managers until Python 2.7. If
you are on Python 2.6 then you'll get the error message

&lt;pre class="code"&gt;
AttributeError: GzipFile instance has no attribute '__exit__'
&lt;/pre&gt;

When I run that I get output which looks like:

&lt;pre class="code"&gt;
20110801 7305
20110802 7594
20110803 7470
20110804 7348
20110805 7504
20110806 4774
20110807 4870
20110808 9815
...
20120113 18124
20120114 9245
20120115 8100
20120116 14117
&lt;/pre&gt;

That's too detailed, and hard to interpret. A graph would be
nicer. Here it is:

&lt;/P&gt;&lt;P&gt;

&lt;img src="http://dalkescientific.com/writings/diary/concurrent_accesses.png" /&gt;

&lt;/P&gt;&lt;P&gt;

I seem to get more people during the work week than the weekend, and
one of my other essays got on Hacker News in early January.

&lt;/P&gt;&lt;P&gt;

I made that plot using matplotlib's "pylab" API:

&lt;pre class="code"&gt;
import glob
import gzip

from pylab import *
import datetime

dates = []
counts = []

for filename in glob.glob("www_logs/www.*.gz"):
    with gzip.open(filename) as f:
        num_lines = sum(1 for line in f)
    date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")

    dates.append(date)
    counts.append(num_lines)

plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

That code is a bit ugly, so I'll clean it up and put it into a form
which eases the transition to the parallel code:

&lt;pre class="code"&gt;
import glob
import gzip
import datetime
import time

def count_lines(filename):
    with gzip.open(filename) as f:
        num_lines = sum(1 for line in f)
    date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
    return (date, num_lines)

filenames = glob.glob("www_logs/www.*.gz")

dates = []
counts = []
for filename in filenames:
    date, count = count_lines(filename)
    dates.append(date)
    counts.append(count)

## Believe it or not, this next line does the same as the previous block(!)
# dates, counts = zip(*(count_lines(filename) for filename in filenames))

from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
&lt;/pre&gt;


&lt;/P&gt;
&lt;h2&gt;It's slow. Make it faster!&lt;/h2&gt;
&lt;P&gt;

That code takes 5.5 seconds to read the 1.3 million lines. I have a
four core machine - surely I can make better use of my hardware!

&lt;/P&gt;&lt;P&gt;

I'll start with multiple threads. Python has supported threads since
the 1990s, but as we all know, CPython has the Global Interpreter Lock
which prevents multiple threads from running Python code at the same
time. On the other hand, this task is doing file I/O, and gzip
uncompression in code which might release the GIL. Perhaps threads
will work here?

&lt;/P&gt;&lt;P&gt;

I'll use a very standard approach. I'll define a set of jobs, and pass
that over to a thread pool. Each job takes a filename to process as
input, calls the function "count_lines", and returns the timestamp and
number of lines in the file.

&lt;/P&gt;&lt;P&gt;

Here's how you do that with the concurrent.futures API:

&lt;pre class="code"&gt;
import glob
import gzip
import datetime
&lt;b&gt;from concurrent import futures&lt;/b&gt;

def count_lines(filename):
    with gzip.open(filename) as f:
        num_lines = sum(1 for line in f)
    date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
    return (date, num_lines)

filenames = glob.glob("www_logs/www.*.gz")

dates = []
counts = []
&lt;b&gt;with futures.ThreadPoolExecutor(max_workers=2) as executor:
    for (date, count) in executor.map(count_lines, filenames):&lt;/b&gt;
        dates.append(date)
        counts.append(count)

from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()

&lt;/pre&gt;

The "ThreadPoolExecutor" creates a thread pool, in this case with two
workers. You can submit as many jobs as you want to this thread pool,
but only two (in this case) will be processed at a time. The thread
pool is also a context manager, and no more jobs can be submitted once
the context is finished.


&lt;/P&gt;&lt;P&gt;

How are jobs submitted? You can either submit a single job using submit() or
you can submit a number of jobs using the "map()" idiom, which is
what I did here. Remember, this is a Python 3.x API so map() returns
an iterator, and not a list like it does in Python 2.x.
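
To make the two submission styles concrete, here is a minimal sketch using a toy "square" function as a stand-in for a real job. The function and variable names are made up for illustration, and the code is written so that it also runs on Python 3, where concurrent.futures is part of the standard library.

```python
from concurrent import futures

def square(x):
    # Stand-in for a real job such as count_lines(filename)
    return x * x

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    # submit() schedules one call and returns a Future right away
    job = executor.submit(square, 7)
    single_result = job.result()   # blocks until the job has finished

    # map() schedules one call per item and returns an iterator of results
    mapped = executor.map(square, [1, 2, 3])
    is_list = isinstance(mapped, list)   # False: it is a lazy iterator
    mapped_results = list(mapped)

print(single_result, mapped_results, is_list)
```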

&lt;/P&gt;
&lt;h3&gt;What is "map"?&lt;/h3&gt;
&lt;P&gt;

The term "map" comes from functional programming, but functional
programming is not emphasized in the Python language. Instead, we more
often use a list or generator comprehension, or build a list
manually. The following three methods are equivalent:


&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; print [ord(c) for c in "Andrew"]
[65, 110, 100, 114, 101, 119]
&lt;/pre&gt;
&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; print map(ord, "Andrew")
[65, 110, 100, 114, 101, 119]
&lt;/pre&gt;
&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; result = []
&amp;gt;&amp;gt;&amp;gt; for c in "Andrew":
...   result.append(ord(c))
... 
&amp;gt;&amp;gt;&amp;gt; print result
[65, 110, 100, 114, 101, 119]
&lt;/pre&gt;

So "map(count_lines, filenames)" is roughly the same as:

&lt;pre class="code"&gt;
  for filename in filenames:
    yield count_lines(filename)
&lt;/pre&gt;

and "executor.map" does the same thing, only it uses a thread in the
thread pool to evaluate the function.

&lt;/P&gt;&lt;P&gt;

Also, to switch the above code to its nearly identical single-threaded
version, you can use the Python 2.x iterator version of
"map" (itertools.imap) and rewrite the above as:

&lt;pre class="code"&gt;
import itertools
 ...
for (date, count) in itertools.imap(count_lines, filenames):
    dates.append(date)
    counts.append(count)
&lt;/pre&gt;


&lt;/P&gt;
&lt;h2&gt;But is it faster?&lt;/h2&gt;
&lt;P&gt;

No. &lt;tt&gt;;)&lt;/tt&gt;

&lt;/P&gt;&lt;P&gt;

With one thread in the thread pool, the task takes 5.5 seconds. The
overall time is unchanged from the unthreaded version, as we should
expect.

&lt;/P&gt;&lt;P&gt;

With two worker threads, it takes 7.0 seconds - even longer than with
one thread!

&lt;/P&gt;&lt;P&gt;

Three worker threads takes 7.3 seconds, and four threads takes
7.4. This is not a trend you want to see when you need to parallelize
your software.

&lt;/P&gt;&lt;P&gt;

There are two likely candidates for the slowdown. The GIL is the
obvious one, but perhaps my computer doesn't handle parallel disk I/O
that well.

&lt;/P&gt;
&lt;h2&gt;What about multiple processes?&lt;/h2&gt;
&lt;P&gt;

What I'll do is switch from the multi-threaded version to the
multi-processing version. Instead of using a thread pool, I'll have a
process pool, which uses interprocess communications to send the job
request to each process and get the results:

&lt;pre class="code"&gt;
import glob
import gzip
import datetime
from concurrent import futures

def count_lines(filename):
    with gzip.open(filename) as f:
        num_lines = sum(1 for line in f)
    date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
    return (date, num_lines)

filenames = glob.glob("www_logs/www.*.gz")

dates = []
counts = []
with futures.&lt;b&gt;ProcessPoolExecutor&lt;/b&gt;(max_workers=4) as executor:
    for (date, count) in executor.map(count_lines, filenames):
        dates.append(date)
        counts.append(count)
    
from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Did you see the difference? I used a "ProcessPoolExecutor" instead of
a "ThreadPoolExecutor".

&lt;/P&gt;&lt;P&gt;

With that small change, a process pool with only one worker finishes
in 5.6 seconds, which is a bit slower. That's probably due to the
overhead of starting a new process and sending data back and forth.

&lt;/P&gt;&lt;P&gt;

What's exciting is that two workers finish in 3.6 seconds, three
workers in 2.8 seconds, and four workers in 2.6 seconds. It's
obviously not a great speedup (perfect scaling would be 5.5, 2.8, 1.8,
and 1.4 seconds), but I end up cutting my time in half with relatively
little work.
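
One way to put numbers on "not great" is to work out the speedup and parallel efficiency from the timings above. This is only a small arithmetic sketch; the variable names are mine, and the inputs are the wall-clock times already quoted in this essay.

```python
# Wall-clock times quoted in the essay: 5.5 s for the original
# single-threaded code, then the process pool with 1 to 4 workers.
baseline = 5.5
pool_times = {1: 5.6, 2: 3.6, 3: 2.8, 4: 2.6}

rows = []
for workers in sorted(pool_times):
    elapsed = pool_times[workers]
    speedup = baseline / elapsed        # how many times faster than the baseline
    efficiency = speedup / workers      # 1.0 would mean perfect scaling
    rows.append((workers, round(speedup, 2), round(efficiency, 2)))
    print("%d workers: %.2fx speedup, %.0f%% efficiency"
          % (workers, speedup, efficiency * 100))
```

With four workers the speedup is a little over 2x, so each worker is only about half as productive as the single-threaded run.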

&lt;/P&gt;
&lt;h2&gt;Faster, please&lt;/h2&gt;
&lt;P&gt;

At this point it's safe to assume that most of the gzip+line count
code requires the GIL. A quick look at "gzip.py" tells me that, yes,
that is the case.

&lt;/P&gt;&lt;P&gt;

With some non-trivial effort, I could write a specialized C extension
to replace the gzip module. That's overkill for this project. Instead,
my computer has the usual unix utilities so I'll rewrite the
"count_lines" function and let them do the work instead.

&lt;pre class="code"&gt;
import subprocess

def count_lines(filename):
    gzcat = subprocess.Popen(["gzcat", filename],
                             stdout = subprocess.PIPE)
    wc = subprocess.Popen(["wc", "-l"],
                          stdin = gzcat.stdout,
                          stdout = subprocess.PIPE)
    num_lines = int(wc.stdout.readline())
    date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
    return (date, num_lines)
&lt;/pre&gt;

Using this version, my single-threaded time is 3.2 seconds, with two
threads it's 2.0 seconds, three threads is 1.8 seconds, and four
threads is 1.7 seconds.

&lt;/P&gt;&lt;P&gt;

The respective times with the process pool are 3.3 seconds, 2.1
seconds, 1.8 seconds and 1.8 seconds. This means that very little time
in either of these cases is spent holding the GIL, and the slightly slower
multiprocess times likely reflect the extra cost of starting a process
and doing interprocess communications (IPC).

&lt;/P&gt;
&lt;h2&gt;What are the top URLs on my site?&lt;/h2&gt;
&lt;P&gt;

Okay, I admit that the previous section was overkill, but it's fun
sometimes to try out and compare different alternatives. 

&lt;/P&gt;&lt;P&gt;

I want to mine my logs for more information. What are the top 10 most
downloaded URLs?

&lt;/P&gt;&lt;P&gt;

This is the perfect situation for Python's &lt;a
href="http://docs.python.org/library/collections.html#collections.Counter"&gt;Counter&lt;/a&gt;
container. This was added in Python 2.7; see that link for how to
support older versions of Python.

&lt;/P&gt;&lt;P&gt;

I'll start with the simplest single-threaded version; remember that a line in the log file looks like:

&lt;pre class="code"&gt;
198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] "GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"
&lt;/pre&gt;

The following analysis code:

&lt;pre class="code"&gt;
import glob
import gzip
from collections import Counter

counter = Counter()
for filename in glob.glob("www_logs/www.*.gz"):
    with gzip.open(filename) as f:
        for line in f:
            # Extract the path field from the log string
            request = line.split('"')[1]
            path = request.split(" ")[1]
            counter[path] += 1

for path, count in counter.most_common(10):
    print count, path
&lt;/pre&gt;

takes 8.9 seconds to generate this listing:

&lt;pre class="code"&gt;
170073 /favicon.ico
93354 /writings/diary/diary-rss.xml
81513 /dss.css
78961 /images/toplogo_left.gif
78655 /images/spacer.gif
78526 /images/toplogo_right.gif
74223 /images/news_title.gif
26528 /
25349 /robots.txt
16962 /writings/NBN/python_intro/standard.css
&lt;/pre&gt;

That's really not exciting information. In a bit, I'll have it only
display counts for the HTML pages.

&lt;/P&gt;
&lt;h3&gt;A concurrent.futures version&lt;/h3&gt;
&lt;P&gt;

We've determined that Python's gzip reader uses the GIL, so it's
pointless to parallelize the above code using threads.

&lt;/P&gt;&lt;P&gt;

There's another issue. The "counter" is a global data structure, and
that can't be shared across multiple Python processes. I'll have to
update the algorithm somewhat. I'll let each worker function process
a file and create a new counter for that file. Once it's done, I'll
send the counter instance back to the main process for further
processing.

&lt;/P&gt;&lt;P&gt;

Here's a worker function which does that.

&lt;pre class="code"&gt;
def count_urls(filename):
    counter = Counter()
    with gzip.open(filename) as f:
        for line in f:
            request = line.split('"')[1]
            path = request.split(" ")[1]
            counter[path] += 1
    return counter
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

The code in the main process has to kick off all of the jobs, collect
the counters from each file, merge the counters into one, and report
the top hits. The new(ish) Counter object helps make this easy because
the "update()" method sums the values for shared keys instead of
replacing like it would for a dictionary.

&lt;pre class="code"&gt;
merged_counter = Counter()
filenames = glob.glob("www_logs/www.*.gz")

with futures.ProcessPoolExecutor(max_workers=4) as executor:
    for counter in executor.map(count_urls, filenames):
        merged_counter.update(counter)

for path, count in merged_counter.most_common(10):
    print count, path
&lt;/pre&gt;

(You might be asking "How does it exchange Python objects?"  Answer:
Through &lt;a href="http://docs.python.org/library/pickle.html"&gt;pickles&lt;/a&gt;.)
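
Here is a small sketch of the two behaviours just mentioned: how Counter's "update()" differs from a dictionary's, and how a Counter survives a pickle round trip. The paths and counts are made up for illustration, and the code is written so that it also runs on Python 3.

```python
import pickle
from collections import Counter

# Two per-file tallies, as two workers might return them (made-up numbers)
first = Counter({"/index.html": 3, "/robots.txt": 1})
second = Counter({"/index.html": 2, "/favicon.ico": 5})

merged = Counter()
merged.update(first)
merged.update(second)      # Counter.update adds counts for shared keys

plain = dict(first)
plain.update(second)       # dict.update replaces the value for shared keys

# A process pool sends results between processes as pickles;
# a Counter survives the round trip unchanged.
restored = pickle.loads(pickle.dumps(first))

print(merged["/index.html"], plain["/index.html"], restored == first)
```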

&lt;/P&gt;&lt;P&gt;

The above runs in 4.4 seconds, so about half the time of the
single-process version. And after I fixed a bug (I used "counter" in my
report, not "merged_counter"), I got the same values as the
single-threaded version.

&lt;/P&gt;&lt;P&gt;

4.4 seconds is pretty good. As we saw before, Python's gzip reader is
not as fast as calling out to gzcat, so I decided to use a Popen call
instead. Also, I changed the code slightly so it only reports paths
which end with ".html".

&lt;/P&gt;&lt;P&gt;

The final code runs in 3.2 seconds, and here it is:

&lt;pre class="code"&gt;
from collections import Counter
from concurrent import futures
import glob
import gzip
import itertools
import subprocess

def count_urls(filename):
    counter = Counter()
    p = subprocess.Popen(["gzcat", filename],
                         stdout = subprocess.PIPE)
    for line in p.stdout:
        request = line.split('"')[1]
        path = request.split(" ")[1]
        if path.endswith(".html"):
            counter[path] += 1
    return counter

filenames = glob.glob("www_logs/www.*.gz")

merged_counter = Counter()
with futures.ProcessPoolExecutor(max_workers=4) as executor:
    for counter in executor.map(count_urls, filenames):
        merged_counter.update(counter)

for path, count in merged_counter.most_common(10):
    print count, path
&lt;/pre&gt;

It tells me that the 10 most popular HTML pages from my site are

&lt;pre class="code"&gt;
15830 /Python/PyRSS2Gen.html
13722 /writings/NBN/python_intro/command_line.html
11739 /writings/NBN/threads.html
10663 /writings/NBN/validation.html
6635 /writings/diary/archive/2007/06/01/lolpython.html
4525 /writings/NBN/writing_html.html
3756 /writings/NBN/generators.html
3465 /writings/NBN/parsing_with_ply.html
2958 /writings/diary/archive/2005/04/21/screen_scraping.html
2786 /writings/NBN/blast_parsing.html
&lt;/pre&gt;

&lt;/P&gt;
&lt;h2&gt;Resolving host names from IP addresses&lt;/h2&gt;
&lt;P&gt;

My logs contain bare IP addresses. I'm curious about where they come
from. I write about cheminformatics; are any computers from pharma
companies reading my pages? To do that, I need a &lt;a
href="http://en.wikipedia.org/wiki/Fully_qualified_domain_name"&gt;fully
qualified domain name&lt;/a&gt; for each IP address. Moreover, I want to
save the IP address to domain name mapping so I can use it in other
analyses.

&lt;/P&gt;&lt;P&gt;

Here's how to get the FQDN given an IP address as a string.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; import socket
&amp;gt;&amp;gt;&amp;gt; socket.getfqdn("82.94.164.162")
'dinsdale.python.org'
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

DNS lookups take a surprisingly long time; 0.2 seconds on my desktop,
and I understand this is typical. Since I have 117,504 addresses, doing
them one at a time would take about 6.5 hours. On the other hand, all of
that time is spent waiting for the network to respond. This is easily
parallelized.

&lt;/P&gt;
&lt;h3&gt;socket.getfqdn() is single-threaded on a Mac&lt;/h3&gt;
&lt;P&gt;

I tried at first to use multiple threads for this, but that didn't
work. No matter how many threads I used, the overall time was the
same. After a wild goose chase where I suspected that my ISP throttled
the number of DNS lookups, I found the problem.

&lt;/P&gt;&lt;P&gt;

The getfqdn function is a thin wrapper to socket.gethostbyaddr(),
which itself is a thin layer on top of the C function
"gethostbyaddr()". In most cases, the underlying API may only be
called from a single thread. A common solution is to implement a
reentrant version, usually named "gethostbyaddr_r", but the OS X
developers decided that people should use a different API for that
case. ("getaddrinfo ... is a replacement for and provides more
flexibility than the gethostbyname(3) and getservbyname(3)
functions".)  The Python module only calls the single-threaded code,
and uses a lock to ensure that only one thread calls it at a time.

&lt;/P&gt;&lt;P&gt;

The problem is easily solved by using a process pool instead of a
thread pool.

&lt;/P&gt;
&lt;h3&gt;Extracting IP addresses from a set of gzip compressed log files&lt;/h3&gt;
&lt;P&gt;

The first step is to get the IP addresses which I want to convert. I
only care about unique IP addresses, and don't want to waste time
looking up duplicates. The code to extract the IP addresses is
straightforward. Reading the compressed file is not the slow part, so
there's no reason to parallelize this or use an external gzip process
to speed things up.

&lt;pre class="code"&gt;
def get_unique_ip_addresses(filenames):
    # Report only the unique IP addresses in the set
    seen = set()
    
    for filename in filenames:
        with gzip.open(filename) as gzfile:
            for line in gzfile:
                # The IP address is the first word in the line
                ip_addr = line.split()[0]
                if ip_addr not in seen:
                    seen.add(ip_addr)
                    yield ip_addr
&lt;/pre&gt;


I don't want to process the entire data set during testing and
debugging. The above returns an iterator, so I use itertools.islice to
get a section of 100 terms, also as an iterator:

&lt;pre class="code"&gt;
filenames = glob.glob("www_logs/www.*.gz")
ip_addresses = itertools.islice(get_unique_ip_addresses(filenames), 1800, 1900)
&lt;/pre&gt;

(I started with the range (0, 1000), but then ran into the
gethostbyaddr reentrancy problems. I didn't want my computer to do a
simple local cache lookup, so I changed the range to (1000, 1100), then
(1100, 1200) and so on. This shows that it took me a while to figure
out what was wrong!)
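
To see how islice picks a window out of a generator without consuming the rest, here is a minimal sketch. The "unique" helper is a hypothetical cut-down version of get_unique_ip_addresses that skips the file reading, and the addresses are made up. It is written so that it also runs on Python 3.

```python
import itertools

def unique(items):
    # Same first-seen filter as get_unique_ip_addresses, minus the file reading
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item

# Made-up addresses standing in for the first field of each log line
addresses = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2", "10.0.0.4"]

# islice(iterable, start, stop) skips `start` items and stops before `stop`,
# pulling items from the generator only as they are needed.
window = list(itertools.islice(unique(addresses), 1, 3))
print(window)
```

The skipped items still have to be generated, so a window such as (1800, 1900) reads enough log lines to find the first 1900 unique addresses.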

&lt;/P&gt;
&lt;h3&gt;Using "executor.submit()" instead of "executor.map"&lt;/h3&gt;
&lt;P&gt;

How am I going to do the parallel call? I could do a simple

&lt;pre class="code"&gt;
with futures.ProcessPoolExecutor(max_workers=20) as executor:
    for fqdn in executor.map(socket.getfqdn, ip_addresses):
        print fqdn
&lt;/pre&gt;

but then I lose track of the original IP address, and I wanted to
cache the IP address to FQDN mapping for later use. While it might be
possible to use a combination of itertools.tee and itertools.izip, I
decided that "map" wasn't the right call in the first place.

&lt;/P&gt;&lt;P&gt;

The executor's "map" function guarantees that the result order will be
the same as the input order. I don't care about the order. Instead,
I'll submit each job using the "submit()" method.


&lt;pre class="code"&gt;
with futures.ProcessPoolExecutor(max_workers=10) as executor:
    jobs = []
    for ip_addr in ip_addresses:
        job = executor.submit(resolve_fqdn, ip_addr)
        jobs.append(job)
&lt;/pre&gt;

The submit function returns a "&lt;a
href="http://docs.python.org/dev/library/concurrent.futures.html#concurrent.futures.Future"&gt;concurrent.futures.Future&lt;/a&gt;"
object. For now, there are two important things about it. You can ask
it for its "result()", like this:

&lt;pre class="code"&gt;
ip_addr, fqdn = job.result()
&lt;/pre&gt;

The "result()" method blocks until the future has a result. Blocking
is bad for performance, so how do you know which job futures are
actually ready?  Use "&lt;a
href="http://docs.python.org/dev/library/concurrent.futures.html#concurrent.futures.as_completed"&gt;concurrent.futures.as_completed()&lt;/a&gt;"
for that:

&lt;pre class="code"&gt;
for job in futures.as_completed(jobs):
    ip_addr, fqdn = job.result()
    print ip_addr, fqdn
&lt;/pre&gt;
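
The difference in ordering is easy to see with a toy job. In this minimal sketch, "slow_echo" is a made-up stand-in for a network lookup, and I use a thread pool only to keep the example self-contained. The shortest job should come back first from as_completed, while map keeps the input order. It is written so that it also runs on Python 3.

```python
from concurrent import futures
import time

def slow_echo(delay):
    # Stand-in for a network lookup that takes `delay` seconds
    time.sleep(delay)
    return delay

delays = [0.6, 0.05, 0.3]

with futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in the order the inputs were given
    map_order = list(executor.map(slow_echo, delays))

    # as_completed() yields each future as soon as it finishes
    jobs = [executor.submit(slow_echo, delay) for delay in delays]
    completion_order = [job.result() for job in futures.as_completed(jobs)]

print(map_order)
print(completion_order)
```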

The last part of this puzzle is to have the actual job return a
two-element tuple with both the input IP address and the resulting FQDN:

&lt;pre class="code"&gt;
def resolve_fqdn(ip_addr):
    fqdn = socket.getfqdn(ip_addr)
    return ip_addr, fqdn
&lt;/pre&gt;

Put it all together and the code is:

&lt;pre class="code"&gt;
from concurrent import futures
import glob
import gzip
import itertools
import socket
import time

def get_unique_ip_addresses(filenames):
    # Report only the unique IP addresses in the set
    seen = set()
    
    for filename in filenames:
        with gzip.open(filename) as gzfile:
            for line in gzfile:
                # The IP address is the first word in the line
                ip_addr = line.split()[0]
                if ip_addr not in seen:
                    seen.add(ip_addr)
                    yield ip_addr

def resolve_fqdn(ip_addr):
    fqdn = socket.getfqdn(ip_addr)
    return ip_addr, fqdn


filenames = glob.glob("www_logs/www.*.gz")
ip_addresses = itertools.islice(get_unique_ip_addresses(filenames), 1800, 1900)

with futures.ProcessPoolExecutor(max_workers=20) as executor:
    jobs = []
    for ip_addr in ip_addresses:
        job = executor.submit(resolve_fqdn, ip_addr)
        jobs.append(job)

    # Get the completed jobs whenever they are done
    for job in futures.as_completed(jobs):
        ip_addr, fqdn = job.result()
        print ip_addr, fqdn
&lt;/pre&gt;

This processes 100 IP addresses in about 2-8 seconds. The actual time
is highly dependent on DNS response times from servers around the
world. To reduce the variability, I increased the number of IP
addresses I used for my measurements. I found that with 20 processes I
could do about 50 lookups per second, and with 50 processes I could do
about 90 lookups per second. I didn't try a higher number.

&lt;/P&gt;
&lt;h3&gt;Use a dictionary of futures instead of a list&lt;/h3&gt;
&lt;P&gt;

What I did seems somewhat clumsy in that I send the IP address to the
process, and the process sends the IP address back to me. I did that
because it was easy. The module documentation shows another technique.

&lt;/P&gt;&lt;P&gt;

You can keep the jobs in a dictionary, where the key is the future
object (returned by "submit()"), and its value is the information you
want to save. That is, you can rewrite the above as:

&lt;pre class="code"&gt;
with futures.ProcessPoolExecutor(max_workers=20) as executor:
    jobs = {}
    for ip_addr in ip_addresses:
        job = executor.submit(socket.getfqdn, ip_addr)
        jobs[job] = ip_addr

    # Get the completed jobs whenever they are done
    for job in futures.as_completed(jobs):
        ip_addr = jobs[job]
        fqdn = job.result()
        print ip_addr, fqdn
&lt;/pre&gt;

Notice how it doesn't need the "resolve_fqdn" function; it can call
socket.getfqdn directly.

&lt;/P&gt;
&lt;h2&gt;Add a callback to the job future&lt;/h2&gt;
&lt;P&gt;

The conceptual model so far is "create all the jobs" followed by "do
something with the results." This works well, except for latency. I
only processed 100 IP addresses in my example. I removed the
"islice()" call and asked it to process all 117,504 IP addresses in
my data set. The code looked like it wasn't working because it wasn't
giving output. As it turned out, it was still loading all of the jobs.

&lt;/P&gt;&lt;P&gt;

The concurrent module uses an &lt;i&gt;asynchronous&lt;/i&gt; model, and just like
Twisted's Deferred and jQuery's deferred.promise(), there's a way to
attach a callback function to a future, which will be called once the
answer is ready. Here's how it works:

&lt;pre class="code"&gt;
with futures.ProcessPoolExecutor(max_workers=50) as executor:
    for ip_addr in ip_addresses:
        job = executor.submit(resolve_fqdn, ip_addr)
        job.add_done_callback(print_mapping)
&lt;/pre&gt;

When each job future is ready, the concurrent library will call the
"print_mapping" callback, with the job future as its sole parameter:

&lt;pre class="code"&gt;
def print_mapping(job):
    ip_addr, fqdn = job.result()
    print ip_addr, fqdn
&lt;/pre&gt;

Technical notes: The callback occurs in the same process which
submitted the job, which is exactly what's needed here. However, the
documentation doesn't say that all of the callbacks will be done from
the same thread, so if you are using a thread pool then you probably
want to use a thread lock around a shared resource. (sys.stdout is a
shared resource, so you would need one around the print statement
here. I'm using a process pool, and the concurrent process pool
implementation uses a single local worker thread, so I don't think I
have to worry about contention. You should verify that.)
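
Here is a minimal sketch of the locking pattern for the thread pool case. The "record" and "double" functions are made-up stand-ins for print_mapping and resolve_fqdn, and the shared resource is a list instead of sys.stdout. It is written so that it also runs on Python 3.

```python
from concurrent import futures
import threading

results = []
results_lock = threading.Lock()

def record(job):
    # The callback is handed the finished Future, not the bare result
    value = job.result()
    with results_lock:          # guard the shared list against concurrent callbacks
        results.append(value)

def double(x):
    # Stand-in for resolve_fqdn
    return 2 * x

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    for n in range(5):
        job = executor.submit(double, n)
        job.add_done_callback(record)

# In this thread-pool sketch, leaving the "with" block waits for every job,
# and so for every callback
print(sorted(results))
```

The results are sorted before printing because the callbacks may fire in any order.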

&lt;/P&gt;&lt;P&gt;

Here is the final callback-based code:

&lt;pre class="code"&gt;
from concurrent import futures
import glob
import gzip
import socket

def get_unique_ip_addresses(filenames):
    # Report only the unique IP addresses in the set
    seen = set()
    
    for filename in filenames:
        with gzip.open(filename) as gzfile:
            for line in gzfile:
                # The IP address is the first word in the line
                ip_addr = line.split()[0]
                if ip_addr not in seen:
                    seen.add(ip_addr)
                    yield ip_addr

def resolve_fqdn(ip_addr):
    fqdn = socket.getfqdn(ip_addr)
    return ip_addr, fqdn

## A multi-threaded version should create a resource lock
# import threading
# write_lock = threading.Lock()

def print_mapping(job):
    ip_addr, fqdn = job.result()
    print ip_addr, fqdn

    ## A multi-threaded version should use the resource lock
    # with write_lock:
    #   print ip_addr, fqdn

filenames = glob.glob("www_logs/www.*.gz")
with futures.ProcessPoolExecutor(max_workers=50) as executor:
    for ip_addr in get_unique_ip_addresses(filenames):
        job = executor.submit(resolve_fqdn, ip_addr)
        job.add_done_callback(print_mapping)
&lt;/pre&gt;

It processed my 117,504 addresses in 1236 seconds (about 21 minutes),
which means a rate of 95 per second. That's much better than my
original rate of 5 per second!

&lt;/P&gt;
&lt;h3&gt;functools.partial&lt;/h3&gt;
&lt;P&gt;

By the way, just like earlier, there's no absolute need for the worker
function to return the IP address. I could have written this as:

&lt;pre class="code"&gt;
job = executor.submit(socket.getfqdn, ip_addr)
job.add_done_callback(functools.partial(print_mapping, ip_addr))
&lt;/pre&gt;

or even as an ugly-looking lambda function with a default value to get
around scoping issues.

&lt;/P&gt;&lt;P&gt;

In this variation, print_mapping becomes:

&lt;pre class="code"&gt;
def print_mapping(ip_addr, job):
    fqdn = job.result()
    print ip_addr, fqdn
&lt;/pre&gt;

where the "ip_addr" was stored by the "partial()", and where "job"
comes from the completed future.

&lt;/P&gt;&lt;P&gt;

This approach feels more "pure", but I find that methods like this are
harder for most people to understand.
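
For comparison, here is a minimal sketch showing the partial() form next to the lambda-with-default form. The "fake_lookup" function is a hypothetical stand-in for socket.getfqdn so that no network is needed, "record_mapping" collects results instead of printing them, and I use a thread pool to keep the example self-contained. It is written so that it also runs on Python 3.

```python
from concurrent import futures
import functools

seen = []

def record_mapping(ip_addr, job):
    # ip_addr was bound in advance; job is the finished Future
    seen.append((ip_addr, job.result()))

def fake_lookup(ip_addr):
    # Hypothetical stand-in for socket.getfqdn, so no network is needed
    return "host-" + ip_addr.replace(".", "-") + ".example.com"

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    for ip_addr in ["192.0.2.1", "192.0.2.2"]:
        job = executor.submit(fake_lookup, ip_addr)
        # partial() freezes the current ip_addr as the first argument
        job.add_done_callback(functools.partial(record_mapping, ip_addr))

    for ip_addr in ["192.0.2.3"]:
        job = executor.submit(fake_lookup, ip_addr)
        # The lambda spelling: the default value captures ip_addr now,
        # rather than whatever the loop variable holds when the callback runs
        job.add_done_callback(lambda job, ip_addr=ip_addr: record_mapping(ip_addr, job))

print(sorted(seen))
```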

&lt;/P&gt;
&lt;h2&gt;Who subscribes to my blog's RSS feed?&lt;/h2&gt;
&lt;P&gt;

A quick check of the list of hostnames shows that no one from
AstraZeneca reads my blog from a work machine. Actually, I don't have
any accesses from them at all, which is a bit surprising since I know
some of them follow what I do. They might use a blog aggregator like
Google Reader, or use a home account, or perhaps AZ's data goes
through a proxy which doesn't have the name "az" or "astrazeneca" in
it.

&lt;/P&gt;&lt;P&gt;

There are requests from Roche, and Vertex, but no blog
subscribers. Who then subscribes to my blog?

&lt;/P&gt;&lt;P&gt;

Here I print the hostnames for requests which fetch my blog's RSS
feed. With 'zgrep' it's fast enough that I'm not going to parallelize
the code.

&lt;pre class="code"&gt;
import subprocess
import glob

hostname_table = dict(line.split() for line in open("hostnames"))

filenames = glob.glob("www_logs/www.*.gz")
p = subprocess.Popen(["zgrep", "--no-filename", "/writings/diary/diary-rss.xml"] + filenames,
                     stdout = subprocess.PIPE)

for line in p.stdout:
    ip_addr = line.split()[0]
    hostname = hostname_table[ip_addr]
    if hostname == ip_addr:
        # Couldn't find a reverse lookup; ignore
        continue
    print hostname
&lt;/pre&gt;

A quick look at the output shows a lot of requests from Amazon and
Google, so I removed those, and report the results using:

&lt;pre class="code"&gt;
% python2.7 readers.py | sort | uniq -c | sort -n | grep -v amazon | grep -v google.com
&lt;/pre&gt;

Since I have 169 days of log files, I'll say that "avid readers" poll
the URL at least once per day. That gives me:

&lt;pre class="code"&gt;
 176 modemcable139.154-178-173.mc.videotron.ca
 184 62.197.198.100
 187 v041222.dynamic.ppp.asahi-net.or.jp
 195 modemcable147.252-178-173.mc.videotron.ca
 200 94-226-195-151.access.telenet.be
 202 217.28.199.236
 223 123.124.21.91
 223 65.52.56.128
 241 5a-m02-d1.data-hotel.net
 263 117.218.210-67.q9.net
 263 173-11-122-218-sfba.hfc.comcastbusiness.net
 274 71-222-225-175.albq.qwest.net
 332 modemcable069.85-178-173.mc.videotron.ca
 335 no-dns-yet.convergencegroup.co.uk
 337 ip-81-210-146-57.unitymediagroup.de
 338 embln.embl.de
 353 cpe-72-183-122-94.austin.res.rr.com
 365 90-227-178-245-no128.tbcn.telia.com
 370 138.194.48.143
 408 210.96-246-81.adsl-static.isp.belgacom.be
 428 k8024-02l.mc.chalmers.se
 490 cpe-70-115-243-212.satx.res.rr.com
 493 www26006u.sakura.ne.jp
 527 219.239.34.54
 534 44.186.34.193.bridgep.com
 535 5a-m02-d2.data-hotel.net
 586 5a-m02-c6.data-hotel.net
 666 168-103-109-30.albq.qwest.net
 676 y236106.dynamic.ppp.asahi-net.or.jp
 698 82-169-211-97.ip.telfort.nl
 759 w-192.cust-7150.ip.static.uno.uk.net
1071 adsl-75-23-68-58.dsl.peoril.sbcglobal.net
1217 hekate.eva.mpg.de
1223 211.103.236.94
1307 li147-78.members.linode.com
1342 static24-72-40-170.r.rev.accesscomm.ca
1346 dinsdale.python.org
1398 145.253.161.126
1771 pat1.orbitz.net
2060 artima.com
3164 148.188.1.60
3767 90-224-169-87-no128.tbcn.telia.com
4518 jervis.textdrive.com
5791 cpe-70-114-252-25.austin.res.rr.com
6171 ip21.biogen.com
8264 it18689.research.novo.dk
10919 61.135.216.104
&lt;/pre&gt;

I know who comes from one of the Max Planck Institute machines, and a
big "hello!" to the readers from Novo Nordisk and Biogen - thanks for
subscribing to my blog! "dinsdale.python.org" is the &lt;a
href="http://planet.python.org/"&gt;Planet Python&lt;/a&gt; aggregator, and
artima.com is the &lt;a href="http://www.artima.com/index.jsp"&gt;Artima
Developer Community&lt;/a&gt;; another aggregator.
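
&lt;/P&gt;&lt;P&gt;

The counting done by the shell pipeline, along with the once-per-day
cutoff, can also be sketched in Python. This is a minimal Python 3
sketch under a few assumptions: "avid_readers" is a made-up helper, it
takes the hostnames as any iterable of strings, and it uses plain
substring tests where grep uses regular expressions.

&lt;pre class="code"&gt;
from collections import Counter

def avid_readers(hostnames, min_requests, skip=("amazon", "google.com")):
    # Drop the aggregator hosts, like the two "grep -v" filters.
    wanted = (h for h in hostnames if not any(s in h for s in skip))
    # Count the requests per host, like "sort | uniq -c".
    counts = Counter(wanted)
    # Keep the hosts with enough requests, ordered by count like "sort -n".
    return sorted((n, h) for h, n in counts.items() if n >= min_requests)
&lt;/pre&gt;

With 169 days of logs, a min_requests of 169 corresponds to polling
the feed at least once per day on average.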

&lt;/P&gt;&lt;P&gt;

I know more than 60 people read my blog posts within 12 hours of when
they are posted, so this tells me that most people read blogs through
a web-based aggregator (like Planet Python or Google Reader), and not
through a program running on their desktop. I'm glad to know I'm not
alone in doing that!

&lt;/P&gt;&lt;P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/01/19/concurrent.futures.html</guid><pubDate>Thu, 19 Jan 2012 12:00:00 GMT</pubDate></item><item><title>I parallelize an algorithm</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/01/17/I_parallelize_an_algorithm.html</link><description>&lt;P&gt;

I, with help from Kim Walisch, have been adding &lt;a
href="http://en.wikipedia.org/wiki/OpenMP"&gt;OpenMP&lt;/a&gt; support to &lt;a
href="http://code.google.com/p/chem-fingerprints/"&gt;chemfp&lt;/a&gt;. This is
the first time I've used OpenMP, and I'm pleased to say there are
places where it works really well. However, OpenMP, like every other
parallelization strategy, is no global panacea. It can take a lot of
effort to get good scaling, and there are cases where it doesn't feel
any easier to use OpenMP than to use pthreads (POSIX threads).

&lt;/P&gt;&lt;P&gt;

In this essay I'll walk through how I converted a single-threaded
algorithm to OpenMP, and compare the results to a version built on a
Python async I/O library on top of pthreads.

&lt;/P&gt;
&lt;h2&gt;The single-threaded algorithm&lt;/h2&gt;
&lt;P&gt;

Suppose you have a list of N objects - let's call them "fingerprints"
- and a function which compares two fingerprints - call it "tanimoto"
- which returns a similarity score value from 0.0 to 1.0. A score of
0.0 means "not similar" and 1.0 means "very similar." The similarity
of a fingerprint compared with itself is always 1.0, and the tanimoto
function is symmetric, so that tanimoto(x, y) == tanimoto(y, x).
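
&lt;/P&gt;&lt;P&gt;

For concreteness, here is a minimal Python 3 sketch of such a
function. It uses the usual definition of the Tanimoto score for bit
fingerprints, which is the number of bits set in both fingerprints
divided by the number of bits set in either, and it assumes each
fingerprint is a non-empty frozenset of the positions of its on
bits. That representation is chosen for readability; it is only an
illustration and says nothing about how chemfp stores fingerprints.

&lt;pre class="code"&gt;
def tanimoto(fp1, fp2):
    # Each fingerprint is modeled as a non-empty frozenset of "on" bit positions.
    in_both = len(fp1.intersection(fp2))
    in_either = len(fp1.union(fp2))
    return in_both / float(in_either)

a = frozenset([1, 4, 9, 16])
b = frozenset([1, 4, 25])
&lt;/pre&gt;

Here tanimoto(a, b) is 2/5 = 0.4, the same as tanimoto(b, a), and
tanimoto(a, a) is 1.0.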

&lt;/P&gt;&lt;P&gt;

One question you can ask is "which fingerprints are within
&lt;i&gt;threshold&lt;/i&gt; similarity to the tenth fingerprint?" I'll use the
term "neighbor" to include any fingerprint which is at least threshold
similar to a given fingerprint, so I can restate the above as "what
are the &lt;i&gt;threshold&lt;/i&gt; neighbors of the tenth fingerprint?"

&lt;/P&gt;&lt;P&gt;

The code for this is not hard:

&lt;pre class="code"&gt;
  fingerprint_t fingerprint[] = {array of fingerprint objects};
  int query_index = 9;   // The tenth fingerprint (indices start at 0)
  int target_index;
  assert(0.0 &amp;lt;= threshold &amp;&amp; threshold &amp;lt;= 1.0);
  
  for (target_index=0; target_index&amp;lt;N; target_index++) {
    if (tanimoto(fingerprint[query_index], fingerprint[target_index]) &amp;gt;= threshold) {
      if (query_index != target_index) {
          printf("%d is a neighbor\n", target_index);
      }
    }
  }
&lt;/pre&gt;

The main subtlety is the check that I don't report that the query
fingerprint is a neighbor of itself. There are a few ways to handle
that case: here I chose one which is optimized for performance,
assuming relatively few targets are similar enough.

&lt;/P&gt;&lt;P&gt;

Fingerprint searches are in a high-dimensional space so optimizations
like k-d trees, which work for lower dimensional spaces, suffer from
the &lt;a
href="http://en.wikipedia.org/wiki/Curse_of_dimensionality"&gt;curse of
dimensionality&lt;/a&gt;. For exact answers, the best you can expect is
linear performance. There are clever ways to get sublinear
performance, but the worst case is still linear. Still, computers are
fast, and can search 100,000 fingerprints in a blink.

&lt;/P&gt;&lt;P&gt;

Another question you can ask is, what are the neighbor counts for all
of the fingerprints in the data set? Here's code which computes that:

&lt;pre class="code"&gt;
  fingerprint_t fingerprint[] = {array of fingerprint objects};
  int count, query_index, target_index;
  int counts[N] = {0}; // initialize to 0
  assert(0.0 &amp;lt;= threshold &amp;&amp; threshold &amp;lt;= 1.0);
  
  for (query_index=0; query_index&amp;lt;N; query_index++) {
    count = 0;
    for (target_index=0; target_index&amp;lt;N; target_index++) {
      if (tanimoto(fingerprint[query_index], fingerprint[target_index]) &amp;gt;= threshold) {
        count++;
      }
    }
    /* The counts are too high by one since it includes the diagonal term */
    /* and tanimoto(fingerprint[i], fingerprint[i]) == 1.0 */
    /* Decrement by one to get the correct answer */
    counts[query_index] += count - 1;
  }
&lt;/pre&gt;

What this does is go row-by-row through the NxN comparison matrix,
compute the similarities, and add up the number of times where the
similarity is high enough. Since I include the diagonal term in the
counts, and since the similarity along the diagonal is always 1.0, I
have to subtract off one after computing the total row count.

&lt;/P&gt;&lt;P&gt;

Some might consider it inelegant that I count the self-similarity in
the main loop and subtract one at the end, but it makes the code
short and understandable, and while there are N extra calculations,
the double loop has a total of N*N calculations, so it's only a small
amount of extra work.


&lt;/P&gt;
&lt;h2&gt;Parallelizing the NxN algorithm&lt;/h2&gt;
&lt;P&gt;

Parallelizing this with OpenMP is dead-simple. I ask the compiler to
evaluate the row loop in parallel.

&lt;pre class="code"&gt;
  fingerprint_t fingerprint[] = {array of fingerprint objects};
  int count, query_index, target_index;
  int counts[N] = {0}; // initialize to 0
  assert(0.0 &amp;lt;= threshold &amp;&amp; threshold &amp;lt;= 1.0);
  
  &lt;b&gt;#pragma omp parallel for private(count, target_index)&lt;/b&gt;
  for (query_index=0; query_index&amp;lt;N; query_index++) {
    count = 0;
    for (target_index=0; target_index&amp;lt;N; target_index++) {
      if (tanimoto(fingerprint[query_index], fingerprint[target_index]) &amp;gt;= threshold) {
        count++;
      }
    }
    /* The counts are too high by one since it includes the diagonal term */
    /* and tanimoto(fingerprint[i], fingerprint[i]) == 1.0 */
    counts[query_index] += count - 1;
  }
&lt;/pre&gt;

That's it! OpenMP is a great fit for this case. With one new line of
code, I have very good scaleup across many cores.

&lt;/P&gt;&lt;P&gt;

It's not perfect scaleup. For one, the tanimoto() calculation is fast;
fast enough that memory bandwidth and cache performance are
issues. It might be faster to use Z-ordering or another cache-oblivious
ordering. That's outside the scope of this essay. For that matter, I
haven't tested this hypothesis because I use another technique which
usually gives good cache behavior for the situations I'm most
concerned about.

&lt;/P&gt;
&lt;h2&gt;What about symmetry?&lt;/h2&gt;
&lt;P&gt;

That easy parallelization is great, right? Well, I'm missing out on a
simple factor of two speedup. The tanimoto function is symmetric, so I
only need to compute the upper triangle terms. Here's the
single-threaded implementation:

&lt;pre class="code"&gt;
  for (query_index=0; query_index&amp;lt;N; query_index++) {
    for (target_index=query_index+1; target_index&amp;lt;N; target_index++) {
      if (tanimoto(fingerprint[query_index], fingerprint[target_index]) &amp;gt;= threshold) {
        counts[query_index]++;
        counts[target_index]++;
      }
    }
  }
&lt;/pre&gt;

It looks simple. Too bad it doesn't parallelize well. Increment is not
an atomic operation. If multiple threads execute "counts[i]++" at the
same time, for the same value of i, then it might be that thread 1
reads a value of 4, thread 2 reads a value of 4, thread 1 writes the
incremented value 5, and thread 2 writes its own incremented value of
5. This is bad.
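
&lt;/P&gt;&lt;P&gt;

That interleaving can be replayed by hand. This is a minimal,
single-threaded Python 3 sketch which performs the reads and writes in
the bad order just described. No real threads are involved, and
"interleaved_increment" is a name made up for the illustration.

&lt;pre class="code"&gt;
def interleaved_increment(counts, i):
    # Replay the bad schedule: both "threads" read before either one writes.
    seen_by_thread1 = counts[i]
    seen_by_thread2 = counts[i]
    counts[i] = seen_by_thread1 + 1
    counts[i] = seen_by_thread2 + 1

counts = [4]
interleaved_increment(counts, 0)
print(counts[0])  # prints 5, not 6
&lt;/pre&gt;

Two increments were requested but only one took effect, so the final
neighbor count would be too low.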

&lt;/P&gt;&lt;P&gt;

One solution is to tell OpenMP that the increment code is a "critical"
section, which means only one thread can execute it at a time. The
resulting code is:

&lt;pre class="code"&gt;
  &lt;b&gt;#pragma omp parallel for private(target_index)&lt;/b&gt;
  for (query_index=0; query_index&amp;lt;N; query_index++) {
    for (target_index=query_index+1; target_index&amp;lt;N; target_index++) {
      if (tanimoto(fingerprint[query_index], fingerprint[target_index]) &amp;gt;= threshold) {
        &lt;b&gt;#pragma omp critical (add_count)&lt;/b&gt;
        counts[query_index]++;
        &lt;b&gt;#pragma omp critical (add_count)&lt;/b&gt;
        counts[target_index]++;
      }
    }
  }
&lt;/pre&gt;

Here, 'add_count' is the symbolic name for a global lock to a critical
section.

&lt;/P&gt;&lt;P&gt;

I wrote something like this with a very high threshold, and found an
almost perfect two-fold speedup. Go OpenMP!

&lt;/P&gt;
&lt;h2&gt;Amdahl's Law strikes again!&lt;/h2&gt;
&lt;P&gt;

The problem is the critical sections are single-threaded. When I lower
the threshold, I find more matches, and more threads try to run the
single-threaded code. This runs directly into &lt;a
href="http://en.wikipedia.org/wiki/Amdahl's_law"&gt;Amdahl's Law&lt;/a&gt;. The
critical section becomes the limiting factor as all the threads
contend for the same lock.
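
&lt;/P&gt;&lt;P&gt;

Amdahl's Law says that if a fraction p of the run-time can be spread
across n cores and the rest is serial, then the best possible speedup
is 1 / ((1 - p) + p / n). Here is a minimal Python 3 sketch of that
formula. The fractions in the example are made up for illustration;
they are not measurements from the timings in this essay.

&lt;pre class="code"&gt;
def amdahl_speedup(parallel_fraction, num_cores):
    # Best-case speedup when only parallel_fraction of the work scales.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_cores)

for fraction in (1.0, 0.9, 0.5):
    print(fraction, round(amdahl_speedup(fraction, 4), 2))
&lt;/pre&gt;

On four cores this gives 4.0, 3.08 and 1.6. Once half of the time is
spent in serial code, such as waiting on a single lock, four cores
cannot even double the speed.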

&lt;/P&gt;&lt;P&gt;

I can reduce the contention a bit by keeping track of the row counts
in a thread-local variable:

&lt;/P&gt;&lt;P&gt;

&lt;pre class="code"&gt;
  #pragma omp parallel for private(count, target_index)
  for (query_index=0; query_index&amp;lt;N; query_index++) {
    &lt;b&gt;count = 0;&lt;/b&gt;
    for (target_index=query_index+1; target_index&amp;lt;N; target_index++) {
      if (tanimoto(fingerprint[query_index], fingerprint[target_index]) &amp;gt;= threshold) {
        count++;
        #pragma omp critical (add_count)
        counts[target_index]++;
      }
    }
    /* Correction on 2012-01-12; Commenter &lt;a href="http://news.ycombinator.com/item?id=3475825"&gt;scott_s on Hacker News pointed out&lt;/a&gt; */
    /* that only one thread will ever get here with a given query_index. */
    /* There is no chance that multiple threads try to change the same value */
    /* This critical section can be removed without affecting correctness. */
    /* I leave this code here because it affects the timings. */
    /* It does not affect the conclusion that lock contention can make things slow */
    &lt;b&gt;#pragma omp critical (add_count)&lt;/b&gt;
    &lt;b&gt;counts[query_index] += count;&lt;/b&gt;
  }
&lt;/pre&gt;

This is the simplest seemingly-reasonable parallelization of the
upper-triangle algorithm.

&lt;/P&gt;&lt;P&gt;

How well does this work? My desktop has four cores. I can compare the
performance of the original non-symmetric code with the symmetric one:

&lt;center&gt;
&lt;table&gt;
 &lt;tr&gt;&lt;th rowspan="2"&gt;Algorithm&lt;/th&gt;&lt;th colspan="3"&gt;Tanimoto thresholds&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;th&gt;0.8&lt;/th&gt;&lt;th&gt;0.6&lt;/th&gt;&lt;th&gt;0.5&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;symmetric&lt;/td&gt;&lt;td&gt;40s&lt;/td&gt;&lt;td&gt;151s&lt;/td&gt;&lt;td&gt;660s&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;non-symmetric&lt;/td&gt;&lt;td&gt;82s&lt;/td&gt;&lt;td&gt;170s&lt;/td&gt;&lt;td&gt;207s&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

When the threshold is high (0.8), I get the expected factor-of-two
performance boost. This means there is very little contention.  The
factor-of-two advantage mostly disappears for the medium-high
threshold of 0.6, and for the threshold of 0.5 the overall run-time is
much slower than the original NxN algorithm. Indeed, I eventually gave
up trying to determine run-time for even lower thresholds.

&lt;/P&gt;&lt;P&gt;

That's terrible. I have one algorithm which is best when there are few
similarities, and another which is best when there are many
similarities, and because the number of similarities is highly
data-dependent, I don't have an easy way to figure out which algorithm
to use.

&lt;/P&gt;
&lt;h2&gt;Use many critical sections instead of one&lt;/h2&gt;
&lt;P&gt;

The problem is that my four cores all want to use a single critical
section.  When one core has the lock, the other threads have to
wait. What I can do is increase the number of critical sections. For
example, I can have one lock to get access to the even-numbered rows,
and another lock to get access to the odd-numbered rows. Here's the
corresponding code:

&lt;pre class="code"&gt;
  #pragma omp parallel for private(count, target_index)
  for (query_index=0; query_index&amp;lt;N; query_index++) {
    count = 0;
    for (target_index=query_index+1; target_index&amp;lt;N; target_index++) {
      if (tanimoto(fingerprint[query_index], fingerprint[target_index]) &amp;gt;= threshold) {
        count++;
&lt;b&gt;        switch (target_index % 2) {
          case 0:
            #pragma omp critical (add_count0)
            counts[target_index]++;
            break;
          case 1:
            #pragma omp critical (add_count1)
            counts[target_index]++;
            break;
          }&lt;/b&gt;
      }
    }
    /* Correction on 2012-01-12; Commenter &lt;a href="http://news.ycombinator.com/item?id=3475825"&gt;scott_s on Hacker News&lt;/a&gt; pointed out */
    /* that only one thread will ever get here with a given query_index. */
    /* There is no chance that multiple threads try to change the same value */
    /* The following could be replaced with a simple */
    /*     counts[query_index] += count;   */
    /* I leave this code here because it affects the timings. */
    /* Since this is only called O(N) (instead of O(N*N/2) times), it */
    /* should have minimal effect on the overall time. */
&lt;b&gt;    switch (query_index % 2) {
      case 0:
        #pragma omp critical (add_count0)
        counts[query_index] += count;
        break;
      case 1:
        #pragma omp critical (add_count1)
        counts[query_index] += count;
        break;
      }&lt;/b&gt;
  }
&lt;/pre&gt;

That's clumsy, but the performance is a bit better. With two critical
sections the times are:

&lt;center&gt;
&lt;table&gt;
 &lt;tr&gt;&lt;th rowspan="2"&gt;Algorithm&lt;/th&gt;&lt;th colspan="3"&gt;Tanimoto thresholds&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;th&gt;0.8&lt;/th&gt;&lt;th&gt;0.6&lt;/th&gt;&lt;th&gt;0.5&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;symmetric&lt;/td&gt;&lt;td&gt;39s&lt;/td&gt;&lt;td&gt;114s&lt;/td&gt;&lt;td&gt;375s&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;non-symmetric&lt;/td&gt;&lt;td&gt;82s&lt;/td&gt;&lt;td&gt;170s&lt;/td&gt;&lt;td&gt;207s&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

so the symmetric code is faster than it was with one critical section,
but there are still cases where the non-symmetric code is faster.

&lt;/P&gt;&lt;P&gt;

What about even more critical sections? I tried a range of values. Here's the table, with times in seconds:

&lt;center&gt;
&lt;table&gt;
 &lt;tr&gt;&lt;th rowspan="2"&gt;number of&lt;br&gt;critical sections&lt;/th&gt;&lt;th colspan="6"&gt;Tanimoto thresholds&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;th&gt;0.8&lt;/th&gt;&lt;th&gt;0.6&lt;/th&gt;&lt;th&gt;0.5&lt;/th&gt;&lt;th&gt;0.4&lt;/th&gt;&lt;th&gt;0.2&lt;/th&gt;&lt;th&gt;0.01&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;151&lt;/td&gt;&lt;td&gt;660&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;39&lt;/td&gt;&lt;td&gt;114&lt;/td&gt;&lt;td&gt;375&lt;/td&gt;&lt;td colspan="3"&gt;over 37 minutes&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;16&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;86&lt;/td&gt;&lt;td&gt;133&lt;/td&gt;&lt;td&gt;299&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;64&lt;/td&gt;&lt;td&gt;41&lt;/td&gt;&lt;td&gt;84&lt;/td&gt;&lt;td&gt;105&lt;/td&gt;&lt;td&gt;137&lt;/td&gt;&lt;td&gt;271&lt;/td&gt;&lt;td&gt;307&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;128&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;82&lt;/td&gt;&lt;td&gt;102&lt;/td&gt;&lt;td&gt;131&lt;/td&gt;&lt;td&gt;244&lt;/td&gt;&lt;td&gt;278&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;th&gt;non-symmetric&lt;/th&gt;&lt;td&gt;82&lt;/td&gt;&lt;td&gt;170&lt;/td&gt;&lt;td&gt;207&lt;/td&gt;&lt;td&gt;240&lt;/td&gt;&lt;td&gt;272&lt;/td&gt;&lt;td&gt;280&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

The value of 0.01 makes the search algorithm calculate nearly all of
the possible comparisons, so it's an effective worst-case for my
search space. (I can't use 0.0 because my implementation has special
support for that case; it knows that all fingerprints have N-1
neighbors.)

&lt;/P&gt;&lt;P&gt;

As you can see, with 128 critical sections and in the worst case, my
code which takes advantage of symmetry is the same speed as the one
which doesn't. This likely means that the code to acquire and release the
critical section lock has about the same amount of overhead as the
Tanimoto similarity calculation.
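
&lt;/P&gt;&lt;P&gt;

The idea of spreading the counts over many locks does not need one
case per lock. Here is a minimal Python 3 sketch of the same pattern
using a list of locks indexed by "index % num_locks". It is not the
OpenMP code which was timed: "striped_counts" is a made-up name, and
"similar(i, j)" is a hypothetical symmetric predicate standing in for
"tanimoto(...) >= threshold". In CPython the global interpreter lock
keeps pure-Python threads from running in parallel, so this shows the
locking pattern, not a speedup.

&lt;pre class="code"&gt;
import threading
from concurrent.futures import ThreadPoolExecutor

def striped_counts(similar, n, num_locks=128, num_threads=4):
    counts = [0] * n
    # One lock per "stripe"; index i is guarded by locks[i % num_locks].
    locks = [threading.Lock() for _ in range(num_locks)]

    def process_row(query_index):
        count = 0
        for target_index in range(query_index + 1, n):
            if similar(query_index, target_index):
                count += 1
                with locks[target_index % num_locks]:
                    counts[target_index] += 1
        # Other rows may be bumping this same slot as one of their
        # targets, so this sketch takes the stripe lock here as well.
        with locks[query_index % num_locks]:
            counts[query_index] += count

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # list() waits for every row and re-raises any worker exception.
        list(executor.map(process_row, range(n)))
    return counts
&lt;/pre&gt;

Changing num_locks from 1 to 128 changes only how often two threads
want the same lock; the counts which come back are the same.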

&lt;/P&gt;&lt;P&gt;

I probably should have tried 256 different locks, but I think this
code is ugly enough as it is, and very few people use thresholds below
0.6, much less down to 0.2. I do wonder how the times compare if there
are, say, 16 cores, but it's time to try a different solution.

&lt;/P&gt;
&lt;h2&gt;What about per-thread count arrays?&lt;/h2&gt;
&lt;P&gt;

There is an alternate solution. I could sum up the counts in
individual, private/per-thread arrays and merge the final
counts. Here's what the code looks like:

&lt;pre class="code"&gt;
  /* Correction 2012-01-08: Originally I had omp_get_num_threads() here but */
  /* as &lt;a href="http://news.ycombinator.com/item?id=3476042"&gt;acq points out on Hacker News&lt;/a&gt;, this only returns the number of active */
  /* threads while in a parallel section. Otherwise it returns 1. I fixed my */
  /* actual code during testing, but that fix didn't make its way over here. */
  /* I also tried using omp_get_max_threads() but that wasn't supported on my Mac. */
  int *parallel_counts = (int *) calloc(omp_get_max_threads() * N, sizeof(int));
  int *per_thread_counts;

  #pragma omp parallel for private(count, target_index, per_thread_counts)
  for (query_index=0; query_index&amp;lt;N; query_index++) {
    per_thread_counts = parallel_counts + (N * omp_get_thread_num() );
    count = 0;
    for (target_index=query_index+1; target_index&amp;lt;N; target_index++) {
      if (tanimoto(fingerprint[query_index], fingerprint[target_index]) &amp;gt;= threshold) {
        count++;
        per_thread_counts[target_index]++;
      }
    }
    per_thread_counts[query_index] += count;
 }
 for (query_index=0; query_index&amp;lt;N; query_index++) {
   count = 0;
   for (thread=0; thread&amp;lt;omp_get_max_threads(); thread++) {
     count += parallel_counts[query_index+N*thread];
   }
   counts[query_index] += count;
 }
 free(parallel_counts);
&lt;/pre&gt;

This requires no locking, and only a very small bit of sequential code
which is linear in the number of fingerprints. There's more code, but
this algorithm should scale better than the previous algorithm.
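
&lt;/P&gt;&lt;P&gt;

A quick way to gain confidence in the merge step is to check it
against the full NxN counts. Here is a minimal Python 3 sketch which
does that with a thread pool. The names "symmetric_counts" and
"full_matrix_counts" are made up, and "similar(i, j)" is a
hypothetical symmetric predicate standing in for "tanimoto(...) >=
threshold". It illustrates the bookkeeping only and says nothing about
performance.

&lt;pre class="code"&gt;
import threading
from concurrent.futures import ThreadPoolExecutor

def symmetric_counts(similar, n, num_threads=4):
    per_thread = {}  # thread id -> that thread's private list of counts

    def process_row(query_index):
        ident = threading.get_ident()
        if ident not in per_thread:
            per_thread[ident] = [0] * n  # only this thread uses this key
        mine = per_thread[ident]
        for target_index in range(query_index + 1, n):
            if similar(query_index, target_index):
                mine[query_index] += 1
                mine[target_index] += 1

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        list(executor.map(process_row, range(n)))
    # Sequential merge: add up each column across the private lists.
    return [sum(column) for column in zip(*per_thread.values())]

def full_matrix_counts(similar, n):
    # The NxN version, skipping the diagonal term.
    return [sum(1 for j in range(n) if j != i and similar(i, j))
            for i in range(n)]
&lt;/pre&gt;

For any symmetric predicate the two functions should return the same
list, no matter how the rows end up distributed across the worker
threads.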

&lt;/P&gt;&lt;P&gt;

Here are the timings, in seconds:
&lt;center&gt;
&lt;table&gt;
 &lt;tr&gt;&lt;th&gt;&amp;nbsp;&lt;/th&gt;&lt;th colspan="6"&gt;Tanimoto thresholds&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;th&gt;method&lt;/th&gt;&lt;th&gt;0.8&lt;/th&gt;&lt;th&gt;0.6&lt;/th&gt;&lt;th&gt;0.5&lt;/th&gt;&lt;th&gt;0.4&lt;/th&gt;&lt;th&gt;0.2&lt;/th&gt;&lt;th&gt;0.01&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;128 critical sections&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;82&lt;/td&gt;&lt;td&gt;102&lt;/td&gt;&lt;td&gt;131&lt;/td&gt;&lt;td&gt;244&lt;/td&gt;&lt;td&gt;278&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;non-symmetric&lt;/td&gt;&lt;td&gt;82&lt;/td&gt;&lt;td&gt;170&lt;/td&gt;&lt;td&gt;207&lt;/td&gt;&lt;td&gt;240&lt;/td&gt;&lt;td&gt;272&lt;/td&gt;&lt;td&gt;280&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;per-thread counts&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;83&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;td&gt;116&lt;/td&gt;&lt;td&gt;135&lt;/td&gt;&lt;td&gt;137&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

There's the factor of two I was looking for!

&lt;/P&gt;
&lt;h2&gt;And now using pthreads from Python&lt;/h2&gt;
&lt;P&gt;

You might conclude this still shows a win for OpenMP. The problem is
that the above is essentially identical to how I implement this
algorithm using pthreads. I'm rather fond of Python's new &lt;a
href="http://docs.python.org/dev/library/concurrent.futures.html"&gt;concurrent.futures&lt;/a&gt;
module, so I tested out a pthread-only driver to a single-threaded
count function implemented in C.

&lt;pre class="code"&gt;
import ctypes
import time
import itertools
from collections import defaultdict
import threading

import chemfp
from chemfp import futures
import _chemfp


def count_tanimoto_hits(all_counts, arena, threshold, row):
    thread_id = threading.current_thread().ident
    # This implements essentially:
    # for (i=row; i&amp;lt;row+1; i++) {
    #   for (j=row+1; j&amp;lt;N; j++) {
    #     if (tanimoto(fingerprints[i], fingerprints[j]) &amp;gt;= threshold) {
    #       counts[i]++;
    #       counts[j]++;
    #     }
    #   }
    # }
    _chemfp.count_tanimoto_hits_arena_symmetric(
        threshold, arena.num_bits,
        arena.start_padding, arena.end_padding, arena.storage_size, arena.arena,
        row, row+1, row+1, len(arena),
        arena.popcount_indices, all_counts[thread_id])

def find_counts(arena, threshold, num_threads):
    # Allocate per-thread storage (based on the thread-id)
    def make_empty_counts():
        return (ctypes.c_int*len(arena))()
    all_counts = defaultdict(make_empty_counts)
    
    # Use a thread-pool with num_threads worker threads
    with futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        for row in xrange(len(arena)):
            executor.submit(count_tanimoto_hits, all_counts, arena, threshold, row)

    # Merge the private counts back into one total list of counts
    return [sum(cols) for cols in itertools.izip(*all_counts.values())]


arena = chemfp.load_fingerprints("zinc_drug_like.fps")

chemfp.set_num_threads(1) # Don't use multiple OpenMP threads

for threshold in (0.8, 0.6, 0.5, 0.4, 0.2, 0.01):
    t1 = time.time()
    x = find_counts(arena, threshold, 4)
    t2 = time.time()
    print threshold, t2-t1, "   ", sum(x)
&lt;/pre&gt;

The "chemfp.set_num_threads(1)" case bypasses the OpenMP-based code
and tells "count_tanimoto_hits_arena_symmetric" to use the simple
upper-right triangle implementation. As a result ...

&lt;center&gt;
&lt;table&gt;
 &lt;tr&gt;&lt;th&gt;&amp;nbsp;&lt;/th&gt;&lt;th colspan="6"&gt;Tanimoto thresholds&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;th&gt;method&lt;/th&gt;&lt;th&gt;0.8&lt;/th&gt;&lt;th&gt;0.6&lt;/th&gt;&lt;th&gt;0.5&lt;/th&gt;&lt;th&gt;0.4&lt;/th&gt;&lt;th&gt;0.2&lt;/th&gt;&lt;th&gt;0.01&lt;/th&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;128 critical sections&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;82&lt;/td&gt;&lt;td&gt;102&lt;/td&gt;&lt;td&gt;131&lt;/td&gt;&lt;td&gt;244&lt;/td&gt;&lt;td&gt;278&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;non-symmetric&lt;/td&gt;&lt;td&gt;82&lt;/td&gt;&lt;td&gt;170&lt;/td&gt;&lt;td&gt;207&lt;/td&gt;&lt;td&gt;240&lt;/td&gt;&lt;td&gt;272&lt;/td&gt;&lt;td&gt;280&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;per-thread counts&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;83&lt;/td&gt;&lt;td&gt;100&lt;/td&gt;&lt;td&gt;116&lt;/td&gt;&lt;td&gt;135&lt;/td&gt;&lt;td&gt;137&lt;/td&gt;&lt;/tr&gt;
 &lt;tr&gt;&lt;td&gt;Python/pthreads&lt;/td&gt;&lt;td&gt;48&lt;/td&gt;&lt;td&gt;92&lt;/td&gt;&lt;td&gt;112&lt;/td&gt;&lt;td&gt;128&lt;/td&gt;&lt;td&gt;145&lt;/td&gt;&lt;td&gt;149&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

The pthread timings look similar to those for OpenMP, except with a
roughly 8 second (and near constant-time) overhead. This is likely the
cost of starting N=110885 different jobs for the worker threads in
Python. I confirmed this by using a threshold of 0.95. The per-thread
OpenMP algorithm takes 9.2 seconds while the pthread version takes
16.5 seconds, a difference of about 7 seconds, as predicted.

&lt;/P&gt;&lt;P&gt;

While I did not test it out, I expect that a corresponding C/C++
implementation would have much less performance overhead. I just don't
have enough experience with the C pthread API, or an async I/O library
for C/C++ like &lt;a
href="http://en.wikipedia.org/wiki/Grand_Central_Dispatch"&gt;Grand
Central Dispatch&lt;/a&gt;, or C++'s new promises, to try to implement that
code directly in C. It really is much easier to use OpenMP than to
figure out those alternate solutions for C.

&lt;/P&gt;
&lt;h2&gt;Possible bad benchmark comparisons&lt;/h2&gt;
&lt;P&gt;

BTW, what I ended up doing in my Python code was to define a "band" of
100 rows, and let each thread process 100 rows at a time. This should
cut the overhead down from 8 seconds to 0.08 seconds, making the
pthread code about comparable to the OpenMP code.  I didn't test it
out though, because my actual code uses a more sophisticated algorithm
which also has the effect of improving cache coherency, and there's
evidence that banding makes the coherency worse and causes slowdowns
while waiting for memory fetches.
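
&lt;/P&gt;&lt;P&gt;

The arithmetic behind that estimate can be sketched in a few lines of
Python 3. This is a minimal sketch which assumes the overhead is
proportional to the number of submitted jobs; "make_bands" is a name
made up here and is not part of chemfp.

&lt;pre class="code"&gt;
def make_bands(num_rows, band_size=100):
    # Half-open (start, end) row ranges; the last band may be shorter.
    return [(start, min(start + band_size, num_rows))
            for start in range(0, num_rows, band_size)]

num_rows = 110885                    # N from the timings above
per_job_overhead = 8.0 / num_rows    # about 8 seconds spread over N jobs
bands = make_bands(num_rows)
print(len(bands))                                # prints 1109
print(round(len(bands) * per_job_overhead, 2))   # prints 0.08
&lt;/pre&gt;

Each worker would then be handed one (start, end) pair instead of a
single row, so there are about 100 times fewer jobs to start.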

&lt;/P&gt;&lt;P&gt;

Unfortunately, it also looks like &lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2012/01/13/openmp_vs_posix_threads.html"&gt;the
analysis I did the other day&lt;/a&gt; had a flaw which causes the pthread
benchmark to have bad memory access behavior. In short, the pthreads
were randomly assigned bands to process, while the OpenMP version also
gets randomly assigned bands, but all of the cores work on tasks in
the same band. Hence, better coherency. (It looks like the pthread
performance for one test case goes from 48 seconds with randomly
assigned bands to 40 seconds with sequentially assigned bands.)

&lt;/P&gt;&lt;P&gt;

I consider this a win for OpenMP. I did random assignments so I could
display a progress bar. Asymmetries in the data mean that the first
few bands and the last few bands are much easier to process than ones
in the middle. With random band assignment, I get a much better
estimate of the time to completion. Using OpenMP gives me that
estimate without a noticeable performance hit. With pthreads, it's much
harder to get both performance and an estimate.

&lt;/P&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;P&gt;

Effective parallelization with good scaleup is hard, no matter which
technique you use. There are a lot of subtle issues. You need to
understand how the technique works and measure your results to make
sure you really do understand the problem.

&lt;/P&gt;&lt;P&gt;

My experience is that OpenMP is an effective technique to help you
with your task. In a few cases, a trivially simple addition of a line
of code gives instant speedup. It's more likely though that you've got
some work ahead of you to make this happen.

&lt;/P&gt;&lt;P&gt;

If you're already using pthreads, Grand Central Dispatch, or some
other multithreading or asynchronous I/O system, then I don't think
that OpenMP adds much new. Instead, I think its biggest advantage is
that you can make changes to existing code without introducing a new
library API, without having to set up your own event loop/reactor, and
with compiler-based thread control and synchronization primitives which
make a large number of potential bugs of hand-built systems disappear.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/01/17/I_parallelize_an_algorithm.html</guid><pubDate>Tue, 17 Jan 2012 12:00:00 GMT</pubDate></item><item><title>My views on OpenMP</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/01/16/my_views_on_openmp.html</link><description>&lt;P&gt;

In private email a correspondent observed that OpenMP makes threading
very easy, but "it really seems under utilized in the community."
(Here, 'community' is 'scientific programming.') I was surprised to
find out that I had strong views on the topic.

&lt;/P&gt;&lt;P&gt;

OpenMP sits between several other pieces of technology, being:

&lt;ul&gt;
 &lt;li&gt;GPU computing&lt;/li&gt;
 &lt;li&gt;cloud computing&lt;/li&gt;
 &lt;li&gt;POSIX and other common threading libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;/P&gt;&lt;P&gt;

The new hotness is GPUs. Wes Faler gave a presentation at the recent
28th Chaos Communication Congress on &lt;a
href="http://events.ccc.de/congress/2011/Fahrplan/events/4818.en.html"&gt;Evolving
Custom Communication Protocols&lt;/a&gt;. He mentioned they ported C++ code
over to the GPU. The unoptimized version was 7 times slower on the GPU
than the CPU. However, they do many evaluations using the same
function, and because there are so many compute threads in the GPU,
the overall time was a factor of 7 faster. Similarly, &lt;a
href="http://dx.doi.org/10.1021/ci200235e"&gt;Haque et al.&lt;/a&gt;
showed that a 4 core desktop machine, properly tuned, was "only" about
5x slower than a GPU card.

&lt;/P&gt;&lt;P&gt;

It looks like GPU computing is currently the approach to take if you
do a lot of evaluation of similar tasks, assuming you have the GPUs
and programming time available. That performance (and the novel way of
computing) interests people who might otherwise use OpenMP.

&lt;/P&gt;&lt;P&gt;

Cloud computing is another hotness. Alex Martelli was recently
interviewed by Larry Hastings in &lt;a
href="http://radiofreepython.com/episodes/2/"&gt;Radio Free Python
episode #2&lt;/a&gt;. At 33:47 Larry asked about Python's &lt;a
href="http://en.wikipedia.org/wiki/Global_Interpreter_Lock"&gt;global
interpreter lock&lt;/a&gt; and Alex's reply was:

&lt;blockquote&gt;

I hate threading anyway. Multiprocessing is the way to go, and
message-passing, not shared memory. That just doesn't scale. I use
multithreading so I can use all of my 16 cores, or whatever is the
average number of cores in a machine these days. Big furry deal. I've
got a few thousand servers waiting for me in the data center and how
do I use those with threading?

&lt;/blockquote&gt;

The topic comes up several times in the ensuing discussion.

&lt;/P&gt;&lt;P&gt;

What good indeed is OpenMP, which might be used for a 16-core machine,
if you're working on problems which involve 10,000 distributed
servers?

&lt;/P&gt;&lt;P&gt;

Even single nodes have multiple cores these days, and a good OpenMP
implementation might help make good use of the nodes in that
cloud. However, you have to compare OpenMP to traditional POSIX
multithreading. OpenMP works for C/C++ and Fortran, but not for
Python, (it seems) Java, or other languages which support pthreads.
You're out of luck if you want to use OpenMP with one of those other
languages.

&lt;/P&gt;&lt;P&gt;

Some things scale up wonderfully well by adding one or two OpenMP
directives, but parallelism is rarely as trivial as giving a few hints
to the compiler. I think that the non-trivial cases of parallelizing
with OpenMP are about as much work as using pthreads, or a system like
&lt;a href="http://en.wikipedia.org/wiki/Grand_Central_Dispatch"&gt;Grand
Central Dispatch&lt;/a&gt;. I'll work through an example of doing that in my
next essay.

&lt;/P&gt;&lt;P&gt;

I do believe that OpenMP scales better than these alternatives for
some cases, in part because the compiler is doing the work rather than
using a library API. My tests so far show that pthreads and OpenMP
have about the same scaling with two processors, and I need four or
more cores to show a strong OpenMP advantage.

&lt;/P&gt;&lt;P&gt;

Most desktop/laptop computers just don't yet have 8+ cores. (Alex
Martelli said otherwise, but perhaps he's talking about Google's data
centers.)  Most people develop for their own computers, which lessens
the incentive to work on good multicore scaling.

&lt;/P&gt;&lt;P&gt;

I have a four-core machine, and I'm willing to write a Python
extension in C which uses OpenMP. Even then I've run into some
difficulties. It took a while but I figured out how to configure
Python's setup.py so it includes the right "use OpenMP" flag for each
compiler. My setup.py includes a hard-coded list of compilers which do and do
not support OpenMP. Also, did you know that on a Mac you must run
OpenMP tasks in the main thread, and not in a pthread? Otherwise your
program crashes, even when you have a single OpenMP thread! I had to
figure out a workaround so I could use my library unchanged inside
Django.

&lt;/P&gt;&lt;P&gt;

People are interested in OpenMP development, but some who might use
OpenMP are drawn to other technologies. Some tasks are very
appropriate for OpenMP, but they are almost as appropriate for other,
more common technologies. OpenMP scales well, but most people don't
have the hardware where OpenMP shines. Even when they do, they have to
work in one of a handful of languages, and in somewhat restricted
circumstances.

&lt;/P&gt;&lt;P&gt;

All these contribute to diminishing OpenMP utilization in the community.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/01/16/my_views_on_openmp.html</guid><pubDate>Mon, 16 Jan 2012 12:00:00 GMT</pubDate></item><item><title>OpenMP vs. POSIX threads</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/01/13/openmp_vs_posix_threads.html</link><description>&lt;P&gt;

A few years ago I heard about &lt;a
href="http://en.wikipedia.org/wiki/OpenMP"&gt;OpenMP&lt;/a&gt;. It's a form of
multi-threaded programming meant to make good use of multiprocessor
and multicore hardware.

&lt;/P&gt;&lt;P&gt;

Earlier this year, I read &lt;a
href="http://dx.doi.org/10.1021/ci200235e"&gt;Anatomy of High-Performance 2D
Similarity Calculations&lt;/a&gt;, which used OpenMP as part of their
Tanimoto search algorithm. This summer, Kim Walisch contributed OpenMP variations to my
&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html"&gt;popcount
benchmark&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

The changes to the code were almost trivial, so I asked Kim if he
would help me add OpenMP support to &lt;a
href="http://code.google.com/p/chem-fingerprints/"&gt;chemfp&lt;/a&gt;. He did,
and it will be part of the upcoming 1.1 release.

&lt;/P&gt;&lt;P&gt;

OpenMP is only one of many ways to make effective use of multiple
processors. Another common way is through POSIX threads, or its
equivalent on Windows. A third is to spawn off a new process and use
IPC to communicate with it.

&lt;/P&gt;&lt;P&gt;

How do these techniques compare to each other?

&lt;/P&gt;&lt;P&gt;

I decided to use Python 3.2's new &lt;a
href="http://docs.python.org/dev/library/concurrent.futures.html"&gt;concurrent.futures&lt;/a&gt;
module to handle the multithreaded and multiprocess cases, or rather,
the &lt;a href="http://pypi.python.org/pypi/futures"&gt;backport to
2.x&lt;/a&gt;. This merges the underlying "threading" and "multiprocessing"
APIs into a common form, based on the "&lt;a
href="http://en.wikipedia.org/wiki/Futures_and_promises"&gt;future&lt;/a&gt;"
concept for asynchronous programming. I quickly found that the
multiprocess API had too much overhead so I won't talk about it.

&lt;/P&gt;&lt;P&gt;

My test case computes and stores the NxN Tanimoto similarity matrix
between a set of fingerprints. I took N=110885 compounds from the ZINC
data set and generated 2048-bit fingerprints using RDKit's hash
fingerprint. The chemfp fingerprint type is "RDKit-Fingerprint/1
minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=4 useHs=1".

&lt;/P&gt;&lt;P&gt;

I don't actually save the 12,295,483,225 values. For one, simple
symmetry reduces that to 6,147,686,170 values. For another, in the
problems I'm interested in, I can ignore similarities below a
threshold. Instead, chemfp internally uses a sparse matrix format.
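
As a quick check of that arithmetic, here is a minimal sketch. The
variable names are only illustrative, and it assumes the symmetric count
leaves out the diagonal, which is what reproduces the second number:

```python
n = 110885  # number of fingerprints, from the text

full_matrix = n * n                 # every (row, column) pair
upper_triangle = n * (n - 1) // 2   # pairs with row != column, counted once

print(full_matrix)
print(upper_triangle)
```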

&lt;/P&gt;&lt;P&gt;

For this test I used a simple parallelization. I break up the rows of
the matrix into bands, and fill in the parts of the upper-right
triangle which are at or above the threshold. In the OpenMP version,
all the OpenMP threads work on a single band. In the
concurrent.futures version, each thread processes its own band. These
differences fall naturally out of how those two APIs work.
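
The one-band-per-task layout for concurrent.futures can be sketched
roughly as follows. This is not the chemfp code: the function names, the
band size, and the use of Python sets of "on" bit positions as
fingerprints are all assumptions made to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

def tanimoto(fp1, fp2):
    # Fingerprints are modeled as sets of "on" bit positions (an
    # assumption for brevity, not the packed format a real library uses).
    union = len(fp1.union(fp2))
    if union == 0:
        return 0.0
    return len(fp1.intersection(fp2)) / float(union)

def process_band(fingerprints, start, end, threshold):
    # Fill in the upper-right triangle entries for rows start..end-1,
    # keeping only the scores at or above the threshold.
    hits = []
    for i in range(start, end):
        for j in range(i + 1, len(fingerprints)):
            score = tanimoto(fingerprints[i], fingerprints[j])
            if score >= threshold:
                hits.append((i, j, score))
    return hits

def sparse_upper_triangle(fingerprints, threshold, band_size=2, max_workers=4):
    n = len(fingerprints)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # One task per band of rows; each task works independently.
        futures = [
            executor.submit(process_band, fingerprints, start,
                            min(start + band_size, n), threshold)
            for start in range(0, n, band_size)
        ]
        hits = []
        for future in futures:
            hits.extend(future.result())
    return hits
```

In CPython, pure-Python scoring like this will not actually run in
parallel because of the global interpreter lock. Threads only help when
the scoring happens in C code which releases the lock.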

&lt;/P&gt;&lt;P&gt;

Once I got both implementations working, debugged, and optimized, I
could finally collect some performance numbers. I wanted to see how OpenMP
and pthreads scale over a range of processors and a range of threshold
values. My desktop has two dual-core CPUs, so I decided to rent time
on a "High-CPU Extra Large Instance" Amazon EC2 node. It has 8 cores
of the form:

&lt;pre class="code"&gt;
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5410  @ 2.33GHz
stepping	: 10
cpu MHz		: 2333.336
cache size	: 6144 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall
 nx lm constant_tsc rep_good aperfmperf pni ssse3 cx16 sse4_1 hypervisor lahf_lm
bogomips	: 4666.67
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:
&lt;/pre&gt;

Sadly, this is &lt;i&gt;not&lt;/i&gt; a machine which supports the POPCNT
instruction, so the chemfp popcount code fell back to Imran Haque's SSSE3-based implementation.

&lt;/P&gt;&lt;P&gt;

The machine has about 6GB of free memory. I knew from testing on my
desktop that I only needed 4.1 GB for the worst case, so I didn't run
into memory problems.

&lt;/P&gt;&lt;P&gt;

I ran my benchmark code through a set of combinations of OpenMP
vs. pthreads, with thread counts from 1 to 8, and with threshold
values of 1.0, 0.99, 0.97, 0.95, 0.93, 0.9, 0.88, 0.85, 0.8, 0.75,
0.7, 0.65, 0.6, 0.55, and 0.5. I also ran the values several times in
order to get some idea of the timing variations. Yes, that took about
20 hours to run.

&lt;/P&gt;&lt;P&gt;

You can get the &lt;a
href="http://dalkescientific.com/writings/diary/timings.amazon_no_popcnt.csv"&gt;raw
data&lt;/a&gt; if you're really interested. I used matplotlib to make a
couple of 3D plots. First, here are the overall times:

&lt;br /&gt;
&lt;img src="http://dalkescientific.com/writings/diary/ssse3_times.png"&gt;
&lt;br /&gt;

You can see that the OpenMP code (in red) is usually faster than the
pthread code (in blue). The exception is for thresholds of 0.55 and
lower. BTW, a threshold of 0.5 finds 285,371,794 matches in the NxN
matrix, which means this stores a few gigabytes of data.

&lt;/P&gt;&lt;P&gt;

To make more sense of this data, here's a plot of the speedup, defined
as T0/T where T0 is the fastest single-threaded time for a given
threshold and T is the fastest time for a given number of
threads. Perfect speedup would give a value of 8 for 8 processors.


&lt;br /&gt;
&lt;img src="http://dalkescientific.com/writings/diary/ssse3_scalings.png"&gt;
&lt;br /&gt;

The region near threshold=1.0 is so jagged because the search time is
less than the variability in the system, and is close to the minimum
time resolution size of 1 second.
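
The speedup definition used for that plot can be sketched as below. The
function name and the timings are made up for illustration; they are not
taken from the raw data.

```python
def speedups(times_by_threads):
    # times_by_threads maps a thread count to the elapsed times (in
    # seconds) from repeated runs at a single threshold.
    t0 = min(times_by_threads[1])  # fastest single-threaded time
    return dict((n, t0 / min(times)) for n, times in times_by_threads.items())

# Made-up timings, for illustration only.
example = {1: [8.0, 8.4], 2: [4.0, 4.5], 8: [2.0, 2.1]}
result = speedups(example)
print(result)
```

With these made-up numbers, 8 threads give a speedup of 4, which is half
of the perfect value.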

&lt;/P&gt;&lt;P&gt;

Most people in this field use thresholds in the range 0.7-1.0. It's
obvious from this graph that OpenMP is the right solution. It's almost
always faster, and overall it makes much better use of a multi-core
machine.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/01/13/openmp_vs_posix_threads.html</guid><pubDate>Fri, 13 Jan 2012 12:00:00 GMT</pubDate></item><item><title>ECFP-like fragments in PubChem</title><link>http://www.dalkescientific.com/writings/diary/archive/2012/01/03/unique_ecfp_like_fragments.html</link><description>&lt;P&gt;

Previously I posted about &lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2011/12/25/unique_fragments_in_pubchem.html"&gt;unique
fragments in PubChem&lt;/a&gt;. That used my &lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2011/01/13/faster_subgraph_enumeration.html"&gt;molecular
subgraph enumeration&lt;/a&gt; algorithm. In this essay I'll report some
results from looking at the unique bit counts from RDKit's
MorganFingerprint algorithm, which is an ECFP variant.

&lt;/P&gt;&lt;P&gt;

My first graph in the previous essay shows that there are about 2
million unique fragments of size up to 7 in PubChem, and that the
second 1/2 of the data files contained few fragments which weren't in
the first 1/2. This suggests that there aren't that many substructures
of size 7, compared to the number of possible structures of size 7,
which is quite curious.

&lt;/P&gt;&lt;P&gt;

Rather, I expect that the number of unique fragments should level off
with enough molecules. In the simplest case, there are 112 elements
and 5 elements which can be aromatic, for a total of 117 possible
unique atom types. I found 110 of them in my PubChem subset.

&lt;/P&gt;&lt;P&gt;

Similarly, I found only 1103 unique fragments with two atoms. The
breakdown as a function of fragment length is:

&lt;ul&gt;
 &lt;li&gt;Length 1: &lt;tt&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;110&lt;/tt&gt; unique fragments
 &lt;li&gt;Length 2: &lt;tt&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;1103&lt;/tt&gt; unique fragments
 &lt;li&gt;Length 3: &lt;tt&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;4209&lt;/tt&gt; unique fragments
 &lt;li&gt;Length 4: &lt;tt&gt;&amp;nbsp;&amp;nbsp;19398&lt;/tt&gt; unique fragments
 &lt;li&gt;Length 5: &lt;tt&gt;&amp;nbsp;&amp;nbsp;86336&lt;/tt&gt; unique fragments
 &lt;li&gt;Length 6: &lt;tt&gt;&amp;nbsp;364342&lt;/tt&gt; unique fragments
 &lt;li&gt;Length 7: &lt;tt&gt;1447488&lt;/tt&gt; unique fragments
&lt;/ul&gt;

(However, there are 2199 fragments which my SMILES atom count code
didn't parse. I'll need to figure out what caused it to fail.)

&lt;/P&gt;&lt;P&gt;

How many possible substructures are there of size 7? Assuming only 10
atom types and ignoring cycles and different bond types gives about 10
million. There are &lt;a
href="http://oeis.org/A001349"&gt;1+1+2+6+21+112+853=996&lt;/a&gt; connected
subgraphs of up to size 7, and I'll guess that 800 of them are
chemically accessible. I'll hazard 2**6 different bond types, so about
500 billion possible substructures. That's over five orders of
magnitude larger than what I see.
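
That back-of-the-envelope estimate can be reproduced with a few lines.
The variable names are only illustrative, and the 800 and 2**6 factors
are the guesses from the text, not measured values:

```python
atom_arrangements = 10 ** 7      # 10 atom types in each of 7 positions
connected_graphs = sum([1, 1, 2, 6, 21, 112, 853])  # sizes 1 through 7
accessible_graphs = 800          # the guess from the text
bond_assignments = 2 ** 6        # the guess from the text

possible = atom_arrangements * accessible_graphs * bond_assignments
observed = 2000000               # roughly what the enumeration found
ratio = possible // observed

print(connected_graphs, possible, ratio)
```

The product is 512 billion, or about 256,000 times the roughly 2 million
fragments actually observed.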

&lt;/P&gt;&lt;P&gt;

It's curious because it means that any substructure-based fingerprint
using up to 7 atoms has at most a small multiple of ~2,000,000
different values. Sure, a hash algorithm might potentially generate
values in the range 0-2&lt;sup&gt;32&lt;/sup&gt;, but only a few million of them
will actually be generated.

&lt;/P&gt;&lt;P&gt;

(Some algorithms will generate multiple bits per feature, e.g., features
with 1 or 2 atoms generate a single bit while features with 3 or more
atoms generate two bits. This acts as a simple weighting
scheme. There's a perfect correlation between those two bits, so I'm
not counting the second one as meaningful.)

&lt;/P&gt;&lt;P&gt;

Fingerprint statistical models often assume a roughly Bernoulli
process, and the number of unique features is unbounded, though
increasingly rare. This observation suggests that there is actually an
upper limit to the size, which changes the distribution type slightly.

&lt;/P&gt;&lt;P&gt;

Is this observable in other fingerprints? I used RDKit's
MorganFingerprint algorithm, which is a variation of the ECFP
"extended connectivity fingerprint." I used radii values of 1, 2, 3,
4, and 5, with the other parameters at their defaults. Each step
includes increasingly distant information, so should be more diverse.


&lt;/P&gt;&lt;P&gt;

The following shows the number of unique bits found as a function of
the number of molecules processed. The molecules are ranked by PubChem
id, although it doesn't include all of the PubChem structures since
RDKit couldn't process all of the structures. (It complains about a
number of bad valences.) A more complete analysis would randomize the
structures to remove local coherence effects.

&lt;/P&gt;&lt;P&gt;

&lt;br /&gt;
&lt;img src="http://dalkescientific.com/writings/diary/unique_morgan_counts.png"&gt;
&lt;br /&gt;

That graph is almost impossible to understand because the dynamic
range is so large. Log and log-log scales don't help either. The
best solution is to normalize by the maximum number for each
graph. That gives:

&lt;br /&gt;
&lt;img src="http://dalkescientific.com/writings/diary/normalized_unique_morgan_counts.png"&gt;
&lt;br /&gt;

&lt;/P&gt;&lt;P&gt;

This sort of curve is a &lt;a
href="http://en.wikipedia.org/wiki/Species_discovery_curve"&gt;species
discovery curve&lt;/a&gt;. It appears to show that the
MorganFingerprint(radius=1) saturates at around 100,000 different bit
values, and radius=2 might saturate at around 3.5 million unique
values. This makes sense, as the radius=2 fingerprint corresponds to
about 6-7 atoms, and I found 2 million unique values. (An average
branching factor of 2.5 gives 2.5**2 or 6.25 atoms. However, a local
branching factor of 3 gives 9 atoms, which adds more unique values to
the fingerprint.)

&lt;/P&gt;&lt;P&gt;

I'll guess that there's under 40 million unique bits for radius=3 but
it becomes harder to estimate. As the radius increases, the trend in
the diversity of new values clearly gets closer to linear, which means
there's less and less saturation. I can't predict the total number of
unique values for radius=5 because it's still too flat.

&lt;/P&gt;&lt;P&gt;

The species growth curve is often fit as A(1-exp(-bx)) or
A(log(1+b*x)). The first has a fixed limit, the latter implies there
is no upper bound. This case is somewhere in the middle: for good
chemical reasons, there's a large but finite number of possible ways
to arrange a fixed number of atoms. For the smallest fragment size (1
atom), we are at that limit. For larger sizes, we are nowhere near the
chemical limit, and I think a log fit works best.
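
To make the difference between the two shapes concrete, here is a small
sketch. It assumes the logarithmic form is A*log(1 + b*x), which is zero
at x = 0, and the parameter values are arbitrary rather than fitted to
the PubChem data:

```python
import math

def saturating(x, a, b):
    # A*(1 - exp(-b*x)): approaches the fixed limit A as x grows.
    return a * (1.0 - math.exp(-b * x))

def logarithmic(x, a, b):
    # A*log(1 + b*x): keeps growing, ever more slowly, with no upper bound.
    return a * math.log1p(b * x)

# Arbitrary parameters, for illustration only.
for x in (0, 1000, 1000000):
    print(x, saturating(x, 100.0, 0.01), logarithmic(x, 100.0, 0.01))
```

The first column of results levels off at 100 while the second keeps
climbing.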

&lt;/P&gt;&lt;P&gt;

Equally obvious, I would need to randomize the input order in order to
get a smoother curve so I could make that prediction. But the end of
the holidays was a couple of days ago and I need to get back to paying
work.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2012/01/03/unique_ecfp_like_fragments.html</guid><pubDate>Tue, 03 Jan 2012 12:00:00 GMT</pubDate></item><item><title>Unique fragments in PubChem</title><link>http://www.dalkescientific.com/writings/diary/archive/2011/12/25/unique_fragments_in_pubchem.html</link><description>&lt;P&gt;

For reasons I'll get into later, I wanted to get an idea of the
subgraph distribution of PubChem. That is, given my method for &lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2011/01/13/faster_subgraph_enumeration.html"&gt;molecular
subgraph enumeration&lt;/a&gt;, create all subgraphs of up to size 7 atoms
and get an idea of how common they are. More specifically, atom
uniqueness depends only on the atomic element and aromaticity, as
assigned by OEChem, and the unique bond categories are
"single-or-aromatic", double, and triple.

&lt;/P&gt;&lt;P&gt;

Last month I downloaded 2,138 sdf.gz files from PubChem and did
structure perception with OpenEye's OEChem. Starting a couple of weeks
ago, I used my subgraph enumeration algorithm to process 1,724 of
them. For some reason, it stopped at that point. Since it took 7.5
days to process those files, and the data set is already a bit
ungainly, I decided to leave the full analysis for another time and to
not figure out what happened with the processing.


&lt;/P&gt;&lt;P&gt;

In the 1,724 files are 21,570,907 PubChem records and my enumeration
found 1,925,185 unique substructures.

&lt;/P&gt;&lt;P&gt;

I kept track of the number of unique fragments per input file and the
running total number of unique fragments over all of the files,
plotted here:


&lt;br /&gt;&lt;img src="http://dalkescientific.com/writings/diary/pubchem_unique_per_filename.png"&gt;&lt;br /&gt;


You can see that 50% of the unique fragments are in the first 25% of
the data files and essentially all are found in the first 50% of the
files. (The number does increase after the 1000th file, but it's very
slow.) It's also interesting to see the internal structural diversity
in the different files. I suspect there are some large regions made
from contributed combinatorial libraries.

&lt;/P&gt;&lt;P&gt;

The unique fragments which occur in the largest number of records are:

&lt;pre class="code"&gt;
21387437 C
20195255 O
19959057 c
19892743 cc
19755355 ccc
19457485 cccc
19270867 CC
19015890 ccccc
18599872 cccccc
18488545 c1ccccc1
18386628 N
17672171 Cc
17324074 Ccc
17109361 CN
16985355 Cccc
16533358 C=O
16522121 Ccccc
15993406 Cc(c)c
15759069 Cc(c)cc
15508521 Cccccc
&lt;/pre&gt;

You shouldn't be surprised to see that carbon is found in 21,387,437
of the 21,570,907 structures.

&lt;/P&gt;&lt;P&gt;

I made a distribution plot of the fragments, where the horizontal axis
is rank order (C, then O, c, cc, and so on). I show it at a few different
scales in order to get a better understanding of the
distribution. It's quite obviously *not* a Zipf distribution.

&lt;br /&gt;&lt;img src="http://dalkescientific.com/writings/diary/pubchem_fragment_distribution.png"&gt;&lt;br /&gt;


&lt;/P&gt;&lt;P&gt;

The vertical axis is the count in millions. You can see that the
10,000th most common substructure is in a very small percentage of the
structures; it's actually 0.5%.

&lt;/P&gt;&lt;P&gt;

At the other end of the list, 478,278 fragments (24.8%) exist only
once (like C#NF), 251,372 fragments (13.1%) exist twice (like B#[Cr]),
and 132,574 fragments (6.89%) exist thrice. Here are the first 20 values
as a table,

&lt;pre class="code"&gt;
1 478278  # In other words, 478,278 substructures exist only once in the data set
2 251372
3 132574
4 100665
5 67536
6 57500
7 42959
8 37983
9 31750
10 28684
11 24016
12 23169
13 18695
14 17659
15 15501
16 14717
17 13452
18 12500
19 11394
20 11276
&lt;/pre&gt;

and in graphical form.

&lt;br /&gt;&lt;img src="http://dalkescientific.com/writings/diary/substructure_uniqueness.png"&gt;&lt;br /&gt;

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2011/12/25/unique_fragments_in_pubchem.html</guid><pubDate>Sun, 25 Dec 2011 12:00:00 GMT</pubDate></item><item><title>Inverted index using Python sets</title><link>http://www.dalkescientific.com/writings/diary/archive/2011/12/23/inverted_index.html</link><description>&lt;P&gt;

I am working on a problem which is very similar to an inverted
index. What's an
&lt;a href="http://en.wikipedia.org/wiki/Inverted_index"&gt;inverted index&lt;/a&gt;?
Suppose I want to find a book on "sharks." I could search every book
in the library until I find one on sharks, but that's tedious, and
every time I do a new search I would need to re-search all of the
books again. (Hopefully I would remember some of what I read the first
time!)

&lt;/P&gt;&lt;P&gt;

Instead, I can invert the task. For every word in every book, make a
list of books which contain the word. That also takes a lot of work
but I only need to do it once. To search for a book about sharks, I
look up the word "shark" and get the small list of candidate books,
which I need to search manually. I still need to search them because
while &lt;i&gt;&lt;a href="http://www.gutenberg.org/ebooks/2488"&gt;20,000 Leagues
Under the Sea&lt;/a&gt;&lt;/i&gt; mentions sharks, it isn't about sharks, and
neither is a book which mentions a "loan shark" or the "&lt;a
href="http://en.wikipedia.org/wiki/Land_Shark_(Saturday_Night_Live)"&gt;land
shark&lt;/a&gt;" skit.

&lt;/P&gt;&lt;P&gt;

This mapping from each word to the books which contain it is called an
"inverted index."

&lt;/P&gt;
&lt;h2&gt;Compound term searches&lt;/h2&gt;
&lt;P&gt;

I can use it as the basis for more complex queries. I want to find
books which mention "whale shark". If the inverted index contains only
single words then I get the list of books containing the word "shark"
and manually search those for "whale shark", but it would be better if
I combined the list of books containing "whale" and the list of books
containing "shark" to make a new list of those books containing both
"whale" and "shark."

&lt;/P&gt;&lt;P&gt;

In other words, I find the intersection of those two lists.
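
Here is a minimal sketch of that idea using a made-up three-book
library. The titles and text are placeholders, and the variable names
are only illustrative:

```python
import collections

# A made-up library: placeholder titles mapped to placeholder text.
library = {
    "book1": "the whale shark is the largest fish",
    "book2": "the white whale",
    "book3": "a loan shark in a suit",
}

# Map each word to the set of books which contain it.
inverted_index = collections.defaultdict(set)
for title, text in library.items():
    for word in text.split():
        inverted_index[word].add(title)

# Books containing both "whale" and "shark".
both = inverted_index["whale"].intersection(inverted_index["shark"])
print(sorted(both))
```

The result is still only a list of candidates. A book could mention a
whale in one chapter and a shark in another, so the phrase "whale shark"
has to be checked in each candidate.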

&lt;/P&gt;
&lt;h2&gt;An inverted index for letters in words&lt;/h2&gt;
&lt;P&gt;

It's very easy to create an inverted index using Python's
"&lt;a href="http://docs.python.org/library/stdtypes.html#set"&gt;set type&lt;/a&gt;."
Instead of the usual case of searching a book (or document) for words,
I'll show an example of how to search words for letters. On my
computer, the file
"&lt;a href="http://en.wikipedia.org/wiki/Words_(Unix)"&gt;/usr/share/dict/words&lt;/a&gt;"
contains 234936 different English words, with one word per line, of
which 233614 are unique after I convert everything to lowercase.

&lt;/P&gt;&lt;P&gt;

I'll turn that into an inverted index where each letter is mapped to
the set of words which contain that letter:

&lt;pre class="code"&gt;
import collections
inverted_index = collections.defaultdict(set)
for line in open("/usr/share/dict/words"):
    word = line.strip().lower()  # ignore case
    for letter in word:
        inverted_index[letter].add(word)
&lt;/pre&gt;

I'll check the number of inverted indices (there should be one for
each letter), and I'll show the sizes of a couple of them.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; len(inverted_index)
26
&amp;gt;&amp;gt;&amp;gt; len(inverted_index["a"])
144086
&amp;gt;&amp;gt;&amp;gt; len(inverted_index["j"])
2993
&lt;/pre&gt;

This means there are 144086 unique lower-cased words with an "a" or
"A" in them, but only 2993 with a "j" or "J". (From here on I'll only
mention the lower-case letter even when I mean either lower or upper
case.)  How many words have both an "a" and a "j" in them?

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; len(inverted_index["a"] &amp; inverted_index["j"])
1724
&lt;/pre&gt;

&lt;h3&gt;sorted() and heapq.nsmallest()&lt;/h3&gt;
What are the first 5 of them, alphabetically?

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; sorted(inverted_index["a"] &amp; inverted_index["j"])[:5]
['abject', 'abjectedness', 'abjection', 'abjective', 'abjectly']
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

I used
&lt;a href="http://docs.python.org/library/functions.html#sorted"&gt;sorted()&lt;/a&gt;

because it's a builtin function. However, while it works, it's a bit
wasteful to sort the entire list of 1724 items when I only want the
first 5. If you need something faster, try
&lt;a href="http://docs.python.org/library/heapq.html#heapq.nsmallest"&gt;heapq.nsmallest&lt;/a&gt;
(and
&lt;a href="http://docs.python.org/library/heapq.html#heapq.nlargest"&gt;heapq.nlargest&lt;/a&gt;).
It can be faster because it only worries about sorting the needed subset:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; import heapq
&amp;gt;&amp;gt;&amp;gt; heapq.nsmallest(5, inverted_index["a"] &amp; inverted_index["j"])
['abject', 'abjectedness', 'abjection', 'abjective', 'abjectly']
&lt;/pre&gt;

A quick timing test shows that heapq.nsmallest is about 40% faster than sorted()[:5].
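
A timing comparison along those lines might look like the sketch below.
It uses randomly generated stand-in strings rather than the real word
list, so the absolute numbers, and possibly even the ratio, will differ
from machine to machine:

```python
import heapq
import random
import timeit

random.seed(0)
# Stand-in data: 1724 random 8-letter "words" (not the real word list).
words = ["".join(random.choice("abcdefghij") for _ in range(8))
         for _ in range(1724)]

first_sorted = sorted(words)[:5]
first_heap = heapq.nsmallest(5, words)

# Time 200 repetitions of each approach.
t_sorted = timeit.timeit(lambda: sorted(words)[:5], number=200)
t_heap = timeit.timeit(lambda: heapq.nsmallest(5, words), number=200)
print("sorted()[:5]: %.4f s, heapq.nsmallest: %.4f s" % (t_sorted, t_heap))
```

Both approaches return the same five items; only the amount of work
differs.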

&lt;/P&gt;
&lt;h2&gt;Inverted index as a search filter&lt;/h2&gt;
&lt;P&gt;

What about a harder search? Which words contain all 6 vowels
(including y) in alphabetical order? The simplest solution is a linear
search using a regular expression:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; import re
&amp;gt;&amp;gt;&amp;gt; sequential_vowels = re.compile("a.*e.*i.*o.*u.*y")
&amp;gt;&amp;gt;&amp;gt; words = [line.strip() for line in open("/usr/share/dict/words")
...                             if sequential_vowels.search(line)]
&amp;gt;&amp;gt;&amp;gt; len(words)
8
&amp;gt;&amp;gt;&amp;gt; words
['abstemiously', 'adventitiously', 'auteciously', 'autoeciously', 'facetiously',
'pancreaticoduodenostomy', 'paroeciously', 'sacrilegiously']
&lt;/pre&gt;

This works, but if this type of search will occur often, and if
there's enough memory, and if there's a performance need, then it's
easy to speed up using an inverted index.

&lt;/P&gt;
&lt;h3&gt;Filter using the inverted index&lt;/h3&gt;
&lt;P&gt;

I'll split this search into two stages. The first will filter out the
obvious mismatches, and leave a smaller set of candidates for the
second stage.

&lt;/P&gt;&lt;P&gt;

For the first stage, I'll use the inverted index to find the words
which contain all of the vowels. The inverted index doesn't know the
character order, so once I find the candidates with all of the letters
then I'll use the regular expression test from before.

&lt;/P&gt;&lt;P&gt;

To find the set of words with all of the vowels, I could continue the
sequence of "&amp;" binary operators as I did earlier, but that gets to be
clumsy when there are six terms. Instead, I'll call
&lt;a href="http://docs.python.org/library/stdtypes.html#set.intersection"&gt;set.intersection()&lt;/a&gt;
with the intersection sets as parameters:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; vowels = [inverted_index[c] for c in "aeiouy"]
&amp;gt;&amp;gt;&amp;gt; len(set.intersection(*vowels))
670
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

The variable "vowels" contains a list of sets, "*vowels" in a function
call turns that list parameter into individual parameters, and
"set.intersection" creates a new set which is the intersection of all
of the sets in vowels.
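
A toy example may make the argument unpacking clearer. The three small
sets here are made up, and the three spellings are meant to show that
they produce the same result:

```python
sets = [set("abcde"), set("bcdef"), set("cdefg")]

# "*sets" passes each element of the list as a separate argument.
unpacked = set.intersection(*sets)
spelled_out = set.intersection(sets[0], sets[1], sets[2])

# The same intersection, called as a method of the first set.
method_style = sets[0].intersection(*sets[1:])

print(sorted(unpacked))
```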

&lt;/P&gt;&lt;P&gt;

("set.intersection" used here is actually an unbound method, and it's
pretty rare to find Python code where using an unbound method makes
sense. The above code is almost identical to
"vowels[0].intersection(*vowels[1:])".)

&lt;/P&gt;&lt;P&gt;

I used two lines above, mostly for clarity. For myself I would
probably put it into a single line:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; len(set.intersection(*(inverted_index[c] for c in "aeiouy")))
670
&lt;/pre&gt;

Yes, the inverted index search can be done in a single line!

&lt;/P&gt;
&lt;h3&gt;Testing the candidates&lt;/h3&gt;
&lt;P&gt;

The first stage reduced the search space from 233614 words to 670. In
the second stage I'll use the regular expression to check which ones
contain the vowels in order.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; candidates = set.intersection(*(inverted_index[c] for c in "aeiouy"))
&amp;gt;&amp;gt;&amp;gt; [word for word in candidates if sequential_vowels.search(word)]
['autoeciously', 'adventitiously', 'facetiously', 'abstemiously', 'sacrilegiously',
'auteciously', 'pancreaticoduodenostomy', 'paroeciously']
&lt;/pre&gt;

You can verify that it finds the same matches as before, although
because sets are unordered, the result order has changed.

&lt;/P&gt;&lt;P&gt;

I've shown that the inverted index can be used to make the code more
complicated. Is it worthwhile? That is, how much faster is the new
code?

&lt;/P&gt;&lt;P&gt;

My timings show that the old version (which does 233614 regular
expression searches) takes 0.092 seconds to run, while the new one
takes 0.056 seconds. It's about 40% faster.

&lt;/P&gt;
&lt;h2&gt;Use integers as set elements, not strings&lt;/h2&gt;
&lt;P&gt;

It's easy to go faster still. The core of Python's set intersection
works something like this:

&lt;pre class="code"&gt;
new_set = set()
for element in set1:
    if element in set2:
        new_set.add(element)
&lt;/pre&gt;

This requires a hash comparison for every element in set1. If that
passes then there's an equality test in set2, and if that passes then
there's another hash and possible equality test to insert into
new_set.

&lt;/P&gt;&lt;P&gt;

The set elements are currently strings. String hashing and comparisons are
heavily optimized in Python, but integers are even faster. What if I used
an index into a word list rather than the word itself? The
corresponding code is:

&lt;pre class="code"&gt;
import collections

# Get all of the words into a list of words.
# Ignore words which are the same except for capitalization.
unique_words = set(line.strip().lower() for line in open("/usr/share/dict/words"))
words = sorted(unique_words)

# Map from character to the set of word indices
inverted_index = collections.defaultdict(set)
for i, word in enumerate(words):
    for c in word:  # the words are already lowercase
        inverted_index[c].add(i)
&lt;/pre&gt;

I can do the inverted index operations just like before, but the
result is a list of indices into the "words" list. For example:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; set.intersection(*(inverted_index[c] for c in "jkx"))
set([99716, 98492])
&amp;gt;&amp;gt;&amp;gt; for i in set.intersection(*(inverted_index[c] for c in "jkx")):
...   print words[i]
... 
jukebox
jackbox
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

I'll modify the "sequential_vowels" code to use the index-based
inverted index instead of the string-based version:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; candidates = set.intersection(*(inverted_index[c] for c in "aeiouy"))
&amp;gt;&amp;gt;&amp;gt; [words[i] for i in candidates if sequential_vowels.search(words[i])]
['pancreaticoduodenostomy', 'adventitiously', 'abstemiously', 'auteciously', 'sacrilegiously', 'autoeciously', 'facetiously', 'paroeciously']
&lt;/pre&gt;

My timing numbers give 0.029 seconds per search, with nearly all of
the time spent in the set intersection. Remember that the brute-force
linear search takes 0.094 seconds and the original inverted index
takes 0.056 seconds, so switching to integer-based indices brings
another factor of two performance gain. The overall search with an
inverted index is 3x faster than the original regex-based linear
search.

&lt;/P&gt;
&lt;h3&gt;Order-dependent performance&lt;/h3&gt;
&lt;P&gt;

Python's set.intersection() actually works more like this:

&lt;pre class="code"&gt;
def intersection(*args):
  left = args[0]
  # Perform len(args)-1 pairwise-intersections
  for right in args[1:]:

    # Tests take O(N) time, so minimize N by choosing the smaller set
    if len(left) &gt; len(right):
      left, right = right, left

    # Do the pairwise intersection
    result = set()
    for element in left:
      if element in right:
        result.add(element)
    
    left = result  # Use as the start for the next intersection

  return left
&lt;/pre&gt;

That is, it does a series of pair-wise reductions over the list of
sets. The order of the set operations affects the performance! Recall
that there are only 1724 words with "j" in them. If I search for words
with "m", "a", and "j" in them, as

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; len(set.intersection(*(inverted_index[c] for c in "maj")))
286
&lt;/pre&gt;

then Python computes the intersection of inverted_index["m"] and
inverted_index["a"], giving an intermediate set with 41148 hits, which
it then intersects with the 1724 "j" elements.

&lt;/P&gt;&lt;P&gt;

However, if the search order were "jma" then the intermediate set for
the intersection of "j" and "m" has only 450 elements, which means
only 450 tests against inverted_index["a"].

&lt;/P&gt;&lt;P&gt;

Both give the same answer, but one requires a lot more work than the
other.
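
&lt;/P&gt;&lt;P&gt;

To make "a lot more work" concrete, here is a small sketch which
counts the membership tests for each ordering. It follows the
simplified Python model above, not CPython's actual C code, and the
three sets are made-up stand-ins for the "j", "m", and "a" index sets,
so only the relative counts matter.

&lt;pre class="code"&gt;
def count_tests(*sets):
    # Same pairwise strategy as the model above, but tallying the
    # number of "element in right" tests instead of timing them.
    left = sets[0]
    num_tests = 0
    for right in sets[1:]:
        # Always iterate over the smaller set of the pair
        left, right = sorted((left, right), key=len)
        num_tests += len(left)
        left = set(element for element in left if element in right)
    return len(left), num_tests

# Hypothetical data: one small set and two large, overlapping ones
small = set(range(0, 100000, 50))     # 2000 elements
large1 = set(range(0, 100000, 2))     # 50000 elements
large2 = set(range(0, 100000, 3))     # 33334 elements

for name, order in (("small first", (small, large1, large2)),
                    ("small last", (large1, large2, small))):
    print("%s: %d hits after %d tests" % ((name,) + count_tests(*order)))
&lt;/pre&gt;

Starting with the small set needs 4000 tests to find the 667 common
elements, while leaving it for last needs 35334.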

&lt;/P&gt;&lt;P&gt;

Here's evidence of just how big that impact is:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; def go(letters):
...   t1 = time.time()
...   for i in range(1000):
...     x = len(set.intersection(*(inverted_index[c] for c in letters)))
...   return x, (time.time()-t1)
... 
&amp;gt;&amp;gt;&amp;gt; go("jma")
(286, 0.12344503402709961)
&amp;gt;&amp;gt;&amp;gt; go("jam")
(286, 0.2954399585723877)
&amp;gt;&amp;gt;&amp;gt; go("amj")
(286, 6.223098039627075)
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

Yes, it's a factor of 50 between the slowest and fastest ordering!

&lt;/P&gt;&lt;P&gt;

There are some obvious ways to make the Python code faster. The
easiest is to process the sets from smallest size to largest. That was
proposed in &lt;a href="http://bugs.python.org/issue3069"&gt;Issue3069&lt;/a&gt;
on 2008-06-10, but the patch was not integrated into the Python
codebase.

&lt;/P&gt;&lt;P&gt;

However, that's not necessarily the best strategy. Suppose I want
words with "q", "u", and "c" in them. There are 3619, 75144, and
85679 words with those letters, respectively, so you might think the
best sort order is "quc". Testing that hypothesis:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; go("quc")
(842, 0.40471911430358887)
&amp;gt;&amp;gt;&amp;gt; go("qcu")
(842, 0.2493000030517578)
&lt;/pre&gt;

shows that "qc" is the more selective pair. This is because "q" and
"u" are highly correlated; there are only 

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; len(inverted_index["q"] - inverted_index["u"])
14
&lt;/pre&gt;

14 words in the data set with a 'q' but without a 'u', while there are
2776 words with a 'q' and no 'c'.

&lt;/P&gt;&lt;P&gt;

Some years ago I came up with a &lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2005/03/02/faster_fingerprint_substructure_tests.html"&gt;dynamic
algorithm&lt;/a&gt; which tries to prefer the set which least matches the
reference set.

&lt;/P&gt;&lt;P&gt;

For the case of three sets, it simplifies to:

&lt;pre class="code"&gt;
def set_intersection3(*input_sets):
    N = len(input_sets)
    assert N == 3
    min_index = min(range(len(input_sets)), key=lambda x: len(input_sets[x]))
    best_mismatch = (min_index+1)%N

    new_set = set()
    for element in input_sets[min_index]:
        # This failed to match last time; perhaps it's a mismatch this time?
        if element not in input_sets[best_mismatch]:
            continue

        j = 3-best_mismatch-min_index
        # If the element isn't in the set then perhaps this
        # set is a better rejection test for the next input element
        if element not in input_sets[j]:
            best_mismatch = j
        else:
            # The element is in all of the other sets
            new_set.add(element)
    return new_set
&lt;/pre&gt;

while the version which sorts the sets by size is:

&lt;pre class="code"&gt;
def set_intersection_sorted(*input_sets):
    input_sets = sorted(input_sets, key=len)
    new_set = set()
    for element in input_sets[0]:
        if element in input_sets[1]:
            if element in input_sets[2]:
                new_set.add(element)
    return new_set
&lt;/pre&gt;

Here's a head-to-head comparison between the three versions:

&lt;style&gt;
 #intersection tr th {
    border-bottom: 1px solid black;
 }
 #intersection td {
   border-right: 1px solid grey;
   text-align: center;
 }
 #intersection th {
   border-right: 1px solid grey;
   text-align: center;
 }
 #intersection td + td + td + td {
   border-right: none;
 }
 #intersection th + th + th + th {
   border-right: none;
 }
&lt;/style&gt;

&lt;center&gt;
&lt;table id="intersection" cellspacing="0"&gt;
&lt;tr style="border-bottom: 1px solid black"&gt;&lt;th&gt;&amp;nbsp;pattern&amp;nbsp;&lt;/th&gt;&lt;th&gt;&amp;nbsp;set.intersection&amp;nbsp;&lt;/th&gt;&lt;th&gt;&amp;nbsp;set_intersection3&amp;nbsp;&lt;/th&gt;&lt;th&gt;&amp;nbsp;set_intersection_sorted&amp;nbsp;&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;quc&lt;/td&gt;&lt;td&gt;0.462&lt;/td&gt;&lt;td&gt;0.852&lt;/td&gt;&lt;td&gt;1.032&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;qcu&lt;/td&gt;&lt;td&gt;0.312&lt;/td&gt;&lt;td&gt;0.842&lt;/td&gt;&lt;td&gt;1.032&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ucq&lt;/td&gt;&lt;td&gt;7.152&lt;/td&gt;&lt;td&gt;0.772&lt;/td&gt;&lt;td&gt;0.962&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;

which shows how (for this case) my dynamic algorithm is less sensitive
to the initial query order than set.intersection, and faster than the
sorted version. In fact, if each of the three orderings is equally
likely, then the average search time for the C code is more than twice
that of either Python version.

&lt;/P&gt;&lt;P&gt;

Of course it's not possible to drop in one of these alternate
algorithms because set.intersection can take iterables. These other
algorithms would only work for the special case where all of the
inputs are sets.
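
&lt;/P&gt;&lt;P&gt;

One way around that restriction, sketched here as an assumption rather
than anything in CPython, is a wrapper which uses the sort-by-size
strategy only when every input really is a set, and otherwise falls
back to the built-in behavior. The name "smart_intersection" is made up
for this example.

&lt;pre class="code"&gt;
def smart_intersection(first, *others):
    inputs = (first,) + others
    if not all(isinstance(x, (set, frozenset)) for x in inputs):
        # Some input is a list, generator, etc. It has no cheap len()
        # or fast membership test, so let the built-in handle it.
        return set(first).intersection(*others)
    # All inputs are sets: start from the smallest so the
    # intermediate results stay as small as possible.
    ordered = sorted(inputs, key=len)
    return set(ordered[0]).intersection(*ordered[1:])

print(sorted(smart_intersection(set("jukebox"), set("jackbox"), set("box"))))
print(sorted(smart_intersection(set("jukebox"), "jackbox", iter("box"))))
&lt;/pre&gt;

Both calls print ['b', 'o', 'x']. The copy of the smallest set costs a
little, but it keeps the return type a plain set even when the inputs
are frozensets.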

&lt;/P&gt;&lt;P&gt;

I've submitted this observation as &lt;a href="http://bugs.python.org/issue13653"&gt;Issue13653&lt;/a&gt;. I'm curious to see what will become of it.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2011/12/23/inverted_index.html</guid><pubDate>Fri, 23 Dec 2011 12:00:00 GMT</pubDate></item><item><title>Floats and doubles</title><link>http://www.dalkescientific.com/writings/diary/archive/2011/11/14/floats_and_doubles.html</link><description>&lt;P&gt;

In this essay I describe how using floating-point values, and
especially mixing float and double values, can cause subtle problems
in a threshold search.

&lt;/P&gt;&lt;P&gt;
Suppose you have a function which takes two objects (call them "query"
and "target") and returns a similarity score as a float or double.

&lt;pre class="code"&gt;
score = compute_similarity(query, target)
&lt;/pre&gt;

In my case, these are cheminformatics fingerprints. Given a set of
targets you can ask how many targets are at least "threshold" similar
to the given query:

&lt;pre class="code"&gt;
num_targets = 0
for target in targets:
  if compute_similarity(query, target) &gt;= threshold:
    num_targets += 1

# (This can be written as:
#   num_targets = sum(1 for target in targets
#                        if compute_similarity(query, target) &gt;= threshold)
# but unless you are used to that style it's hard to understand.)
&lt;/pre&gt;

Simple, right?

&lt;/P&gt;
&lt;h2&gt;float or double?&lt;/h2&gt;
&lt;P&gt;

I have &lt;a href="http://code.google.com/p/chem-fingerprints/"&gt;code
which does this&lt;/a&gt;. Up until a few days ago it returned a
double. However, the score is the fraction N/M where
0&amp;lt;=N&amp;lt;=M&amp;lt;=2**20. There's no need for 8 bytes of precision when
a &lt;a href="http://en.wikipedia.org/wiki/Single_precision"&gt;float&lt;/a&gt; is
clearly enough. For some of the problems I'm dealing with, that would
save a few hundred MB of space.

&lt;/P&gt;&lt;P&gt;

In theory I could even switch to fixed point, but it's easier to work
with a 32-bit float than a 32-bit fixed point integer (much less a 20
bit fixed point integer!).

&lt;/P&gt;&lt;P&gt;

I converted all of my doubles to float, but my unit tests
failed. After I fixed the places where I messed up on the size of
allocated space, I still had errors.

&lt;/P&gt;&lt;P&gt;

One failure came down to a difference in converting "0.9" to double,
and "0.9f" to float. Take a look at this simple program:

&lt;pre class="code"&gt;
#include &amp;lt;stdio.h&amp;gt;

int main(void) {
  printf("0.9: float %.20f double %.20f\n", 0.9f, 0.9);
  printf("0.3: float %.20f double %.20f\n", 0.3f, 0.3);
  return 0;
}
&lt;/pre&gt;

which outputs (with many more decimals than needed):

&lt;pre class="code"&gt;
0.9: float 0.89999997615814208984 double 0.90000000000000002220
0.3: float 0.30000001192092895508 double 0.29999999999999998890
&lt;/pre&gt;

The test calls out to C for the score calculations, but does the
threshold tests in Python. Python's floating-point type is based on a
C double. When the C code returns ((float)9)/10, the Python/C
interface converts that into the Python value 0.89999... . The
threshold test compares it against 0.900000... and rejects it for
being too low. It's only low by a smidgeon, but still, too low.
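
&lt;/P&gt;&lt;P&gt;

The same mismatch can be reproduced from Python alone. This sketch
uses the standard "struct" module to round a Python float (a C double)
to the nearest 32-bit float and back. The helper name "as_float32" is
made up for the example.

&lt;pre class="code"&gt;
import struct

def as_float32(value):
    # Pack as a 4-byte C float, then unpack back into a Python
    # float, which is a C double holding the rounded value.
    return struct.unpack("f", struct.pack("f", value))[0]

threshold = 0.9
score = as_float32(9.0 / 10)   # what the C code hands back

print("%.20f" % score)         # 0.89999997615814208984
print("%.20f" % threshold)     # 0.90000000000000002220
print(score >= threshold)      # False: the hit is rejected
&lt;/pre&gt;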

&lt;/P&gt;
&lt;h2&gt;A "smidgeon"&lt;/h2&gt;
&lt;P&gt;

A quick check of all the rational values in my range shows that about
1/2 the time the float value is less than the double, and about 1/2
the time it's the other way. Only about 0.5% of the time are they
identical values.
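
&lt;/P&gt;&lt;P&gt;

The check can be reproduced with something like the following
sketch. It is an illustration, not the original script: to keep it
quick it only walks denominators up to 2**8 rather than 2**20, so the
proportions it prints will differ from the ones quoted above (small
denominators include many more exactly representable fractions). The
helper names are hypothetical.

&lt;pre class="code"&gt;
import struct
from collections import Counter

def as_float32(value):
    # Round a Python float (C double) to the nearest C float
    return struct.unpack("f", struct.pack("f", value))[0]

def compare_fractions(max_denominator):
    counts = Counter()
    for m in range(1, max_denominator + 1):
        for n in range(0, m + 1):
            as_double = float(n) / m
            as_float = as_float32(as_double)
            if as_float == as_double:
                counts["same"] += 1
            elif as_double > as_float:
                counts["float is lower"] += 1
            else:
                counts["float is higher"] += 1
    return counts

counts = compare_fractions(2**8)
total = sum(counts.values())
for label in ("float is lower", "float is higher", "same"):
    print("%-16s %5.1f%%" % (label, 100.0 * counts[label] / total))
&lt;/pre&gt;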

&lt;/P&gt;&lt;P&gt;

The fundamental problem is that I'm mixing float and double values. If
I used only floats or only doubles then this wouldn't be a problem.

&lt;/P&gt;&lt;P&gt;

One solution I came up with is to lower the input threshold value by a
little bit. That is, if the user specifies "0.9", then treat it as the
value "0.899999". I know my denominator is at most 2**20 so I could
do:

&lt;pre class="code"&gt;
def adjust_input(threshold):
  if threshold &gt;= 0.0:
    return threshold - 0.5/2**20
  return threshold
&lt;/pre&gt;


&lt;/P&gt;&lt;P&gt;

That "0.5/2**20" is somewhat ad hoc. Even better, I could use the C99
function "nextafter" ("_nextafter" in Visual C) and lower the floating
point value by a &lt;a
href="http://en.wikipedia.org/wiki/Unit_in_the_last_place"&gt;ulp&lt;/a&gt;,
the smallest possible step between two floating-point values. (Python doesn't implement
nextafter in the core libraries but there are &lt;a
href="http://stackoverflow.com/questions/6063755/increment-a-python-floating-point-value-by-the-smallest-possible-amount"&gt;other
solutions&lt;/a&gt;.)
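
&lt;/P&gt;&lt;P&gt;

For a 32-bit float, the effect of nextafter can be sketched with the
"struct" module by reinterpreting the float's bits as an integer. This
is an illustration under stated assumptions, not a replacement for the
C function: it only handles positive, finite values, and the helper
names are hypothetical.

&lt;pre class="code"&gt;
import struct

def as_float32(value):
    # Round a Python float (C double) to the nearest C float
    return struct.unpack("=f", struct.pack("=f", value))[0]

def float32_one_ulp_down(value):
    # For positive finite IEEE 754 floats, the bit pattern read as an
    # unsigned integer increases with the value, so subtracting 1
    # gives the next representable float below it.
    if not value > 0.0:
        raise ValueError("this sketch only handles positive values")
    bits = struct.unpack("=I", struct.pack("=f", value))[0]
    return struct.unpack("=f", struct.pack("=I", bits - 1))[0]

x = as_float32(0.9)
lower = float32_one_ulp_down(x)
print("%.20f" % x)
print("%.20f" % lower)
print(x - lower == 2.0 ** -24)   # True: one ulp for floats in [0.5, 1)
&lt;/pre&gt;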

&lt;/P&gt;
&lt;h2&gt;Possible side effects&lt;/h2&gt;
&lt;P&gt;

This solution works for the "0.9" case. What are the side effects?

&lt;/P&gt;&lt;P&gt;

For one, about 50% of the time I will lower a threshold value which is
already low enough. This effect will only be visible to people who
specify an abnormally high number of decimals in the first place.

&lt;/P&gt;&lt;P&gt;

For another, my code is available as a library, and the parameter
takes a double. Do I assume that the input has already been nudged
down, or do I apply the nextafter to all of the inputs? I think I need
to apply it to all of the inputs.

&lt;/P&gt;&lt;P&gt;

What about the return values? I expect that users of the library will
expect the following to never raise an exception:

&lt;pre class="code"&gt;
threshold = 0.9
for hit in threshold_search(query, targets, threshold=threshold):
  if hit.score &amp;lt; threshold:
    raise AssertionError
&lt;/pre&gt;

This will fail 1/2 the time because the float value I return from C is
too small, compared to the double. I could make that invariant work by
using nextafter to increase the returned values by a ulp. I don't
think that will cause any problems.

&lt;/P&gt;&lt;P&gt;

That's a lot of "I think"s. I don't yet have any library users to get
feedback. The performance doesn't seem to be affected by using float or
double, and I'm not running into memory limitations. I'm therefore
going to revert all of my changes and keep things as doubles.

&lt;/P&gt;
&lt;h2&gt;String to float conversion&lt;/h2&gt;
&lt;P&gt;

The mishap occurs because the decimal value 0.9 cannot be written
exactly as a float or double, which both use base 2. The conversion
function (strtod or equivalent) instead returns the binary value which
is closest to the input decimal number. Similarly, the numbers shown
above are the 20-digit decimal numbers closest to that binary
number. The binary value can be slightly smaller than, equal to, or
slightly larger than the original input.

&lt;/P&gt;&lt;P&gt;

You can see this in action by using an "abnormally high number of
decimals." In the following, the score between the "nine" case and the
"ten" case is exactly 9/10. You can see that the threshold of
0.900000005 still allows scores of 0.9. This is because
0.900000005 and 0.9 are converted to the same internal number, and
because this is my experimental branch which uses floats instead of
doubles. If I increase the threshold to 0.900000006 then they are no
longer the same, and the threshold test fails.

&lt;pre class="code"&gt;
% cat x.fps 
FF01	nine
FF03	ten

% simsearch --queries x.fps x.fps --threshold 0.900000005
#Simsearch/1
#num_bits=16
#type=Tanimoto k=all threshold=0.900000005
#software=chemfp/1.1a1
#queries=x.fps
#targets=x.fps
2	nine	nine	1.000	ten	0.900
2	ten	nine	0.900	ten	1.000

% simsearch --queries x.fps x.fps --threshold 0.900000006
#Simsearch/1
#num_bits=16
#type=Tanimoto k=all threshold=0.900000006
#software=chemfp/1.1a1
#queries=x.fps
#targets=x.fps
1	nine	nine	1.000
1	ten	ten	1.000
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Let me say that this is not a serious problem. Other than people like
me, who use values like this to probe the internal workings of
software, I'm hard-pressed to think of who might be affected by this.

&lt;/P&gt;&lt;P&gt;

Nonetheless, it's still an intellectual sticking point. If for some
reason you specify a threshold of 0.900000004 then I really don't want
my software to return scores which I know are 9/10. How can I solve
this problem in the abstract?

&lt;/P&gt;&lt;P&gt;

I've come up with two solutions. In the first, adjust the input (using
decimal math) so it's more than 1 ulp away from any rational value
which could exist as a score. Easy to say, but I'm not sure how to do
that.

&lt;/P&gt;&lt;P&gt;

In the second, read the input into a rational (like Python's &lt;a
href="http://docs.python.org/library/fractions.html"&gt;fractions&lt;/a&gt;
module). Pass the numerator and denominator into the search functions,
instead of passing in a floating point value.
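
&lt;/P&gt;&lt;P&gt;

As a sketch of reading the input into a rational: "Fraction" accepts a
decimal string directly, so the threshold never passes through binary
floating point. The function name and the way the score is passed in
as two integers are made up for illustration.

&lt;pre class="code"&gt;
from fractions import Fraction

def passes_threshold(score_numerator, score_denominator, threshold_text):
    # Fraction("0.900000004") is exactly 900000004/1000000000 reduced
    # to lowest terms, with no rounding to float or double.
    threshold = Fraction(threshold_text)
    return Fraction(score_numerator, score_denominator) >= threshold

print(passes_threshold(9, 10, "0.9"))           # True
print(passes_threshold(9, 10, "0.900000004"))   # False
print(passes_threshold(10, 10, "0.900000004"))  # True
&lt;/pre&gt;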

&lt;/P&gt;&lt;P&gt;

My search code computes the score as the ratio of two integers, so
instead of doing:

&lt;pre class="code"&gt;
# old version
def count_targets(query, targets, threshold):
  num_targets = 0
  for target in targets:
    if compute_similarity(query, target) &gt;= threshold:
      num_targets += 1
  return num_targets
&lt;/pre&gt;

I can do

&lt;pre class="code"&gt;
# new version
def count_targets(query, targets, threshold_numerator, threshold_denominator):
  num_targets = 0
  for target in targets:
    score_numerator, score_denominator = compute_similarity(query, target)
    if (  score_numerator * threshold_denominator  &amp;gt;=
          threshold_numerator * score_denominator):
      num_targets += 1
  return num_targets
&lt;/pre&gt;


This will work, but who wants to use an API where you pass in the
threshold as the numerator and denominator, rather than a single
threshold value?

&lt;/P&gt;&lt;P&gt;

So yes, I'm going to stay with doubles throughout my entire system. A
consistent floating point representation just makes life easier.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2011/11/14/floats_and_doubles.html</guid><pubDate>Mon, 14 Nov 2011 12:00:00 GMT</pubDate></item><item><title>f2pypy</title><link>http://www.dalkescientific.com/writings/diary/archive/2011/11/09/f2pypy.html</link><description>&lt;P&gt;

There's a bit of discussion going on about the role of PyPy in
scientific computing with Python. I spent a few days of the last week
adding more ... shall I say "fuel?" ... to the discussion. I wrote a
new back-end to f2py called &lt;a
href="https://bitbucket.org/pypy/f2pypy"&gt;f2pypy&lt;/a&gt; which generates a
Python module that wraps a shared library using ctypes. The module
works (somewhat) with CPython, and does not work with PyPy because
there's no way yet to pass a pointer to the array data to a ctypes
function. (That's a minor detail which isn't hard to implement.)

&lt;/P&gt;&lt;P&gt;

What it shows is a real mechanism to get PyPy to support existing
Fortran libraries already supported by f2py definition files.

&lt;/P&gt;
&lt;h2&gt;NumPy isn't used in all scientific software&lt;/h2&gt;
&lt;P&gt;

There is definitely a place for PyPy in scientific computing even now.
There are entire branches of science which have little overlap with
the strengths of SciPy. I've been a full-time software developer for
computational chemistry for 16 years, and have only used NumPy a few
times.

&lt;/P&gt;&lt;P&gt;
One time I needed to compute the generalized inverse matrix. It was in
a command-line program called by another process, of all things, and
to my annoyance the "import numpy" on the cluster file system was
noticeably long. I forgot what the numbers were then, but the current
numpy import adds 145 (yes, 145!) modules to sys.modules, and 107 of
them start with "numpy." Our Lustre configuration did poorly with file
metadata, and I think it was over a second to do the import.
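
&lt;/P&gt;&lt;P&gt;

The module count is easy to reproduce. Here is a minimal sketch. It
uses the standard library's "xml.dom.minidom" as a stand-in so that it
runs without numpy installed, and "modules_added_by" is a made-up
name. Pass in "numpy" to repeat the measurement above.

&lt;pre class="code"&gt;
import sys

def modules_added_by(module_name):
    # Compare sys.modules before and after the import. A module
    # which was already imported adds nothing new.
    before = set(sys.modules)
    __import__(module_name)
    return sorted(set(sys.modules) - before)

added = modules_added_by("xml.dom.minidom")
print("%d new modules" % len(added))
print("%d start with 'xml.'" % sum(1 for name in added if name.startswith("xml.")))
&lt;/pre&gt;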

&lt;/P&gt;&lt;P&gt;

I brought this up on the numpy list. While they made some changes, it
was pointed out that I am not their target user. The "import numpy"
also does an "import numpy.testing" and "import numpy.ctypeslib" and
the other imports so that people can use numpy.submodule without an
extra explicit import line, and because most people working with numpy
are working in a long-lived interactive session or job, so the
startup performance isn't a problem.

&lt;/P&gt;&lt;P&gt;

I happen to disagree with their choice. "Explicit is better than
implicit" and all that. But my point is not to argue for them to
change but to give a specific example of how the goals of the NumPy
developers can be different than the goals of other scientific
programmers.

&lt;/P&gt;
&lt;h2&gt;How do I use Python in science research?&lt;/h2&gt;
&lt;P&gt;

A lot of what I do involves communicating with command-line
executables. These are often written by scientists, and most are
designed to be run directly by people, and not by other software. Most
of the time is spent in the executable, so it doesn't matter if I'm
using CPython or PyPy.

&lt;/P&gt;&lt;P&gt;

There are several cheminformatics libraries for Python. OpenBabel and
OEChem use SWIG bindings, RDKit uses Boost, and I don't know what
Indigo and Canvas use. Migrating these to PyPy will be hard. I hope
that someone is working on SWIG bindings, but it looks like the PyPy
developers don't want to commit to a C ABI. (See below.)

&lt;/P&gt;&lt;P&gt;

There's also code where there are no libraries, and for those I write
the code in Python, and sometimes my own extension in C. For some of
these cases the 3x and higher performance of PyPy would be great. I
also know a lot of ways for my CPython-based code to talk to
PyPy-based code.

&lt;/P&gt;&lt;P&gt;

I used to develop software for bioinformatics and structural biology,
and my observations still hold for those fields. One of the Biopython
developers, Peter Cock, writes:

&lt;blockquote&gt;
Regarding Biopython using NumPy, we're already trying it out under
PyPy. Large chunks of Biopython do not use NumPy at all, although
there a few problems on PyPy 1.6 (one due to a missing XML library,
bug filed), most of that seems to work."  &lt;a
href="http://blog.streamitive.com/2011/10/19/more-thoughts-on-arrays-in-pypy/#comment-49"&gt;[*]&lt;/a&gt;
&lt;/blockquote&gt;

He continues with a list of some of what doesn't work.


&lt;/P&gt;
&lt;h2&gt;Support for existing libraries&lt;/h2&gt;
&lt;P&gt;

That said, I know that a lot of people depend on Python bindings to
existing libraries. These use the C API directly, or through
auto-generated interfaces from f2py, Cython, Boost, SWIG, and many
more. There's been 10+ years to develop these tools for CPython, and
still very little time to adapt them to PyPy.

&lt;/P&gt;&lt;P&gt;

Relatively few extensions use the ctypes module, which is Python's
"other" mechanism for calling external functions. Unlike the C API,
this one is also portable across Jython, Iron Python, and
PyPy. Obviously, if everyone used ctypes then there wouldn't be a
problem. Why don't they?

&lt;/P&gt;&lt;P&gt;

One is the performance. Calling math.cos() is 8 times faster than
doing a LoadLibrary() of libm and calling cos() that way. This is of
course the worst case. But that's a CPython limitation. PyPy's ctypes
call interface is faster than CPython calling a C extension:

&lt;pre class="code"&gt;
% cat x.py
import ctypes
m = ctypes.cdll.LoadLibrary("/usr/lib/libm.dylib")
cos = m.cos
cos.argtypes = [ctypes.c_double]
cos.restype = ctypes.c_double

% python -mtimeit -s "from x import cos" "cos(0)"
1000000 loops, best of 3: 0.676 usec per loop
% python -mtimeit -s "from math import cos" "cos(0)"
10000000 loops, best of 3: 0.0811 usec per loop
% pypy -mtimeit -s "from x import cos" "cos(0)"
10000000 loops, best of 3: 0.0332 usec per loop
% pypy -mtimeit -s "from math import cos" "cos(0)"
100000000 loops, best of 3: 0.0047 usec per loop
&lt;/pre&gt;

although you can see it's still slower than using a built-in function.
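
&lt;/P&gt;&lt;P&gt;

The hard-coded "/usr/lib/libm.dylib" path above is specific to a
Mac. A more portable variation is sketched below. It assumes that
"ctypes.util.find_library" can locate the math library. That function
returns None when it cannot, in which case the sketch falls back to
the symbols already loaded into the Python process, which should work
on typical Unix builds but not on Windows.

&lt;pre class="code"&gt;
import ctypes
import ctypes.util

path = ctypes.util.find_library("m")
# CDLL(None) looks up symbols in the running process instead
libm = ctypes.CDLL(path) if path else ctypes.CDLL(None)

cos = libm.cos
cos.argtypes = [ctypes.c_double]
cos.restype = ctypes.c_double

print(cos(0.0))   # 1.0
&lt;/pre&gt;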

&lt;/P&gt;&lt;P&gt;

Another reason to not use ctypes is that C/C++ library authors do
interesting things with the API. One library I used has public API
functions like "dt_charge(atom)" to get the formal charge of an atom,
but used a number of #define statements to change those names to the
internal name. That example became "dt_e_charge". It also defined
certain constants only in the header files. This information isn't in
the shared library.

&lt;/P&gt;&lt;P&gt;

I know at least one vendor which only ships a static library, and not
a shared library. Apparently bad LD_LIBRARY_PATHs were such a support
headache that they decided it wasn't worth it. (I think they are
right.) There's no way to get ctypes to interface to a static library.

&lt;/P&gt;&lt;P&gt;

A fourth problem is lack of support for C++ templates. That clearly
needs a compiler, which ctypes doesn't do.

&lt;/P&gt;
&lt;h2&gt;PyPy needs a (semi-)stable C ABI; can you help?&lt;/h2&gt;
&lt;P&gt;

Based on the above, there will clearly always be a need for
compiler-based Python extensions, including PyPy extensions. That
means there needs to be some sort of ABI that those extensions can
program against.

&lt;/P&gt;&lt;P&gt;

I don't know what that would look like, and I think the PyPy
developers think it's still too early to stabilize on it. It may well
be; but I think it's because there's no one on the group who wants to
work on the task.

&lt;/P&gt;&lt;P&gt;

They were more than happy last year to show a proof-of-concept
interface from PyPy to C++ using the run-time type information added
by the &lt;a href="http://root.cern.ch/drupal/content/reflex"&gt;Reflex&lt;/a&gt;
system. (Yeah, I had never heard of it either.) So they have nothing
against working with an existing ABI. Do you want to offer one?

&lt;/P&gt;&lt;P&gt;

I wrote "semi-" in the title because it wasn't until Python 3.2 that
CPython got a stable ABI. PyPy notably does have emulation support for
some of the CPython 2.x ABI but there are problems. Some modules use
the ABI incorrectly, and it works for implementation-specific
reasons. (For example, bad reference counts.)

&lt;/P&gt;&lt;P&gt;

If you are going to work on this, I think it would make sense to
target the 3.2 ABI and to include instrumentation to help identify
these problems.

&lt;/P&gt;&lt;P&gt;

The best for me would be if you develop some SWIG/ABI interface. This
might just be to produce a bunch of stub functions and a ctypes
definition for them. (Hmm, wasn't there a C++ to C SWIG interface?)

&lt;/P&gt;
&lt;h2 name="punchline"&gt;f2pypy: Experimental Fortran bindings&lt;/h2&gt;
&lt;P&gt;

The above is talk and hand-waving. Code's also good. There was a PyPy
sprint this week and I decided to join in for a few days and prototype
an idea I've been thinking about: &lt;a
href="https://bitbucket.org/pypy/f2pypy"&gt;f2pypy&lt;/a&gt;. It's a variation
of f2py which generates Python ctypes bindings which PyPy could use to
talk with shared libraries implemented in Fortran.

&lt;/P&gt;&lt;P&gt;

At the end of several days of work, I got f2pypy to generate a Python
module based on the "fblas.pyf" code from SciPy. I could import that
library in CPython and (for the few functions I tested) get answers
which matched the fblas module in SciPy. I could also use pypy to call
&lt;i&gt;some&lt;/i&gt; of the functions, but PyPy's "numpy" implementation is not
mature enough. Its array objects don't yet support the ctypes interface,
so I was unable to call out to the shared library. I could only call
the scalar-based functions.

&lt;/P&gt;&lt;P&gt;

The code is definitely incomplete. Even my CPython-based tests fail
some of the "test_blas.py" tests from SciPy (I don't implement "cblas"
and I think one of the tests depends on Fortran order instead of C
order.)  It's a proof-of-concept which shows that this approach is
definitely viable, and it shows some of the difficulties in the
approach.

&lt;/P&gt;&lt;P&gt;

My point though is that it opens new possibilities which aren't
available in NumPy. For example, suppose you want to use one of the
BLAS functions in your code. Every Mac includes a copy of BLAS as a
built-in library. Instead of making people install SciPy, what about
shipping the ctypes module description instead, and using that
interface? You can ship pure Python code and still take advantage of
platform-optimized libraries!

&lt;/P&gt;&lt;P&gt;

I earlier highlighted the performance problems in CPython's ctypes
interface. But this is PyPy. They already have &lt;a
href="http://morepypy.blogspot.com/2011/02/pypy-faster-than-c-on-carefully-crafted.html"&gt;cross-module
optimizations&lt;/a&gt; for Python calling Python. There's no reason why
those can't apply to ctypes-based functions. (Or perhaps it's already
there? I've not tested that.)

&lt;/P&gt;
&lt;h2&gt;How does it work?&lt;/h2&gt;
&lt;P&gt;

Fortran bindings are nice because they don't have the same
preprocessor tricks that I mentioned earlier. Pearu Peterson wrote the
excellent &lt;a href="http://cens.ioc.ee/projects/f2py2e/"&gt;f2py&lt;/a&gt;
package starting some 10+ years ago. It has several ways to work with
Fortran code. The one I used was to start with a "pyf" definition file
and generate Python code using a new back-end.

&lt;/P&gt;&lt;P&gt;

I figured out how to get SciPy to generate the pyf file for the BLAS
library. (The SciPy source uses a template language during the build
process to generate the actual code.) I used f2py's "crackfortran"
module to parse the pyf file and get the AST. It's a small tree so
perhaps I should call it an abstract syntax bush.

&lt;/P&gt;&lt;P&gt;

The f2py code generates the Python/C extension code based on the
AST. My f2pypy code is basically another back-end, which generates
ctypes-based code in Python.

&lt;/P&gt;&lt;P&gt;

The trickiest part was support for C code. Some of the pyf definition
lines contain embedded C code. Here I've gathered three examples:

&lt;pre class="code"&gt;
integer optional, intent(in),check(incx&gt;0||incx&lt;0) :: incx = 1
integer optional,intent(in),depend(x,incx,offx,y,incy,offy) :: n = (len(x)-offx)/abs(incx)
callstatement (*f2py_func)((trans?(trans==2?"C":"T"):"N"),&amp;m,&amp;n,&amp;alpha,a,&amp;m,x+offx,&amp;incx,&amp;beta,y+offy,&amp;incy)
&lt;/pre&gt;

I used Fredrik Lundh's wonderful essay on &lt;a
href="http://effbot.org/zone/simple-top-down-parsing.htm"&gt;Simple
Top-Down Parsing in Python&lt;/a&gt; to build a simple C expression parser,
which builds another AST. With a bit of AST manipulation, and symbol
table knowledge (I need to know which inputs are scalars and which are
vectors), I could generate output strings like:

&lt;pre class="code"&gt;
def srot(..., incx = None, ...):
  ...
  if incx is None:
    incx = _ct.c_int(1)
  else:
    incx = _ct.c_int(incx)
  if not ((((incx.value) &gt; 0) or ((incx.value) &lt; 0))):
    raise ValueError('(incx&gt;0||incx&lt;0) failed for argument incx: incx=%s' % incx.value)
&lt;/pre&gt;

and the more complicated:

&lt;pre class="code"&gt;
_api_cgemv((("c") if (((trans.value) == 2)) else ("t")) if ((trans.value)) else
("n"), (m), (n), (alpha), a.ctypes.data_as(_ct.POINTER(_complex_float)), (m),
(x if ((offx.value)) == 0 else x[(offx.value):]).ctypes.data_as(_ct.POINTER(_complex_float)),
(incx), (beta),
(y if ((offy.value)) == 0 else y[(offy.value):]).ctypes.data_as(_ct.POINTER(_complex_float)),
(incy))
&lt;/pre&gt;
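
To give a feel for the parsing step, here is a heavily cut-down sketch
in the style of that essay. It is not the f2pypy parser: it handles
only names, integers, string literals, parentheses, a few binary
operators, and the C "?:" ternary, and it emits a Python expression
string directly instead of building an AST. All of the names are made
up for this example.

&lt;pre class="code"&gt;
import re

# number | name | string literal | operator
TOKEN = re.compile(r'\s*(?:(\d+)|([A-Za-z_]\w*)|("[^"]*")|(\|\||==|[-+()?:>]))')

# C operator -> (binding power, Python spelling)
BINARY = {"||": (10, "or"), "==": (30, "=="), ">": (30, ">"),
          "+": (40, "+"), "-": (40, "-")}
TERNARY_POWER = 5   # "?:" binds more loosely than any binary operator

def tokenize(text):
    tokens = []
    pos = 0
    text = text.strip()
    while pos != len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("cannot tokenize %r at position %d" % (text, pos))
        tokens.append(match.group(match.lastindex))
        pos = match.end()
    tokens.append(None)   # end-of-input marker
    return tokens

class Parser(object):
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def advance(self, expected=None):
        token = self.tokens[self.pos]
        if expected is not None and token != expected:
            raise ValueError("expected %r but found %r" % (expected, token))
        self.pos += 1
        return token

    def expression(self, min_power=0):
        token = self.advance()
        if token == "(":
            left = self.expression()
            self.advance(")")
        elif token is None or token in BINARY or token in ("?", ":", ")"):
            raise ValueError("unexpected token %r" % (token,))
        else:
            left = token   # a number, name, or string literal
        while True:
            op = self.tokens[self.pos]
            if op in BINARY and BINARY[op][0] > min_power:
                power, python_op = BINARY[op]
                self.advance()
                left = "(%s %s %s)" % (left, python_op, self.expression(power))
            elif op == "?" and TERNARY_POWER > min_power:
                self.advance()
                if_true = self.expression()
                self.advance(":")
                if_false = self.expression(TERNARY_POWER - 1)
                left = "(%s if %s else %s)" % (if_true, left, if_false)
            else:
                return left

def c_to_python(text):
    parser = Parser(text)
    result = parser.expression()
    if parser.tokens[parser.pos] is not None:
        raise ValueError("unexpected token %r" % (parser.tokens[parser.pos],))
    return result

print(c_to_python('trans?(trans==2?"C":"T"):"N"'))
print(c_to_python("incx>0||n==0"))
&lt;/pre&gt;

The first call prints (("C" if (trans == 2) else "T") if trans else "N")
and the second, a made-up check, prints ((incx > 0) or (n == 0)). The
real generated code also has to wrap names in ".value" lookups and
treat vectors differently from scalars, which is where the symbol
table comes in.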

&lt;/P&gt;&lt;P&gt;

I definitely do not generate optimized code. I decided to work
completely in terms of ctypes scalars and numpy arrays, even for the
check() statements. PyPy doesn't optimize that yet, and I think
someone else could do a better job by only doing the conversion as
part of the call to the Fortran code.


&lt;/P&gt;
&lt;h2&gt;Usage&lt;/h2&gt;
&lt;P&gt;

To generate the new module on a Mac (I don't know the shared library
name for other OS installations):

&lt;pre class="code"&gt;
  $PYTHON -m f2pypy tests/fblas.pyf -l vecLib --skip cdotu,zdotu,cdotc,zdotc
&lt;/pre&gt;

This generates "fblas.py". I have some test code for that module

&lt;pre class="code"&gt;
% python test_fblas.py
...........F...
======================================================================
FAIL: test_srot_overwrite (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_fblas.py", line 116, in test_srot_overwrite
    assert x is x2
AssertionError

----------------------------------------------------------------------
Ran 15 tests in 0.006s

FAILED (failures=1)
&lt;/pre&gt;

This says that "numpy.array(.. copy=False)" makes a new reference,
while the internal code f2py uses passes back the same object, so a
real implementation will need to handle that detail.

&lt;/P&gt;&lt;P&gt;

Here's the same output from pypy:

&lt;pre class="code"&gt;
% pypy test_fblas.py
.EE.EEEFEEEE.E.
======================================================================
ERROR: test_dnrm2 (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_fblas.py", line 166, in test_dnrm2
    E(m.dnrm2(x), float((numpy.array([1+1+16+81], "d")**0.5)))
  File "/Users/dalke/cvses/f2pypy/fblas.py", line 1192, in dnrm2
    return _api_dnrm2((n), (x if ((offx.value)) == 0 else x[(offx.value):]).ctypes.data_as(_ct.POINTER(_ct.c_double)), (incx))
AttributeError: 'numarray' object has no attribute 'ctypes'

======================================================================
ERROR: test_drot (__main__.CBlasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_fblas.py", line 153, in test_drot
    E(m.drot(1,2,3,4), (numpy.array(11.0, dtype="d"), numpy.array(2.0, dtype="d")))
  File "/Users/dalke/cvses/f2pypy/fblas.py", line 181, in drot
    y = _np.array(y, 'd', copy=overwrite_y)
TypeError: __new__() got an unexpected keyword argument 'copy'
 ....
&lt;/pre&gt;

See the dots up there with the "E"rrors? That means 4 of the
scalar-based tests pass. What fails is the vector-based code. The
generated code uses the "ctypes" attribute to get the pointer to the
numpy array data, which PyPy's arrays don't have, and PyPy doesn't
support the "copy" parameter of numpy.array.

&lt;/P&gt;&lt;P&gt;

Still, it does pass some tests!

&lt;/P&gt;
&lt;h2&gt;Future&lt;/h2&gt;
&lt;P&gt;

I don't use Fortran modules. I don't use f2py. I don't use numarray. I
will not be involved in this project for the future. (I do a lot of
integration work, and I do a lot of parsing and AST transformations,
so that part of this effort was a pretty good fit!)

&lt;/P&gt;&lt;P&gt;

I did this because I wanted to show that PyPy can support traditional
numeric software libraries and that there is a relatively doable path
for migration from existing numpy code to "numpypy" code.

&lt;/P&gt;&lt;P&gt;

I will not be maintaining the project in the future. If you want to
take it on, feel free. I've contributed it to the PyPy project, and it
has &lt;a href="https://bitbucket.org/pypy/f2pypy"&gt;its own
repository&lt;/a&gt;. Feel free also to &lt;a
href="http://dalkescientific.blogspot.com/2011/11/f2pypy.html"&gt;leave a
comment&lt;/a&gt; or &lt;a href="mailto:dalke@dalkescientific.com"&gt;ask me
questions&lt;/a&gt;.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2011/11/09/f2pypy.html</guid><pubDate>Wed, 09 Nov 2011 12:00:00 GMT</pubDate></item></channel></rss>
