/home/writings/diary/archive/2007/10/10/other_wide_finder_implementations

other Wide Finder implementations

Beating the wide finder topic to death, I compared my code timings with other implementations. These were done with Python 2.5, Ruby 1.8.2, and gcc 4.0.1 on the same machine, timed with "time" and using the wall-clock time. I have a 2.33 GHz MacBook Pro laptop which was not plugged in when I did the tests; that explains why the numbers I get now are bigger than the ones I reported before. In all cases I ran the test code twice and report the smaller of the two times, though they never varied by more than 0.01 second. The input test set contains 1,000,000 lines and 200,995,500 bytes, as generated by Fredrik's code.

Implementation                                      time (s)
Python, mxTextTools                                     0.75
command-line tools (see below)                          0.82
Python, simple tally plus post-processing filter        1.0
C, wf3.c by Ilmari Heikkinen                            1.2
Python, regex on a mmap                                 1.4
Ruby, parallel wide finder                              1.7
Ruby, Tim Bray's original code                          2.3
Python, dehÓra's version with simple speedups           3.3
Python, dehÓra's version                                4.0

It's interesting that two Python programs are faster than the C program, though please do note that it's only that - interesting. It doesn't say much about the absolute performance of a given language. Indeed, since the Python implementation I'm using is itself written in C, the performance is not directly a function of the language.

Instead, I think it's because those two fast Python programs use search implementations based on variants of Boyer-Moore, while the C code does a character-by-character search with only a few simple performance tweaks, and the implementations which filter using regular expressions end up using a more general-purpose-but-slower tool.
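To show the difference in approach (this is an illustration, not one of the timed implementations), here is a character-by-character search like the C code's inner loop. Python's built-in str.find gives the same answers but, as of Python 2.5, uses Fredrik Lundh's fast search, a Boyer-Moore/Horspool-style algorithm that can skip ahead by more than one character at a time:

```python
def naive_find(text, pattern):
    # Character-by-character search: try every starting offset
    # and compare until a mismatch, like the straightforward C loop.
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i
    return -1

# str.find returns the same offsets, just computed with far
# fewer character comparisons on long inputs.
```

The naive loop does O(n*m) comparisons in the worst case; the skip-table approach often touches only a fraction of the text, which matters when scanning 200 MB of log data.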

In theory the regular expression engine could look at the pattern, determine that the prefix is a fixed string, and use a better search algorithm for it. It may even do some of that now. But that's a lot of work to implement and maintain, and since most of my regex parsing is on lines under 80 or so characters and I mostly use "match" instead of "search", it's a feature which I know I don't often need.
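The optimization can also be done by hand, outside the regex engine. Here is a sketch of the idea, using the Wide Finder pattern: pull the fixed literal prefix out of the pattern, locate candidate positions with the fast built-in string search, and only then ask the regex engine to verify a match anchored at that position. The function name is mine, not from any of the timed implementations:

```python
import re

# The pattern's fixed literal prefix, extracted by hand.
PREFIX = "GET /ongoing/When/"
pattern = re.compile(r"GET /ongoing/When/\d{3}x/\d{4}/\d{2}/\d{2}/[^ .]+ ")

def prefix_search(text, start=0):
    # str.find locates the literal prefix quickly; pattern.match
    # with a pos argument then checks the rest of the pattern
    # anchored at exactly that offset.
    pos = text.find(PREFIX, start)
    while pos != -1:
        m = pattern.match(text, pos)
        if m:
            return m
        pos = text.find(PREFIX, pos + 1)
    return None
```

This trades one pass by a general-purpose engine for a fast literal scan plus anchored verification, which is roughly what a prefix-aware regex engine would do internally.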

I thought about implementing something using flex or ragel or re2c. I even started sketching out some code. I stopped when I realized that those are not search programs - they are tokenizers and expect the current position to match exactly. I don't want to implement some Boyer-Moore-like algorithm myself, especially since I don't need this code, the Python version works very well, and the command-line version is:

egrep 'GET /ongoing/When/[0-9]{3}x/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+ ' o1000k.ap | 
      awk '{print $7}' | sort | uniq -c | sort -n | tail -10
which runs in 0.82 seconds.
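For comparison, the core of the "simple tally" approach from the table can be sketched in a few lines of Python. This is a sketch in my own naming, not the timed implementation; it does in-process what the egrep/awk/sort/uniq pipeline does with four programs:

```python
import re
from collections import defaultdict

# Same pattern as the egrep command, with the fetched path captured.
pat = re.compile(r"GET (/ongoing/When/\d{3}x/\d{4}/\d{2}/\d{2}/[^ .]+) ")

def top_hits(lines, n=10):
    # Tally matching paths, then keep the n most frequently fetched.
    counts = defaultdict(int)
    for line in lines:
        m = pat.search(line)
        if m:
            counts[m.group(1)] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

The tally replaces "sort | uniq -c", and the final sorted slice replaces "sort -n | tail -10".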

Any Comments?


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB