
Performance tuning

Another interesting thing I found while doing my timing tests: when I change the following to use Python's gzip reader instead of the system 'gzcat'

    f = os.popen("gzcat /Users/dalke/databases/flybase/dmel-2L-r4.2.1.gff.gz")
    #f = gzip.open("/Users/dalke/databases/flybase/dmel-2L-r4.2.1.gff.gz")
my parse time goes from 47 seconds to 73 seconds. Guess I won't be using the built-in gzip module if I can get away with it! Looks like the problem is slow iteration performance: iterating over a GzipFile goes through gzip.GzipFile.readline, which reads the data in many small chunks and is very, very slow:
    def readline(self, size=-1):
        if size < 0: size = sys.maxint
        bufs = []
        readsize = min(100, size)    # Read from the file in small chunks
        while True:
            if size == 0:
                return "".join(bufs) # Return resulting line

            c = self.read(readsize)
            i = c.find('\n')
            if size is not None:
                # We set i=size to break out of the loop under two
                # conditions: 1) there's no newline, and the chunk is
                # larger than size, or 2) there is a newline, but the
                # resulting line would be longer than 'size'.
                if i==-1 and len(c) > size: i=size-1
                elif size <= i: i = size -1

            if i >= 0 or c == '':
                bufs.append(c[:i+1])    # Add portion of last chunk
                self._unread(c[i+1:])   # Push back rest of chunk
                return ''.join(bufs)    # Return resulting line

            # Append chunk to list, decrease 'size',
            bufs.append(c)
            size = size - len(c)
            readsize = min(size, readsize * 2)
If I replace the f = os.popen("gzcat ...") call with f = open(filename).readlines() on an already-uncompressed copy of the file, my time is 45 seconds. This means the overhead of using an external process for the uncompression is a negligible 2 seconds, compared to the 28 seconds added by the gzip module.
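Here is roughly what the three variants look like side by side. This is only a sketch: the uncompressed path is my guess at where a gunzipped copy would live, and the timings above come from the full parser, not from this snippet.

    import os, gzip

    filename = "/Users/dalke/databases/flybase/dmel-2L-r4.2.1.gff.gz"

    # 1) read an already-uncompressed copy directly (the 45 second baseline)
    #f = open(filename[:-len(".gz")])

    # 2) let an external gzcat process do the uncompression (about +2 seconds)
    f = os.popen("gzcat " + filename)

    # 3) use Python's gzip module (about +28 seconds)
    #f = gzip.open(filename)

    for line in f:
        pass    # the real parser handles each line here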

Many of the fields in a GFF3 file can be escaped using the URL %hex convention. Few of the fields actually are, so I added a quick test for "%" in the string before incurring the function call overhead. According to the profiler the unescape function was still being called a lot. I tracked it down to the attributes section, which is a key/value table where the values may be multi-valued; I was making a function call for every single value. Now I check for the common case, where I don't need to unescape the value at all. That got me down to about 32 seconds for my test set.
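The check is essentially this (a minimal sketch; the helper name is mine, not from the real parser):

    import urllib

    def maybe_unquote(value):
        # Fast path: most GFF3 attribute values contain no %xx escapes,
        # so skip the urllib.unquote call entirely for them.
        if "%" in value:
            return urllib.unquote(value)
        return value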

I then did an evil hack. Instead of calling the constructor, which goes through both a __new__ call and a Python-level __init__ call, I called __new__ directly and set the attributes myself.

##    return GFF3Feature(seqid, source, type, start, end,
##                       score, strand, phase, attributes)
    # This hack gives me about 5% better performance because
    # I skip the extra __init__ call
    x = GFF3Feature.__new__(GFF3Feature)
    x.seqid = seqid
    x.source = source
    x.type = type
    x.start = start
    x.end = end
    x.score = score
    x.strand = strand
    x.phase = phase
    x.attributes = attributes
    return x
As the comment says, I got about a 5% overall speedup from that. In my fastest implementation, with all the hacks and tricks including the split change I'm about to talk about, I can process 566,119 lines in 25.3 seconds. Putting the normal constructor call, and hence the __init__, back in pushes the time to 27 seconds, which is a bit over 6% slower.
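For context, a minimal stand-in for the class being constructed might look like the following. This is my sketch, not the real GFF3Feature, which surely does more; note that it must be a new-style class for the __new__ trick above to work.

    class GFF3Feature(object):
        def __init__(self, seqid, source, type, start, end,
                     score, strand, phase, attributes):
            # This is exactly the per-instance setup the hack above
            # does by hand after calling __new__ directly.
            self.seqid = seqid
            self.source = source
            self.type = type
            self.start = start
            self.end = end
            self.score = score
            self.strand = strand
            self.phase = phase
            self.attributes = attributes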

I used line = line.rstrip("\n") to remove the terminal newline. The profiler said rstrip was kind of slow. Micro-timing it:

% python -m timeit 's="Hello!\n"' 'if s[-1] == "\n": s=s[:-1]'
1000000 loops, best of 3: 1.2 usec per loop
% python -m timeit 's="Hello!\n";s=s.rstrip("\n")'
1000000 loops, best of 3: 1.72 usec per loop
That's a shame. I don't like having if statements like that; I would rather have a "chomp()" method which implements this logic. On the other hand, for my test data set that's only 0.3 seconds out of (at final reckoning) 25 seconds, so about a 1% speedup.
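If I did factor it out, the helper would be something like this (a sketch of the chomp() I would like to have, not code from the parser):

    def chomp(line):
        # Remove a single trailing newline, if present.
        if line and line[-1] == "\n":
            return line[:-1]
        return line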

The profiler now claims the slowest code is 'split'.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    99995   11.960    0.000   21.160    0.000 gff3_parser.py:220(parse_gff3_line)
   397148    7.640    0.000    7.640    0.000 :0(split)
   100001    3.970    0.000   25.130    0.000 gff3_parser.py:392(parse_gff3)
        1    1.880    1.880   27.010   27.010 gff3_parser.py:722(go)
    99995    1.520    0.000    1.520    0.000 :0(__new__)
      131    0.010    0.000    0.040    0.000 :0(map)
      272    0.010    0.000    0.010    0.000 :0(join)
      272    0.010    0.000    0.030    0.000 urllib.py:1055(unquote)
                           ...
I've removed as many split calls as I can. My profile test set works over the first 100,000 lines of the file, which means about 4 splits per line. Why? Ahh, each line contains a list of key/value attributes of the form:
ID=CG11023-RA-u5;Dbxref=FlyBase:FBgn0031208;Parent=CG11023-RA
Every line has at least one such attribute, and most lines have several. I split on ';' and then split each field on '=' to get the key and value. It turns out that
i = field.index("=")
d[field[:i]] = field[i+1:]
is faster than
k, v = field.split("=", 1)
d[k] = v
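Putting that together, the attribute-column parsing is essentially the following. This is a simplified sketch: it ignores the unescaping and multi-valued attributes discussed earlier, and the function name is mine.

    def parse_attributes(attributes_column):
        # e.g. "ID=CG11023-RA-u5;Dbxref=FlyBase:FBgn0031208;Parent=CG11023-RA"
        d = {}
        for field in attributes_column.split(";"):
            i = field.index("=")
            d[field[:i]] = field[i+1:]
        return d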
I looked at Python's implementation of split. Here is the heart of it:
	for (i = j = 0; i < len; ) {
		if (s[i] == ch) {
			if (maxcount-- <= 0)
				break;
			SPLIT_APPEND(s, j, i);
			i = j = i + 1;
		} else
			i++;
	}
	if (j <= len) {
		SPLIT_APPEND(s, j, len);
	}
	return list;
For my most common case this requires two reallocs, because I have 9 elements per line and the list growth code in listobject.c says:
   * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
I implemented an alternative which scans the string first to count how many split characters it contains, and uses that count to build a result list of the correct size. That sped up my code by a few percent, which made it only about the same speed as the .index() approach.
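In Python terms the idea looks like this. It is an illustration only: the real change was to the C code in stringobject.c, and a pure-Python version like this is of course not faster than the built-in split.

    def split_presized(s, sep):
        # First pass: count the separators so the result list can be
        # created at its final size.  Second pass: slice out the pieces.
        n = s.count(sep)
        parts = [None] * (n + 1)
        start = 0
        for k in range(n):
            end = s.index(sep, start)
            parts[k] = s[start:end]
            start = end + len(sep)
        parts[n] = s[start:]
        return parts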

My solution isn't a good general purpose one, though. Ideally the string would be scanned only once, recording the breakpoints along the way. Suppose the string is in shared memory, or memory mapped to a file, and it changes underneath the split. Bad idea, but possible. One advantage of the current scheme is that it will never cause problems even if the characters are mutated while it's working; my scheme can, because it requires a double scan. Mine also has worse cache behavior because it reads long strings twice.

Other solutions: only try the speedup if the string is less than, say, 200 bytes, which will be the frequent case. Use a local array of, say, 10 offset values. Find the split points and save them to that array. If there are fewer than 10, preallocate the result list at the right size and use the stored split points when making the substrings. If there are more than 10 points, or the string is too long, fall back to the existing method, where the realloc cost is less of a problem.

This is interesting: split without a maxsplit is faster than split with a maxsplit

% python -m timeit 's="xyz=123"; a,b=s.split("=")'
100000 loops, best of 3: 2.95 usec per loop
% python -m timeit 's="xyz=123"; a,b=s.split("=", 1)'
100000 loops, best of 3: 3.34 usec per loop
That difference accounts for about 0.2 seconds of the 25-second run time.

However, I'm now at the 1% speedup point, where I need multiple runs to make sure I'm seeing real speedups and not just system variance. Going from 47 seconds to 25 seconds is nearly a factor of two. I think that's good enough for now. If I need more later, I guess I'll have to look at C.

Anyone want to spend time improving the performance of Python's native split function? If Raymond Hettinger's 'partition' gets in, that could speed up parsing the "key=value" terms, but it won't handle my need to split nearly every line on tab to get the 9 fields.
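Assuming partition keeps the proposed semantics of returning the text before the separator, the separator itself, and the text after it, the key/value code above would become something like:

    k, sep, v = field.partition("=")
    d[k] = v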

