Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2008/08/08/consume_memory

Read 0 bytes, run out of memory

I've been working on a file format for chemical fingerprints, influenced by the PNG file format. To make sure I'm doing it right, I wrote a program to dump blocks from PNG files. I made a mistake and my program gave a MemoryError. How did that happen when my test file is only a few K long?

I tracked it down. I don't know if it's a bug. Here's something for you all to ponder over:

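BLOCKSIZE = 10*1024*1024  # ask for 10 MB on every read

f = open("empty.txt", "w")  # create an empty file
f.close()

f = open("empty.txt")
data = []
for i in range(10000):
    s = f.read(BLOCKSIZE)  # the MemoryError is raised here
    assert len(s) == 0  # nothing to read, so s is always ""
    data.append(s)  # keep every result alive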
That's an empty file, so the read() must return empty strings. The assert statement verifies that that's the case. But when I run it I get:
Python(18996) malloc: *** vm_allocate(size=10489856) failed (error code=3)
Python(18996) malloc: *** error: can't allocate region
Python(18996) malloc: *** set a breakpoint in szone_error to debug
Traceback (most recent call last):
  File "mem_fill.py", line 9, in <module>
    s = f.read(BLOCKSIZE)
MemoryError
The reason why is in the C implementation of read (Objects/fileobject.c). The relevant line is:
        v = PyString_FromStringAndSize((char *)NULL, buffersize);
That preallocates space assuming the requested read size will be correct. In my example code it preallocates 10MB of space even though the result is 0 bytes long. Since I keep the result around, all of the preallocated space is also kept around. Repeat that 10,000 times and my machine quickly runs out of memory. So will yours.
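To put rough numbers on "quickly", here is a back-of-the-envelope sketch. It assumes every retained result keeps its full allocation alive, and it assumes a 32-bit process limited to 4 GiB of address space; that limit is an assumption for illustration, not something stated above. The allocation size is the one reported in the vm_allocate error message.

```python
BLOCKSIZE = 10 * 1024 * 1024   # bytes requested per read, from the example
N_READS = 10000                # loop count, from the example
FAILED_ALLOC = 10489856        # size reported in the vm_allocate error

total_bytes = BLOCKSIZE * N_READS
total_gib = total_bytes / 2.0 ** 30

# The failed allocation is slightly larger than the requested read size.
overhead = FAILED_ALLOC - BLOCKSIZE

# Assumption: a 32-bit process, so at most 4 GiB of address space.
ADDRESS_SPACE = 4 * 2 ** 30
max_blocks = ADDRESS_SPACE // FAILED_ALLOC

print(total_bytes)   # 104857600000
print(total_gib)     # 97.65625
print(overhead)      # 4096
print(max_blocks)    # 409
```

Under those assumptions the loop would need almost 98 GiB in total, and it could fail after only a few hundred iterations.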

Bug in Python? Correct behavior? You decide. Feel free to make comments if you wish.

Update 9 August: I submitted this as issue3531 in the Python bug tracker. Antoine Pitrou pointed out that there's a string resize at the end of the function, so my assumption about the source of the problem was wrong. I dug into it some more and tracked the problem down to PyObject_Realloc in obmalloc.c. The resize ends up calling the C function 'realloc', and it didn't seem to be reallocing.

With those clues I did some searching and found Bob Ippolito's blog post "realloc.. doesn't": "Apparently, Darwin's implementation of realloc never frees memory if you ask for a smaller size." There's a thread about this from January 2005 on python-dev titled "Darwin's realloc(...) implementation never shrinks allocations." There's additional discussion in issue1092502.
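One way to avoid depending on realloc to shrink anything is to read into a single reusable buffer and copy out only the bytes that actually arrived. The following is a minimal sketch of that idea in Python 3, not something from the bug report or the python-dev thread; read_block is a hypothetical helper name, and an in-memory stream stands in for the empty file.

```python
import io

def read_block(f, buf):
    """Read up to len(buf) bytes from f into buf.

    Returns a new bytes object sized to the data actually read, so the
    retained object is allocated at its final size and never shrunk.
    """
    n = f.readinto(buf) or 0   # readinto may return None on non-blocking streams
    return bytes(memoryview(buf)[:n])

# One large buffer, allocated once and reused for every read.
buf = bytearray(1024 * 1024)
f = io.BytesIO(b"")            # stands in for the empty file
data = [read_block(f, buf) for _ in range(1000)]
print(len(data), sum(len(s) for s in data))
```

The cost is one extra copy per read. In exchange, each kept result is allocated at exactly the size of its data.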

I've been looking through various libraries to see if this causes a problem in real life. A denial-of-service attack is possible, but only for formats where the data itself says how many bytes to read. PNG is one of many such formats. For the attack to work, the reader code has to trust that number and use it blindly, and any hardened reader already treats those values with suspicion. For example, the PNG reader in PIL uses ImageFile.py:_safe_read:

# Reads large blocks in a safe way.  Unlike fp.read(n), this function
# doesn't trust the user.  If the requested size is larger than
# SAFEBLOCK, the file is read block by block.
#
# @param fp File handle.  Must implement a read method.
# @param size Number of bytes to read.
# @return A string containing up to size bytes of data.

def _safe_read(fp, size):
    if size <= 0:
        return ""
    if size <= SAFEBLOCK:
        return fp.read(size)
    data = []
    while size > 0:
        block = fp.read(min(size, SAFEBLOCK))
        if not block:
            break
        data.append(block)
        size = size - len(block)
    return string.join(data, "")
which does not end up with un-realloc'ed blocks.
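To show how the same idea applies to a block dumper, here is a minimal sketch of a PNG chunk reader written for Python 3. It is not PIL's code. MAX_BLOCK, safe_read, and iter_chunks are hypothetical names, and the 64 KB cap is an arbitrary choice playing the role of SAFEBLOCK. It relies only on the standard PNG chunk layout: a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over the type and data.

```python
import io
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_BLOCK = 64 * 1024  # hypothetical cap on any single read request

def safe_read(fp, size):
    """Read exactly size bytes, at most MAX_BLOCK at a time, or raise."""
    parts = []
    remaining = size
    while remaining > 0:
        block = fp.read(min(remaining, MAX_BLOCK))
        if not block:
            raise ValueError("truncated: %d bytes missing" % remaining)
        parts.append(block)
        remaining -= len(block)
    return b"".join(parts)

def iter_chunks(fp):
    """Yield (type, data) for each chunk up to and including IEND."""
    if safe_read(fp, 8) != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    while True:
        length, ctype = struct.unpack(">I4s", safe_read(fp, 8))
        data = safe_read(fp, length)  # the claimed length is never trusted
        (crc,) = struct.unpack(">I", safe_read(fp, 4))
        if crc != zlib.crc32(ctype + data) & 0xFFFFFFFF:
            raise ValueError("bad CRC in %r chunk" % ctype)
        yield ctype, data
        if ctype == b"IEND":
            return

def make_chunk(ctype, data):
    """Build one well-formed chunk, used here only for the demonstration."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)

good = PNG_SIGNATURE + make_chunk(b"tEXt", b"hello") + make_chunk(b"IEND", b"")
for ctype, data in iter_chunks(io.BytesIO(good)):
    print(ctype, len(data))

# A header claiming a 2 GB chunk, followed by only two bytes of data.
evil = PNG_SIGNATURE + struct.pack(">I4s", 0x7FFFFFFF, b"IDAT") + b"xx"
try:
    list(iter_chunks(io.BytesIO(evil)))
except ValueError as err:
    print("rejected:", err)
```

A lying length field costs at most one MAX_BLOCK-sized request before the reader notices the data is missing.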

If you aren't using a hardened reader then it's easy to do a standard denial-of-service attack.

I could come up with hypothetical examples, but they are pretty contrived. I tried to write one which sounded at least somewhat reasonable, but I couldn't. Therefore, I don't think this realloc(3C) behavior can cause a serious denial-of-service attack. At worst it seems to make already bad code somewhat more susceptible to attack.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2020 Andrew Dalke Scientific AB