Python, why does mmap.move() fill up the memory?

Edit: I'm using Windows 10 and Python 3.5.

I have a function that uses mmap to remove bytes from a file at a certain offset:

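import mmap

def delete_bytes(fobj, offset, size):
    # Find the file size by seeking to the end.
    fobj.seek(0, 2)
    filesize = fobj.tell()
    # Number of bytes after the deleted range that must shift left.
    move_size = filesize - offset - size

    # Map the whole file and shift the tail over the deleted range.
    fobj.flush()
    file_map = mmap.mmap(fobj.fileno(), filesize)
    file_map.move(offset, offset + size, move_size)
    file_map.close()

    # Cut off the now-duplicated bytes at the end of the file.
    fobj.truncate(filesize - size)
    fobj.flush()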

It works super fast, but when I run it on a large number of files, the memory quickly fills up and my system becomes unresponsive.

After some experimenting, I found that the move() method was the culprit here, and in particular the amount of data being moved (move_size). The amount of memory being used is equivalent to the total amount of data being moved by mmap.move(). If I have 100 files with each ~30 MB moved, the memory gets filled with ~3GB.

Why isn't the moved data released from memory?

Things I tried that had no effect:

  • calling gc.collect() at the end of the function.
  • rewriting the function to move in small chunks.
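
For reference, a chunked rewrite of the kind mentioned in the last bullet could look roughly like the sketch below. This is a reconstruction and not the code from the original post: the name delete_bytes_chunked, the chunk_size parameter and its default, and the demo_chunked helper are all assumptions made for illustration.

```python
import mmap
import os
import tempfile


def delete_bytes_chunked(fobj, offset, size, chunk_size=1024 * 1024):
    """Remove `size` bytes at `offset`, shifting the tail in chunks.

    Hypothetical variant: chunk_size is an assumed parameter.
    """
    fobj.seek(0, os.SEEK_END)
    filesize = fobj.tell()
    move_size = filesize - offset - size

    fobj.flush()
    file_map = mmap.mmap(fobj.fileno(), filesize)
    try:
        moved = 0
        # Copy front to back. Each chunk is written below the source of
        # every later chunk, so no unread data is overwritten.
        while moved < move_size:
            count = min(chunk_size, move_size - moved)
            file_map.move(offset + moved, offset + size + moved, count)
            moved += count
    finally:
        file_map.close()

    fobj.truncate(filesize - size)
    fobj.flush()


def demo_chunked():
    # Tiny chunk size so the loop runs more than once.
    with tempfile.TemporaryFile() as f:
        f.write(b"0123456789")
        delete_bytes_chunked(f, 2, 3, chunk_size=2)
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    print(demo_chunked())
```

The total number of bytes moved is the same as in the single move() call; only the size of each individual call changes.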
mahkitah asked Jun 14 '16 11:06



1 Answer

This seems like it should work. I did find one suspicious bit in the #ifdef MS_WINDOWS section of the mmapmodule.c source code. Specifically, after all the setup to parse arguments, the code does this:

if (fileno != -1 && fileno != 0) {
    /* Ensure that fileno is within the CRT's valid range */
    if (_PyVerify_fd(fileno) == 0) {
        PyErr_SetFromErrno(PyExc_OSError);
        return NULL;
    }
    fh = (HANDLE)_get_osfhandle(fileno);
    if (fh==(HANDLE)-1) {
        PyErr_SetFromErrno(PyExc_OSError);
        return NULL;
    }
    /* Win9x appears to need us seeked to zero */
    lseek(fileno, 0, SEEK_SET);
}

which moves your underlying file object's offset from "end of file" to "start of file" and then leaves it there. That seems like it should not break anything, but it might be worth doing your own seek-to-start-of-file just before calling mmap.mmap to map the file.
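
A minimal sketch of that suggestion, assuming the same delete_bytes function as in the question with one added seek(0) before the mapping is created. The try/finally block, the file_map.flush() call, and the demo helper are additions for illustration, and whether the explicit seek changes the memory behaviour is untested here.

```python
import mmap
import os
import tempfile


def delete_bytes(fobj, offset, size):
    """Remove `size` bytes at `offset` from an open read/write binary file."""
    fobj.seek(0, os.SEEK_END)
    filesize = fobj.tell()
    move_size = filesize - offset - size

    fobj.flush()
    # Explicit rewind before mapping. On Windows, mmapmodule.c does a
    # similar lseek() on its own, so this only makes the position explicit.
    fobj.seek(0)
    file_map = mmap.mmap(fobj.fileno(), filesize)
    try:
        file_map.move(offset, offset + size, move_size)
        file_map.flush()
    finally:
        # Always release the mapping, even if move() raises.
        file_map.close()

    fobj.truncate(filesize - size)
    fobj.flush()


def demo():
    # Hypothetical check on a small temporary file.
    with tempfile.TemporaryFile() as f:
        f.write(b"0123456789")
        delete_bytes(f, 2, 3)  # drops the bytes b"234"
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    print(demo())
```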

(Everything below is wrong, but left in since there are comments on it.)


In general, after using mmap(), you must use munmap() to undo the mapping. Simply closing the file descriptor has no effect. The Linux documentation calls this out explicitly:

munmap()
The munmap() system call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references. The region is also automatically unmapped when the process is terminated. On the other hand, closing the file descriptor does not unmap the region.

(The BSD documentation is similar. Windows may behave differently from Unix-like systems here, but what you are seeing suggests that they work the same way.)

Unfortunately, Python's mmap module does not bind the munmap system call (nor mprotect), at least as of both 2.7.11 and 3.4.4. As a workaround, you can use the ctypes module. See this question for an example (it calls reboot, but the same technique works for all C library functions). Or, for a somewhat nicer method, you can write wrappers in Cython.

torek answered Oct 22 '22 10:10