Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why doesn't Python's mmap work with large files?

[Edit: This problem applies only to 32-bit systems. If your computer, your OS and your python implementation are 64-bit, then mmap-ing huge files works reliably and is extremely efficient.]

I am writing a module that amongst other things allows bitwise read access to files. The files can potentially be large (hundreds of GB) so I wrote a simple class that lets me treat the file like a string and hides all the seeking and reading.

At the time I wrote my wrapper class I didn't know about the mmap module. On reading the documentation for mmap I thought "great - this is just what I needed, I'll take out my code and replace it with an mmap. It's probably much more efficient and it's always good to delete code."

The problem is that mmap doesn't work for large files! This is very surprising to me as I thought it was perhaps the most obvious application. If the file is above a few gigabytes then I get an EnvironmentError: [Errno 12] Cannot allocate memory. This only happens with a 32-bit Python build so it seems it is running out of address space, but I can't find any documentation on this.

My code is just

f = open('somelargefile', 'rb') map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) 

So my question is am I missing something obvious here? Is there a way to get mmap to work portably on large files or should I go back to my naïve file wrapper?


Update: There seems to be a feeling that the Python mmap should have the same restrictions as the POSIX mmap. To better express my frustration here is a simple class that has a small part of the functionality of mmap.

import os  class Mmap(object):     def __init__(self, f):         """Initialise with a file object."""         self.source = f      def __getitem__(self, key):         try:             # A slice             self.source.seek(key.start, os.SEEK_SET)             return self.source.read(key.stop - key.start)         except AttributeError:             # single element             self.source.seek(key, os.SEEK_SET)             return self.source.read(1) 

It's read-only and doesn't do anything fancy, but I can do this just the same as with an mmap:

map2 = Mmap(f) print map2[0:10] print map2[10000000000:10000000010] 

except that there are no restrictions on filesize. Not too difficult really...

like image 822
Scott Griffiths Avatar asked Nov 02 '09 15:11

Scott Griffiths


People also ask

How does Python mmap work?

Python's mmap provides memory-mapped file input and output (I/O). It allows you to take advantage of lower-level operating system functionality to read files as if they were one large string or array. This can provide significant performance improvements in code that requires a lot of file I/O.

Why is mmap faster?

When I ask my colleagues why mmap is faster than system calls, the answer is inevitably “system call overhead”: the cost of crossing the boundary between the user space and the kernel.

What does mmap do in Linux?

The mmap() function establishes a mapping between a process' address space and a stream file. The address space of the process from the address returned to the caller, for a length of len, is mapped onto a stream file starting at offset off.

How do you create a file like an object in Python?

To create a file object in Python use the built-in functions, such as open() and os. popen() . IOError exception is raised when a file object is misused, or file operation fails for an I/O-related reason. For example, when you try to write to a file when a file is opened in read-only mode.


1 Answers

From IEEE 1003.1:

The mmap() function shall establish a mapping between a process' address space and a file, shared memory object, or [TYM] typed memory object.

It needs all the virtual address space because that's exactly what mmap() does.

The fact that it isn't really running out of memory doesn't matter - you can't map more address space than you have available. Since you then take the result and access as if it were memory, how exactly do you propose to access more than 2^32 bytes into the file? Even if mmap() didn't fail, you could still only read the first 4GB before you ran out of space in a 32-bit address space. You can, of course, mmap() a sliding 32-bit window over the file, but that won't necessarily net you any benefit unless you can optimize your access pattern such that you limit how many times you have to visit previous windows.

like image 67
Nick Bastin Avatar answered Oct 04 '22 01:10

Nick Bastin