How to use mmap in python when the whole file is too big

Tags:

python

I have a Python script which reads a file line by line and checks whether each line matches a regular expression.

I would like to improve the performance of that script by memory-mapping the file before I search. I have looked at the mmap examples: http://docs.python.org/2/library/mmap.html

My question is: how can I mmap a file when it is too big (15GB) for the memory of my machine (4GB)?

I read the file like this:

fi = open(log_file, 'r', buffering=10*1024*1024)

for line in fi: 
    # do something

fi.close()

Since I set the buffer to 10MB, in terms of performance, is it the same as mmapping 10MB of the file at a time?

Thank you.

michael asked Jan 12 '13

People also ask

Does mmap load file into memory?

Yes, mmap creates a mapping, but it does not normally read the entire content of whatever you have mapped into memory. If you wish to do that, you can use the mlock/mlockall system calls to force the kernel to read the content of the mapping into RAM, where applicable.

What is mmap offset?

The mmap() function asks to map 'length' bytes starting at offset 'offset' from the file (or other object) specified by the file descriptor fd into memory, preferably at address 'start'. Specifically, for the last argument, 'offset' must be a multiple of the page size as returned by getpagesize(2).
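To illustrate this in Python (the file name is hypothetical, and the length and offset values are arbitrary; the sketch assumes the file is at least five granules long):

import mmap

# In Python, 'offset' must be a multiple of mmap.ALLOCATIONGRANULARITY
# (the page size on Unix, the allocation granularity on Windows).
with open('big.log', 'rb') as f:                # hypothetical file
    gran = mmap.ALLOCATIONGRANULARITY
    m = mmap.mmap(f.fileno(), length=gran,      # map one granule...
                  offset=4 * gran,              # ...at an aligned offset
                  access=mmap.ACCESS_READ)
    print(m[:80])                               # first 80 bytes of that region
    m.close()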


2 Answers

First, the memory of your machine is irrelevant. It's the size of your process's address space that's relevant. With a 32-bit Python, this will be somewhere under 4GB. With a 64-bit Python, it will be more than enough.

The reason for this is that mmap isn't about mapping a file into physical memory, but into virtual memory. An mmapped file becomes just like a special swap file for your program. Thinking about this can get a bit complicated, but the Wikipedia articles on mmap and virtual memory should help.

So, the first answer is "use a 64-bit Python". But obviously that may not be applicable in your case.

The obvious alternative is to map in the first 1GB, search it, unmap it, map in the next 1GB, and so on. You do this by passing the length and offset parameters to the mmap constructor. For example:

m = mmap.mmap(f.fileno(), length=1024*1024*1024, offset=1536*1024*1024,
              access=mmap.ACCESS_READ)  # read-only, since the file was opened for reading

However, the match you're searching for could start in the first 1GB and end in the second. So you need to use windowing: map in the first 1GB, search it, unmap it, then map in a partially-overlapping 1GB, and so on.

The question is: how much overlap do you need? If you know the maximum possible size of a match, you don't need any more than that. And if you don't know, then there is no way to actually solve the problem without breaking up your regex; if that isn't obvious, imagine how you could possibly find a 2GB match in a single 1GB window.
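Here's a minimal sketch of that windowing approach. The window and maximum-match sizes are made-up values, and since offsets must be multiples of mmap.ALLOCATIONGRANULARITY, the step is rounded down to stay aligned:

import mmap
import os
import re

def search_windows(path, pattern, window=64 * 1024 * 1024, max_match=4096):
    # Yield (absolute_offset, match_bytes) for every match of `pattern`
    # (a bytes regex) shorter than `max_match`.
    regex = re.compile(pattern)
    gran = mmap.ALLOCATIONGRANULARITY
    step = (window - max_match) // gran * gran   # keep offsets aligned
    assert step > 0
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            m = mmap.mmap(f.fileno(), length, offset=offset,
                          access=mmap.ACCESS_READ)
            try:
                last = offset + length >= size
                for match in regex.finditer(m):
                    # Matches starting in the overlapped tail reappear in
                    # the next window; report them only once.
                    if match.start() < step or last:
                        yield offset + match.start(), match.group()
            finally:
                m.close()
            offset += step

# usage: for pos, hit in search_windows(log_file, br'ERROR \d+'): print(pos, hit)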

Answering your followup question:

Since I set the buffer to 10MB, in terms of performance, is it the same as I mmap 10MB of file?

As with any performance question, if it really matters, you need to test it, and if it doesn't, don't worry about it.
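For example, a quick-and-dirty comparison sketch (the file name and pattern are made up; the two functions count slightly different things, matching lines versus total matches, but it's close enough for a rough timing):

import mmap
import re
import time

PATTERN = re.compile(br'ERROR \d+')        # made-up pattern
PATH = 'big.log'                           # made-up file name

def scan_lines(path):
    # The question's approach: buffered line-by-line matching.
    count = 0
    with open(path, 'rb', buffering=10 * 1024 * 1024) as f:
        for line in f:
            if PATTERN.search(line):
                count += 1
    return count

def scan_mmap(path):
    # Whole-file mapping; needs a 64-bit Python for a 15GB file.
    with open(path, 'rb') as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return len(PATTERN.findall(m))
        finally:
            m.close()

for fn in (scan_lines, scan_mmap):
    start = time.time()
    print(fn.__name__, fn(PATH), 'in', round(time.time() - start, 1), 's')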

If you want me to guess: I think mmap may be faster here, but only because (as J.F. Sebastian implied) looping and calling re.match 128K times as often may cause your code to be CPU-bound instead of IO-bound. But you could optimize that away without mmap, just by using read. So, would mmap be faster than read? Given the sizes involved, I'd expect mmap to be much faster on old Unix platforms, about the same on modern Unix platforms, and a bit slower on Windows. (You can still get large performance benefits out of mmap over read or read+lseek if you're using madvise, but that's not relevant here.) But really, that's just a guess.

The most compelling reason to use mmap is usually that it's simpler than read-based code, not that it's faster. When you have to use windowing even with mmap, and when you don't need to do any seeking with read, this is less compelling, but still, if you try writing the code both ways, I'd expect your mmap code would end up a bit more readable. (Especially if you tried to optimize out the buffer copies from the obvious read solution.)

abarnert answered Oct 05 '22


I came to try mmap because I was using fileh.readline() on a file dozens of GB in size and wanted to make it faster. The Unix strace utility seems to reveal that the file is read in 4kB chunks; the strace output scrolls slowly, and I know parsing the file takes many hours.

$ strace -v -f -p 32495
Process 32495 attached
read(5, "blah blah blah foo bar xxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
^CProcess 32495 detached
$

This thread is so far the only one explaining to me that I should not try to mmap a too-large file. I do not understand why there isn't already a helper function like mmap_for_dummies(filename) which would internally do os.path.getsize(filename) and then either a normal open(filename, 'r', buffering=10*1024*1024) or mmap.mmap(open(filename).fileno()), along the lines of the sketch below. I certainly want to avoid fiddling with a sliding-window approach myself, but a function making the simple decision whether or not to mmap would be enough for me.
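A minimal sketch of such a helper (the name comes from the post; the 1GB threshold is my own invention, and on a 64-bit Python the size check is arguably unnecessary, per the accepted answer):

import mmap
import os

def mmap_for_dummies(filename, budget=1024 * 1024 * 1024):
    # Map the whole file if it is under `budget` (made-up threshold),
    # otherwise fall back to a plain buffered file object. Both results
    # support .readline(); the mapping returns bytes with '\n' preserved.
    if os.path.getsize(filename) <= budget:
        with open(filename, 'rb') as f:
            # Closing f is fine: the mapping holds its own file reference.
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return open(filename, 'rb', buffering=10 * 1024 * 1024)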

Finally, it is still not clear to me why some examples on the internet mention open(filename, 'rb') without explanation (e.g. https://docs.python.org/2/library/mmap.html). Given that one often wants to use the file in a for loop with a .readline() call, I do not know whether I should open it in 'rb' or just 'r' mode (I guess it is necessary to preserve the '\n').
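For what it's worth, mmap deals in raw bytes, which is presumably why the docs open in binary mode; a quick check along these lines (hypothetical file name) shows that readline() on a mapping keeps the newline:

import mmap

with open('big.log', 'rb') as f:           # hypothetical file
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(repr(m.readline()))              # e.g. b'first line\n': bytes, '\n' kept
    m.close()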

Thanks for mentioning the buffering=10*1024*1024 argument; it is probably more helpful than changing my code to gain some speed.

Martin answered Oct 05 '22