I have a Python script which reads a file line by line and checks whether each line matches a regular expression.
I would like to improve the performance of that script by memory-mapping the file before searching it. I have looked at the mmap example: http://docs.python.org/2/library/mmap.html
My question is: how can I mmap a file when it is too big (15GB) for the memory of my machine (4GB)?
I read the file like this:
fi = open(log_file, 'r', buffering=10*1024*1024)
for line in fi:
    pass  # do something with the line
fi.close()
Since I set the buffer to 10MB, is that, in terms of performance, the same as mmapping 10MB of the file at a time?
Thank you.
Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.
The mmap() function asks to map 'length' bytes starting at offset 'offset' from the file (or other object) specified by the file descriptor fd into memory, preferably at address 'start'. Specifically, for the last argument: 'offset' should be a multiple of the page size as returned by getpagesize(2).
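In Python, the same constraint is exposed as mmap.ALLOCATIONGRANULARITY (the page size on most Unix systems). Below is a minimal sketch of rounding an arbitrary byte position down to a legal offset before mapping a window; the file name and sizes are just placeholders:
import mmap

def aligned(pos):
    # round down to a legal mmap offset (a multiple of the allocation granularity)
    return pos - (pos % mmap.ALLOCATIONGRANULARITY)

with open('big.log', 'rb') as f:            # placeholder path
    offset = aligned(1536 * 1024 * 1024)    # byte position you actually care about
    # the file must be at least offset + length bytes long
    window = mmap.mmap(f.fileno(), length=1024 * 1024, offset=offset,
                       access=mmap.ACCESS_READ)
    first_bytes = window[:100]
    window.close()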
First, the memory of your machine is irrelevant. It's the size of your process's address space that's relevant. With a 32-bit Python, this will be somewhere under 4GB. With a 64-bit Python, it will be more than enough.
The reason for this is that mmap isn't about mapping a file into physical memory, but into virtual memory. An mmapped file becomes just like a special swap file for your program. Thinking about this can get a bit complicated, but the Wikipedia articles on physical and virtual memory should help.
So, the first answer is "use a 64-bit Python". But obviously that may not be applicable in your case.
The obvious alternative is to map in the first 1GB, search that, unmap it, map in the next 1GB, etc. The way you do this is by specifying the length and offset parameters to the mmap method. For example:
m = mmap.mmap(f.fileno(), length=1024*1024*1024, offset=1536*1024*1024, access=mmap.ACCESS_READ)  # ACCESS_READ since the file was opened read-only
However, the regex you're searching for could match a string that starts in the first 1GB and ends in the second. So, you need to use windowing: map in the first 1GB, search, unmap, then map in a partially-overlapping 1GB, etc.
The question is, how much overlap do you need? If you know the maximum possible size of a match, you don't need anything more than that. And if you don't know… well, then there is no way to actually solve the problem without breaking up your regex—if that isn't obvious, imagine how you could possibly find a 2GB match in a single 1GB window.
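A rough sketch of that windowed search, assuming the longest possible match fits in OVERLAP bytes (the path, pattern, and sizes below are placeholders, not part of the original question):
import mmap
import os
import re

log_file = 'big.log'                  # placeholder path
pattern = re.compile(br'ERROR.*')     # placeholder pattern; mmap hands back bytes
WINDOW = 1024 * 1024 * 1024           # map 1GB at a time
OVERLAP = 64 * 1024                   # must be >= the longest possible match
GRAN = mmap.ALLOCATIONGRANULARITY     # offsets must be multiples of this

size = os.path.getsize(log_file)
with open(log_file, 'rb') as f:
    offset = 0
    while offset < size:
        length = min(WINDOW, size - offset)
        m = mmap.mmap(f.fileno(), length=length, offset=offset,
                      access=mmap.ACCESS_READ)
        last = (offset + length >= size)
        for match in pattern.finditer(m):
            # skip matches that start inside the overlap region; the next,
            # overlapping window will see them in full
            if last or match.start() < length - OVERLAP:
                print(offset + match.start(), match.group(0))
        m.close()
        if last:
            break
        # step forward, keeping OVERLAP bytes of overlap and staying aligned
        offset = ((offset + length - OVERLAP) // GRAN) * GRAN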
Answering your followup question:
Since I set the buffer to 10MB, is that, in terms of performance, the same as mmapping 10MB of the file at a time?
As with any performance question, if it really matters, you need to test it, and if it doesn't, don't worry about it.
If you want me to guess: I think mmap may be faster here, but only because (as J.F. Sebastian implied) looping and calling re.match 128K times as often may cause your code to be CPU-bound instead of IO-bound. But you could optimize that away without mmap, just by using read. So, would mmap be faster than read? Given the sizes involved, I'd expect the performance of mmap to be much faster on old Unix platforms, about the same on modern Unix platforms, and a bit slower on Windows. (You can still get large performance benefits out of mmap over read or read+lseek if you're using madvise, but that's not relevant here.) But really, that's just a guess.
The most compelling reason to use mmap is usually that it's simpler than read-based code, not that it's faster. When you have to use windowing even with mmap, and when you don't need to do any seeking with read, this is less compelling, but still, if you try writing the code both ways, I'd expect your mmap code to end up a bit more readable. (Especially if you tried to optimize out the buffer copies from the obvious read solution.)
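For comparison, a read-based version of the same windowing idea might look like the sketch below (again, the path, pattern, and sizes are placeholders, and the overlap has to be at least as long as the longest possible match):
import re

pattern = re.compile(br'ERROR.*')   # placeholder pattern
CHUNK = 10 * 1024 * 1024            # read 10MB at a time
OVERLAP = 64 * 1024                 # must be >= the longest possible match

with open('big.log', 'rb') as f:    # placeholder path
    tail = b''
    base = 0                        # absolute offset of buf[0] in the file
    while True:
        chunk = f.read(CHUNK)
        buf = tail + chunk
        for match in pattern.finditer(buf):
            # near the seam, defer to the next iteration, which sees the
            # same bytes again via `tail`
            if not chunk or match.start() < len(buf) - OVERLAP:
                print(base + match.start(), match.group(0))
        if not chunk:
            break
        tail = buf[-OVERLAP:]
        base += len(buf) - len(tail)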
I came to try using mmap because I used fileh.readline() on a file dozens of GB in size and wanted to make it faster. The Unix strace utility seems to reveal that the file is read in 4kB chunks now; at least the output from strace seems to be printed slowly, and I know parsing the file takes many hours.
$ strace -v -f -p 32495
Process 32495 attached
read(5, "blah blah blah foo bar xxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
^CProcess 32495 detached
$
This thread is so far the only one explaining to me that I should not try to mmap a file that is too large. What I do not understand is why there isn't already a helper function like mmap_for_dummies(filename) which would internally call os.path.getsize(filename) and then either do a normal open(filename, 'r', buffering=10*1024*1024) or mmap.mmap(open(filename).fileno()). I certainly want to avoid fiddling with a sliding-window approach myself, but a function that makes the simple decision of whether to mmap or not would be enough for me.
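For what it's worth, such a helper could be as small as the sketch below; the name and the 1GB cutoff are my own invention, not an existing API:
import mmap
import os

def mmap_for_dummies(filename, threshold=1024 * 1024 * 1024):
    # hypothetical helper: mmap files below the cutoff, otherwise fall
    # back to a plain buffered file object
    if os.path.getsize(filename) <= threshold:
        f = open(filename, 'rb')
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        f.close()   # the mapping stays valid after the file object is closed
        return m
    return open(filename, 'rb', buffering=10 * 1024 * 1024)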
Finally, it is still not clear to me why some examples on the internet mention open(filename, 'rb') without explanation (e.g. https://docs.python.org/2/library/mmap.html). Given that one often wants to use the file in a for loop with a .readline() call, I do not know whether I should open in 'rb' or just 'r' mode (I guess 'rb' is needed to preserve the '\n').
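My working guess is that the 'rb' in those examples is there because mmap exposes raw bytes, so any regex run over the map has to be a bytes pattern too; a minimal sketch (path and pattern are placeholders):
import mmap
import re

with open('example.log', 'rb') as f:    # 'rb': mmap works on raw bytes
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    match = re.search(br'error', mm)    # bytes pattern to match the bytes mapping
    if match:
        print(match.group(0))
    mm.close()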
Thanks for mentioning the buffering=10*1024*1024 argument; it is probably a more helpful way to gain some speed than changing my code.