I have a Python script which reads a file line by line and checks whether each line matches a regular expression.
I would like to improve the performance of that script by memory-mapping the file before searching it. I have looked at the mmap example: http://docs.python.org/2/library/mmap.html
My question is: how can I mmap a file when it is too big (15GB) for the memory of my machine (4GB)?
I read the file like this:
fi = open(log_file, 'r', buffering=10*1024*1024)
for line in fi:
    pass  # do something with the line
fi.close()
Since I set the buffer to 10MB, is that, in terms of performance, the same as mmapping 10MB of the file at a time?
Thank you.
Yes, mmap creates a mapping. It does not normally read the entire content of whatever you have mapped into memory. If you wish to do that you can use the mlock/mlockall system call to force the kernel to read into RAM the content of the mapping, if applicable.
The mmap() function asks to map 'length' bytes starting at offset 'offset' from the file (or other object) specified by the file descriptor fd into memory, preferably at address 'start'. Specifically, for the last argument: 'offset' should be a multiple of the page size as returned by getpagesize(2).
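In Python, the same constraint is exposed as mmap.ALLOCATIONGRANULARITY (the page size on most Unix systems). Below is a minimal sketch of rounding an arbitrary byte position down to a legal offset before mapping a window; the file name and sizes are just placeholders:
import mmap

def aligned(pos):
    # round down to a legal mmap offset (a multiple of the allocation granularity)
    return pos - (pos % mmap.ALLOCATIONGRANULARITY)

with open('big.log', 'rb') as f:            # placeholder path
    offset = aligned(1536 * 1024 * 1024)    # byte position you actually care about
    # the file must be at least offset + length bytes long
    window = mmap.mmap(f.fileno(), length=1024 * 1024, offset=offset,
                       access=mmap.ACCESS_READ)
    first_bytes = window[:100]
    window.close()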
First, the memory of your machine is irrelevant. It's the size of your process's address space that's relevant. With a 32-bit Python, this will be somewhere under 4GB. With a 64-bit Python, it will be more than enough.
The reason for this is that mmap isn't about mapping a file into physical memory, but into virtual memory. An mmapped file becomes just like a special swap file for your program. Thinking about this can get a bit complicated, but the Wikipedia articles on physical and virtual memory should help.
So, the first answer is "use a 64-bit Python". But obviously that may not be applicable in your case.
The obvious alternative is to map in the first 1GB, search that, unmap it, map in the next 1GB, etc. The way you do this is by specifying the length and offset parameters to the mmap method. For example:
m = mmap.mmap(f.fileno(), length=1024*1024*1024, offset=1536*1024*1024, access=mmap.ACCESS_READ)  # ACCESS_READ since the file was opened read-only
However, the regex you're searching for could match a string that starts in the first 1GB and ends in the second. So, you need to use windowing: map in the first 1GB, search, unmap, then map in a partially-overlapping 1GB, etc.
The question is, how much overlap do you need? If you know the maximum possible size of a match, you don't need anything more than that. And if you don't know… well, then there is no way to actually solve the problem without breaking up your regex—if that isn't obvious, imagine how you could possibly find a 2GB match in a single 1GB window.
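A rough sketch of that windowed search, assuming the longest possible match fits in OVERLAP bytes (the path, pattern, and sizes below are placeholders, not part of the original question):
import mmap
import os
import re

log_file = 'big.log'                  # placeholder path
pattern = re.compile(br'ERROR.*')     # placeholder pattern; mmap hands back bytes
WINDOW = 1024 * 1024 * 1024           # map 1GB at a time
OVERLAP = 64 * 1024                   # must be >= the longest possible match
GRAN = mmap.ALLOCATIONGRANULARITY     # offsets must be multiples of this

size = os.path.getsize(log_file)
with open(log_file, 'rb') as f:
    offset = 0
    while offset < size:
        length = min(WINDOW, size - offset)
        m = mmap.mmap(f.fileno(), length=length, offset=offset,
                      access=mmap.ACCESS_READ)
        last = (offset + length >= size)
        for match in pattern.finditer(m):
            # skip matches that start inside the overlap region; the next,
            # overlapping window will see them in full
            if last or match.start() < length - OVERLAP:
                print(offset + match.start(), match.group(0))
        m.close()
        if last:
            break
        # step forward, keeping OVERLAP bytes of overlap and staying aligned
        offset = ((offset + length - OVERLAP) // GRAN) * GRAN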
Answering your followup question:
Since I set the buffer to 10MB, is that, in terms of performance, the same as mmapping 10MB of the file at a time?
As with any performance question, if it really matters, you need to test it, and if it doesn't, don't worry about it.
If you want me to guess: I think mmap may be faster here, but only because (as J.F. Sebastian implied) looping and calling re.match 128K times as often may cause your code to be CPU-bound instead of IO-bound. But you could optimize that away without mmap, just by using read. So, would mmap be faster than read? Given the sizes involved, I'd expect the performance of mmap to be much faster on old Unix platforms, about the same on modern Unix platforms, and a bit slower on Windows. (You can still get large performance benefits out of mmap over read or read+lseek if you're using madvise, but that's not relevant here.) But really, that's just a guess.
The most compelling reason to use mmap is usually that it's simpler than read-based code, not that it's faster. When you have to use windowing even with mmap, and when you don't need to do any seeking with read, this is less compelling, but still, if you try writing the code both ways, I'd expect your mmap code to end up a bit more readable. (Especially if you tried to optimize out the buffer copies from the obvious read solution.)
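For comparison, a read-based version of the same windowing idea might look like the sketch below (again, the path, pattern, and sizes are placeholders, and the overlap has to be at least as long as the longest possible match):
import re

pattern = re.compile(br'ERROR.*')   # placeholder pattern
CHUNK = 10 * 1024 * 1024            # read 10MB at a time
OVERLAP = 64 * 1024                 # must be >= the longest possible match

with open('big.log', 'rb') as f:    # placeholder path
    tail = b''
    base = 0                        # absolute offset of buf[0] in the file
    while True:
        chunk = f.read(CHUNK)
        buf = tail + chunk
        for match in pattern.finditer(buf):
            # near the seam, defer to the next iteration, which sees the
            # same bytes again via `tail`
            if not chunk or match.start() < len(buf) - OVERLAP:
                print(base + match.start(), match.group(0))
        if not chunk:
            break
        tail = buf[-OVERLAP:]
        base += len(buf) - len(tail)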
I came to try using mmap because I used fileh.readline() on a file dozens of GB in size and wanted to make it faster. The Unix strace utility seems to reveal that the file is read in 4kB chunks now; at least the output from strace seems to be printed slowly, and I know parsing the file takes many hours.
$ strace -v -f -p 32495
Process 32495 attached
read(5, "blah blah blah foo bar xxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
^CProcess 32495 detached
$
This thread is so far the only one explaining to me that I should not try to mmap a file that is too large. What I do not understand is why there isn't already a helper function like mmap_for_dummies(filename) which would internally call os.path.getsize(filename) and then either do a normal open(filename, 'r', buffering=10*1024*1024) or mmap.mmap(open(filename).fileno()). I certainly want to avoid fiddling with a sliding-window approach myself, but a function that makes the simple decision of whether to mmap or not would be enough for me.
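For what it's worth, such a helper could be as small as the sketch below; the name and the 1GB cutoff are my own invention, not an existing API:
import mmap
import os

def mmap_for_dummies(filename, threshold=1024 * 1024 * 1024):
    # hypothetical helper: mmap files below the cutoff, otherwise fall
    # back to a plain buffered file object
    if os.path.getsize(filename) <= threshold:
        f = open(filename, 'rb')
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        f.close()   # the mapping stays valid after the file object is closed
        return m
    return open(filename, 'rb', buffering=10 * 1024 * 1024)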
Finally, it is still not clear to me why some examples on the internet mention open(filename, 'rb') without explanation (e.g. https://docs.python.org/2/library/mmap.html). Given that one often wants to use the file in a for loop with a .readline() call, I do not know whether I should open in 'rb' or just 'r' mode (I guess 'rb' is needed to preserve the '\n').
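My working guess is that the 'rb' in those examples is there because mmap exposes raw bytes, so any regex run over the map has to be a bytes pattern too; a minimal sketch (path and pattern are placeholders):
import mmap
import re

with open('example.log', 'rb') as f:    # 'rb': mmap works on raw bytes
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    match = re.search(br'error', mm)    # bytes pattern to match the bytes mapping
    if match:
        print(match.group(0))
    mm.close()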
Thanks for mentioning the buffering=10*1024*1024 argument; it is probably a more helpful way to gain some speed than changing my code.