In short, mmap() is great if you're doing a large amount of IO in terms of total bytes transferred; this is because it reduces the number of copies needed, and can significantly reduce the number of kernel entries needed for reading cached data.
Advantages of mmap()

Aside from any potential page faults, reading from and writing to a memory-mapped file does not incur any system call or context switch overhead. It is as simple as accessing memory. When multiple processes map the same object into memory, the data is shared among all the processes.
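As a minimal sketch of that last point (not from the excerpt above; the file name shared.bin is a placeholder and the file must already exist and be at least a page long), two processes that map the same object with MAP_SHARED see each other's writes:

/* Sketch only: parent and child map the same file with MAP_SHARED, so the
 * child's write is visible to the parent without any read()/write() calls.
 * "shared.bin" is a placeholder; the file must exist and be non-empty. */
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int fd = open("shared.bin", O_RDWR);
    if (fd < 0) { std::perror("open"); return 1; }

    void *m = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { std::perror("mmap"); return 1; }
    char *p = static_cast<char *>(m);

    if (fork() == 0) {      // child: write through the shared mapping
        p[0] = 'X';
        _exit(0);
    }
    wait(nullptr);          // parent: wait for the child, then read the byte
    std::printf("first byte is now: %c\n", p[0]);

    munmap(m, 4096);
    close(fd);
    return 0;
}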
When I ask my colleagues why mmap is faster than system calls, the answer is inevitably “system call overhead”: the cost of crossing the boundary between the user space and the kernel.
malloc does most of its memory management in user space; only when the program requires additional memory is more borrowed from the OS. mmap, on the other hand, always involves a switch into kernel land.
I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reasons why mmap or read might be faster or slower.

mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors, for the same reasons that switching between different processes is expensive.

However, a memory-mapped file's pages stay in the disk cache for as long as you keep using the mapping; with read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance.)

The discussion of mmap/read reminds me of two other performance discussions:
Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.
Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.
Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real-world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)
There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive treatment of the pros and cons, but rather an addendum to other answers here.
Taking the case where the file is already fully cached1 as the baseline2, mmap might seem pretty much like magic:

- mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
- mmap doesn't require a copy of the file data from kernel to user-space.
- mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.

In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.
Well, it can.
A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page accessed in a new mapping, even though it might be hidden by the page-fault mechanism.
For example, a typical implementation that just mmaps the entire file will need to fault in 100 GB / 4K = 25 million pages to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.
Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now3. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)4.
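As a rough sketch of what that looks like (illustrative only; the helper-function name and error handling are mine, not part of the answer):

// Sketch: ask the kernel to set up all page tables at mmap() time via
// MAP_POPULATE (Linux-specific), so later accesses take no minor faults.
// The per-page work still happens; it just moves into the mmap call itself
// and shows up as kernel time.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_whole_file_populated(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void *p = mmap(nullptr, st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd);                     // the mapping stays valid after close()
    if (p == MAP_FAILED) return nullptr;

    *len_out = st.st_size;
    return p;
}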
Finally, even in user-space, accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmapping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.
Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how the rest of the translation caching performs, (b) how well hardware prefetch deals with the TLB - e.g., can a prefetch trigger a page walk? and (c) how fast and how parallel the page-walking hardware is. On modern high-end x86 Intel processors, the page-walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!
The read() syscall, which is what generally underlies the "block read" type calls offered, e.g., in C, C++ and other languages, has one primary disadvantage that everyone is well aware of: every read() call of N bytes must copy N bytes from kernel to user space.

On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space, and re-use that repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
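A minimal sketch of that reused-buffer loop (the file name and the 64 KiB buffer size are arbitrary placeholders):

// Sketch: one small user-space buffer, re-used for every read() call.
// Each call copies up to buf_size bytes from the kernel page cache into buf.
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

int main() {
    const size_t buf_size = 64 * 1024;
    char *buf = static_cast<char *>(std::malloc(buf_size));

    int fd = open("file.bin", O_RDONLY);
    if (fd < 0 || buf == nullptr) { std::perror("open/malloc"); return 1; }

    ssize_t n;
    while ((n = read(fd, buf, buf_size)) > 0) {
        // process the n valid bytes in buf here, then reuse the same buffer
    }
    if (n < 0) std::perror("read");

    close(fd);
    std::free(buf);
    return 0;
}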
So basically you have the following comparison to determine which is faster for a single read of a large file: is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?
On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.
In particular, the mmap approach becomes relatively faster when:

- the OS has fast minor-fault handling and a MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.

... while the read() approach becomes relatively faster when:

- the read() syscall has good copy performance, e.g., good copy_to_user performance on the kernel side.

The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).
The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:

- The addition of fault-around handling in the Linux kernel, which really helps the mmap case without MAP_POPULATE.
- The addition of optimized copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really helps the read() case.
- The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns (a rough way to measure this yourself is sketched after this list). Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.
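One rough way to estimate that "do nothing" syscall cost on your own machine (my sketch, not from the answer above; it times a zero-byte read() from /dev/null, and the iteration count is arbitrary):

// Sketch: time a cheap syscall in a tight loop to approximate raw kernel
// entry/exit overhead. A read() of 0 bytes does essentially no work.
#include <chrono>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("/dev/null", O_RDONLY);
    char c;
    const long iters = 1000000;

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        (void)read(fd, &c, 0);     // enters and exits the kernel, nothing else
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("~%.0f ns per syscall\n", ns / iters);
    close(fd);
    return 0;
}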
All of this is a relative disadvantage for read()-based methods as compared to mmap-based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost, because using large buffers usually performs worse: you exceed the L1 size and hence are constantly suffering cache misses.
On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and then access it efficiently, at the cost of only a single system call.
1 This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though, because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in 2.
2 ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application-level changes you can make to improve access patterns).
3 You could get around that, for example, by sequentially mmapping in windows of a smaller size, say 100 MB.
4 In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using fault-around - so the actual number of minor faults is reduced by a factor of 16 or so.
The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.
I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.
I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache.....
In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:
(btw - I love mmap()/MapViewOfFile()).
mmap is way faster. You might write a simple benchmark to prove it to yourself:
#include <fstream>

char data[0x1000];
std::ifstream in("file.bin", std::ios::binary);
while (in)
{
    in.read(data, 0x1000);
    // in.gcount() bytes of data are valid here; do something with them
}
versus:
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

const off_t file_size = something;
const size_t page_size = 0x1000;
off_t off = 0;
void *data;
int fd = open("filename.bin", O_RDONLY);
while (off < file_size)
{
    // the flags argument must be MAP_PRIVATE or MAP_SHARED;
    // passing 0 makes mmap fail with EINVAL
    data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off);
    // do stuff with data
    munmap(data, page_size);
    off += page_size;
}
Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
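For what it's worth, one way to fill in that detail (a sketch that reuses fd and page_size from the snippet above; it is not part of the original code) is to get the file size with fstat() and shorten the final mapping:

// Sketch: determine the real size with fstat() and clamp the last window so
// the loop stops exactly at end-of-file (offsets stay page-aligned).
#include <algorithm>
#include <sys/stat.h>

struct stat st;
fstat(fd, &st);                         // fd as opened above
const off_t real_size = st.st_size;

for (off_t off = 0; off < real_size; off += page_size)
{
    size_t len = std::min<off_t>(page_size, real_size - off);
    void *chunk = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
    // do stuff with the first `len` bytes of chunk
    munmap(chunk, len);
}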
If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).
A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(
Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.
Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without measurably sacrificing any performance.
Edit to clean up answer list: @jbl:

the sliding window mmap sounds interesting. Can you say a little more about it?

Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).
Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and, similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.
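To give a flavor of the sliding-window idea (this is only my illustrative sketch of mapping a large file one window at a time - closer to the sequential-windows note in footnote 3 than to the author's actual Boost::Iostreams Source; the function name, window size and callback are invented):

// Sketch: walk a large file by mapping one fixed-size window at a time,
// so the address-space footprint stays bounded even for multi-GB files.
#include <algorithm>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void for_each_window(const char *path,
                     void (*process_chunk)(const char *data, size_t len),
                     size_t window = 100u << 20 /* 100 MB, page-aligned */) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return; }
    const off_t file_size = st.st_size;

    for (off_t off = 0; off < file_size; off += window) {
        size_t len = std::min<off_t>(window, file_size - off);
        // map only this window; the rest of the file stays unmapped
        void *p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) break;
        process_chunk(static_cast<const char *>(p), len);
        munmap(p, len);
    }
    close(fd);
}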