Do memory mapped files provide advantage for large buffers?

My program works with large data sets that need to be stored in contiguous memory (several gigabytes). Allocating memory with std::allocator (i.e. malloc or new) causes system stalls as large portions of virtual memory are reserved and physical memory fills up.

Since the program will mostly work on only small portions at a time, my question is whether using memory mapped files would provide an advantage (i.e. mmap or the Windows equivalent). That is, creating a large sparse temporary file and mapping it into virtual memory. Or is there another technique that would change the system's paging strategy so that fewer pages are loaded into physical memory at a time?

I'm trying to avoid building a streaming mechanism that loads portions of a file at a time, and instead rely on the system's VM paging.
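
For illustration, here is a minimal sketch of the approach I have in mind (Linux only; error handling abbreviated, and the size constant is arbitrary):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const size_t size = 8ULL * 1024 * 1024 * 1024;  // 8 GiB backing store (arbitrary)

        // Create an unlinked temporary file; it disappears once closed.
        char path[] = "/tmp/bigbuf-XXXXXX";
        int fd = mkstemp(path);
        if (fd < 0) { perror("mkstemp"); return 1; }
        unlink(path);

        // Grow the file to full size. On most filesystems this yields a
        // sparse file: no disk blocks are allocated until pages are written.
        if (ftruncate(fd, static_cast<off_t>(size)) != 0) { perror("ftruncate"); return 1; }

        // Map the file into virtual memory. Pages fault in on demand, and
        // under memory pressure the kernel can write them back and evict them.
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        char* buf = static_cast<char*>(p);
        buf[0] = 1;         // touching a page allocates physical memory for it
        buf[size - 1] = 2;  // widely separated touches keep the file sparse

        munmap(p, size);
        close(fd);
        return 0;
    }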

asked Dec 01 '14 by tmlen

3 Answers

Yes, mmap has the potential to speed things up.

Things to consider:

  • Remember the VMM will page things in and out in page-sized blocks (4 KiB on Linux).
  • If your memory access is well localised over time, this will work well. But if you do random access over your entire file, you will still end up with a lot of seeking and thrashing. So consider whether your 'small portions' correspond to localised bits of the file.
  • For large allocations, malloc and free will use mmap with MAP_ANON anyway, so the difference with memory mapping a file is simply that you are getting the VMM to do the I/O for you.
  • Consider using madvise with mmap to assist the VMM in paging well (see the sketch after this list).
  • When you use open and read (plus, as erenon suggests, posix_fadvise), your file is still held in kernel buffers anyway (i.e. it's not immediately written out) unless you also use O_DIRECT. So in both situations you are relying on the kernel for I/O scheduling.
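
To make the madvise point concrete, here is a minimal sketch, assuming a Linux system; the file name, window offset, and window size are arbitrary placeholders:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = open("data.bin", O_RDONLY);  // placeholder file name
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        const size_t size = static_cast<size_t>(st.st_size);

        void* p = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        char* base = static_cast<char*>(p);

        // Scattered access pattern: suppress aggressive readahead.
        // (Use MADV_SEQUENTIAL instead for a linear scan.)
        madvise(base, size, MADV_RANDOM);

        // Before working on one small portion, hint that it will be needed
        // so the kernel can start faulting those pages in. Addresses passed
        // to madvise must be page-aligned (offset 0 trivially is).
        const size_t off = 0, window = 1 << 20;  // 1 MiB window (arbitrary)
        madvise(base + off, window, MADV_WILLNEED);

        // ... work on base[off .. off+window) ...

        // Done with the region: let the kernel drop those pages. Clean
        // file-backed pages need no write-back on eviction.
        madvise(base + off, window, MADV_DONTNEED);

        munmap(p, size);
        close(fd);
        return 0;
    }
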
answered Sep 16 '22 by abligh

If the data is already in a file, mmap would speed things up, especially in the non-sequential case. (In the sequential case, read wins.)

If using open and read, consider using posix_fadvise as well.
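
For illustration, a minimal sketch of that combination, assuming POSIX/Linux; the file name and chunk size are arbitrary:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = open("data.bin", O_RDONLY);  // placeholder file name
        if (fd < 0) { perror("open"); return 1; }

        // Declare the access pattern up front. SEQUENTIAL enlarges readahead;
        // POSIX_FADV_RANDOM would disable it instead. A length of 0 means
        // "the whole file".
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[1 << 16];  // 64 KiB chunks (arbitrary)
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            // ... process n bytes in buf ...
        }

        close(fd);
        return 0;
    }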

answered Sep 17 '22 by erenon


This really depends on your mmap() implementation. Mapping a file into memory has several advantages that can be exploited by the kernel:

  • The kernel knows that the contents of the mmap() pages are already present on disk. If it decides to evict these pages, it can omit the write-back.

  • You reduce copying operations: read() operations typically first read the data into kernel memory, then copy it over to user space. With mmap(), the page-cache pages are mapped directly into your address space, so that second copy never happens.

  • The reduced copies also mean that less memory is used to store data from the file, which means more memory is available for other uses, which can reduce paging as well.

    This is also why it is generally a bad idea to use large caches within an I/O library: modern kernels already cache everything they read from disk, so keeping another copy in user space actually reduces the amount of data that can be cached.

Of course, you also avoid a lot of headaches that result from buffering data of unknown size in your application. But that is just a convenience for you as a programmer.

However, even though the kernel can exploit these properties, it does not necessarily do so. My experience is that Linux mmap() is generally fine; on AIX, however, I have witnessed really bad mmap() performance. So, if your goal is performance, it's the old measure-compare-decide standby.
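
In that spirit, a rough harness for comparing the two paths on your own data might look like the sketch below (unscientific: page-cache state between the two runs will skew the numbers):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>

    // Checksum the file via read() into a user-space buffer.
    static unsigned long sum_read(int fd) {
        static char buf[1 << 20];  // 1 MiB chunks (arbitrary)
        unsigned long sum = 0;
        lseek(fd, 0, SEEK_SET);
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; ++i) sum += static_cast<unsigned char>(buf[i]);
        return sum;
    }

    // Checksum the file through a read-only mapping.
    static unsigned long sum_mmap(int fd, size_t size) {
        void* m = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
        if (m == MAP_FAILED) { perror("mmap"); return 0; }
        const unsigned char* p = static_cast<const unsigned char*>(m);
        unsigned long sum = 0;
        for (size_t i = 0; i < size; ++i) sum += p[i];
        munmap(m, size);
        return sum;
    }

    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        const size_t size = static_cast<size_t>(st.st_size);

        auto t0 = std::chrono::steady_clock::now();
        unsigned long a = sum_read(fd);
        auto t1 = std::chrono::steady_clock::now();
        unsigned long b = sum_mmap(fd, size);
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::duration<double, std::milli>;
        std::printf("read: sum=%lu in %.1f ms; mmap: sum=%lu in %.1f ms\n",
                    a, ms(t1 - t0).count(), b, ms(t2 - t1).count());
        close(fd);
        return 0;
    }

Cold-cache and warm-cache runs usually tell different stories, so measure both.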

answered Sep 16 '22 by cmaster - reinstate monica