Fast resize of a mmap file

Tags: c++, c, linux, mmap

I need a copy-free re-size of a very large mmap file while still allowing concurrent access to reader threads.

The simple way is to use two MAP_SHARED mappings (grow the file, then create a second mapping that includes the grown region) in the same process over the same file and then unmap the old mapping once all readers that could access it are finished. However, I am curious if the scheme below could work, and if so, is there any advantage to it.
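A minimal sketch of the two-mapping approach described above, assuming Linux and a file descriptor opened read/write. The names `map_grown_file` and `demo_two_mappings` are made up for illustration, and tracking when the last reader of the old mapping has finished is left out entirely.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: grow the file behind fd to new_size bytes and return a
 * second MAP_SHARED mapping covering the whole file. The old mapping stays
 * valid; the caller munmap()s it once no reader can still be using it. */
void *map_grown_file(int fd, size_t old_size, size_t new_size)
{
    assert(new_size >= old_size);
    if (ftruncate(fd, (off_t)new_size) != 0)
        return MAP_FAILED;
    return mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Self-check on a temporary file; returns 0 if both views see the same data. */
int demo_two_mappings(void)
{
    char path[] = "/tmp/mmap-grow-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    size_t old_size = 4096, new_size = 8192;
    if (ftruncate(fd, (off_t)old_size) != 0)
        return -1;
    char *old_map = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_map == MAP_FAILED)
        return -1;
    strcpy(old_map, "hello");

    char *new_map = map_grown_file(fd, old_size, new_size);
    if (new_map == MAP_FAILED)
        return -1;

    /* Both views share the same page cache pages, so nothing is copied. */
    int ok = strcmp(new_map, "hello") == 0 && new_map[new_size - 1] == 0;
    new_map[0] = 'J';                /* a write through one view ...          */
    ok = ok && old_map[0] == 'J';    /* ... is visible through the other view */

    munmap(old_map, old_size);       /* only after the last old reader is done */
    munmap(new_map, new_size);
    close(fd);
    return ok ? 0 : -1;
}
```

The cost of this approach is address space for both mappings while they coexist, not a copy of the data.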

  1. mmap a file with MAP_PRIVATE
  2. do read-only access to this memory in multiple threads
  3. to write: acquire a mutex for the file and write to the memory (assume this is done in a way that the readers, which may be reading that memory, are not messed up by it)
  4. to grow: acquire the mutex, increase the size of the file, and use mremap to move the mapping to a new address (resizing the mapping without copying or unnecessary file I/O)
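The steps above could be sketched roughly as follows, assuming Linux (mremap is Linux-specific). The `struct mapped_file` bookkeeping and the function names are hypothetical, and the reader side is not shown.

```c
#define _GNU_SOURCE            /* for mremap() and MREMAP_MAYMOVE */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped file. */
struct mapped_file {
    int fd;
    unsigned char *base;       /* current base address; may change on grow */
    size_t size;
    pthread_mutex_t lock;      /* serialises writes and resizes (steps 3, 4) */
};

/* Step 4: grow the file, then let the kernel resize (and possibly move) the
 * mapping. The data itself is not copied. */
int mapped_file_grow(struct mapped_file *mf, size_t new_size)
{
    int rc = -1;
    pthread_mutex_lock(&mf->lock);
    if (new_size > mf->size && ftruncate(mf->fd, (off_t)new_size) == 0) {
        void *p = mremap(mf->base, mf->size, new_size, MREMAP_MAYMOVE);
        if (p != MAP_FAILED) {
            mf->base = p;      /* pointers derived from the old base are stale */
            mf->size = new_size;
            rc = 0;
        }
    }
    pthread_mutex_unlock(&mf->lock);
    return rc;
}

/* Self-check on a temporary file; returns 0 on success. */
int demo_mremap_grow(void)
{
    char path[] = "/tmp/mremap-grow-XXXXXX";
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct mapped_file mf = { .fd = mkstemp(path), .size = page,
                              .lock = PTHREAD_MUTEX_INITIALIZER };
    if (mf.fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(mf.fd, (off_t)mf.size) != 0 || pwrite(mf.fd, "hello", 5, 0) != 5)
        return -1;

    /* Step 1: private mapping of the file. */
    void *p = mmap(NULL, mf.size, PROT_READ | PROT_WRITE, MAP_PRIVATE, mf.fd, 0);
    if (p == MAP_FAILED)
        return -1;
    mf.base = p;

    if (mapped_file_grow(&mf, 2 * page) != 0)
        return -1;

    /* Offsets stay meaningful across a move; absolute pointers do not. */
    int ok = memcmp(mf.base, "hello", 5) == 0 && mf.base[mf.size - 1] == 0;
    munmap(mf.base, mf.size);
    close(mf.fd);
    return ok ? 0 : -1;
}
```

One thing to keep in mind with step 1: writes made through a MAP_PRIVATE mapping are copy-on-write and are not carried through to the underlying file.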

The crazy part comes in at (4). If you move the memory, the old addresses become invalid, and the readers, which are still reading it, may suddenly hit an access violation. What if we modify the readers to trap this access violation and then restart the operation (i.e. don't re-read the bad address; re-calculate the address from the offset and the new base address returned by mremap)? Yes, I know that's evil, but to my mind the readers can only either successfully read the data at the old address, or fail with an access violation and retry. If sufficient care is taken, that should be safe. Since re-sizing would not happen often, the readers would eventually succeed and not get stuck in a retry loop.
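A rough sketch of what such a trap-and-retry reader might look like, assuming Linux, C11 and sigsetjmp/siglongjmp. Every name here (`published_base`, `cached_base`, `read_byte`, `demo_retry`) is invented for illustration. The move is simulated with two anonymous mappings instead of a real mremap so that the fault is deterministic.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static _Atomic(unsigned char *) published_base;      /* set by the resizer     */
static _Thread_local unsigned char *cached_base;     /* reader's possibly stale copy */
static _Thread_local sigjmp_buf retry_point;
static _Thread_local volatile sig_atomic_t in_guarded_read;
static volatile sig_atomic_t fault_count;

static void on_fault(int sig)
{
    if (in_guarded_read) {               /* fault came from our guarded load */
        fault_count++;
        siglongjmp(retry_point, 1);
    }
    signal(sig, SIG_DFL);                /* not ours: fall back to the default */
    raise(sig);
}

int install_fault_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}

/* Read one byte at a file offset, retrying with the freshly published base if
 * the cached base turns out to be unmapped. This loops forever if the
 * published base is invalid too. */
unsigned char read_byte(size_t offset)
{
    if (sigsetjmp(retry_point, 1) != 0)
        cached_base = atomic_load(&published_base);   /* re-calculate, don't re-read */
    const volatile unsigned char *p = cached_base + offset;
    in_guarded_read = 1;
    unsigned char v = *p;                /* may fault if cached_base is stale */
    in_guarded_read = 0;
    return v;
}

/* Returns 0 if a stale read faults exactly once and then succeeds. */
int demo_retry(void)
{
    size_t len = 4096;
    unsigned char *old_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    unsigned char *new_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old_map == MAP_FAILED || new_map == MAP_FAILED || install_fault_handler() != 0)
        return -1;

    new_map[10] = 42;
    cached_base = old_map;                     /* reader still holds the old base */
    atomic_store(&published_base, new_map);    /* "mremap" moved the data ...     */
    munmap(old_map, len);                      /* ... and the old range is gone   */

    unsigned char v = read_byte(10);
    munmap(new_map, len);
    return (v == 42 && fault_count == 1) ? 0 : -1;
}
```

The sketch only works because nothing else gets mapped into the old range between the munmap and the read.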

A problem could occur if that old address space is re-used while a reader still has a pointer to it. Then there will be no access violation, but the data will be incorrect and the program enters the unicorn and candy filled land of undefined behavior (wherein there are usually neither unicorns nor candy).

But if you controlled allocations completely and could make certain that any allocations that happen during this period do not ever re-use that old address space, then this shouldn't be a problem and the behavior shouldn't be undefined.
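One way this kind of control might be approximated, offered as an assumption rather than something the question proposes, is to cover the vacated range with an inaccessible placeholder mapping. The helper name `fence_off_range` is hypothetical, and MAP_FIXED_NOREPLACE needs Linux 4.17 or later. There is still a window between the move and the placeholder being installed in which another thread could claim the range; the flag lets you detect that, not prevent it.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: occupy [addr, addr + len) with a PROT_NONE mapping so
 * that later mmap()/malloc() calls cannot be placed there and stale reads
 * fault. MAP_FIXED_NOREPLACE refuses to clobber a range that is already
 * mapped. Returns 0 on success, -1 otherwise. */
int fence_off_range(void *addr, size_t len)
{
    void *p = mmap(addr, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED_NOREPLACE,
                   -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (p != addr) {            /* older kernels treat addr as a mere hint */
        munmap(p, len);
        return -1;
    }
    return 0;
}

/* Returns 0 if a vacated range can be fenced off once, and only once. */
int demo_fence_off(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *old = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old == MAP_FAILED)
        return -1;
    munmap(old, len);                      /* stands in for the range mremap vacated */

    if (fence_off_range(old, len) != 0)    /* the range is now reserved */
        return -1;
    int second = fence_off_range(old, len);   /* already occupied, so this must fail */
    munmap(old, len);
    return second == -1 ? 0 : -1;
}
```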

Am I right? Could this work? Is there any advantage to this over using two MAP_SHARED mappings?

Eloff asked Jan 02 '12 17:01


1 Answer

It is hard for me to imagine a case where you don't know an upper bound on how large the file can get. Assuming you do know one, you could "reserve" the address space for the maximum size of the file by passing that size to mmap() when the file is first mapped. Of course, any access beyond the actual size of the file will cause an access violation (on Linux, a SIGBUS for pages that lie entirely past the end of the file), but that's how you want it to work anyway. You could even argue that reserving the extra address space guarantees the access violation, rather than leaving that address range open to being used by other calls to things like mmap() or malloc().

Anyway, the point is that with my solution you never move the address range; you only change how much of it is valid, and your locking is now around the data structure that provides the current valid size to each thread.
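A minimal sketch of that idea, assuming Linux, C11 atomics and a single thread doing the resizing. The `struct growable_map` type and its functions are hypothetical names. Note that it swaps the lock around the valid size for an atomic variable, which is a simplification on my part and not necessarily what the answer has in mind.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct growable_map {
    int fd;
    unsigned char *base;         /* fixed for the lifetime of the mapping */
    size_t reserved;             /* address space reserved up front       */
    _Atomic size_t valid_size;   /* bytes currently backed by the file    */
};

int growable_map_init(struct growable_map *gm, int fd, size_t initial, size_t max)
{
    if (initial > max || ftruncate(fd, (off_t)initial) != 0)
        return -1;
    /* Map the maximum size right away, even though the file is smaller. */
    void *p = mmap(NULL, max, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    gm->fd = fd;
    gm->base = p;
    gm->reserved = max;
    atomic_init(&gm->valid_size, initial);
    return 0;
}

/* Grow the file first, then publish the new size; base never changes. */
int growable_map_grow(struct growable_map *gm, size_t new_size)
{
    if (new_size > gm->reserved || new_size < atomic_load(&gm->valid_size))
        return -1;
    if (ftruncate(gm->fd, (off_t)new_size) != 0)
        return -1;
    atomic_store_explicit(&gm->valid_size, new_size, memory_order_release);
    return 0;
}

/* Readers check the published size instead of relying on a fault. */
int growable_map_read(struct growable_map *gm, size_t offset, unsigned char *out)
{
    if (offset >= atomic_load_explicit(&gm->valid_size, memory_order_acquire))
        return -1;
    *out = gm->base[offset];
    return 0;
}

/* Self-check on a temporary file; returns 0 on success. */
int demo_reserved_mapping(void)
{
    char path[] = "/tmp/reserve-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    struct growable_map gm;
    if (growable_map_init(&gm, fd, 4096, (size_t)64 << 20) != 0)
        return -1;
    unsigned char *base_before = gm.base, v = 0;

    int ok = growable_map_read(&gm, 5000, &v) == -1;   /* not valid yet */
    if (growable_map_grow(&gm, 8192) != 0)
        return -1;
    gm.base[5000] = 7;
    ok = ok && growable_map_read(&gm, 5000, &v) == 0 && v == 7;
    ok = ok && gm.base == base_before;                 /* the mapping never moved */

    munmap(gm.base, gm.reserved);
    close(fd);
    return ok ? 0 : -1;
}
```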

My solution doesn't work if you have so many files that the maximum mapping for each file runs you out of address space, but this is the age of the 64-bit address space so hopefully your maximum mapping size is no problem.

(Just to make sure I wasn't forgetting something stupid, I did write a small program to convince myself creating the larger-than-file-size mapping gives an access violation when you try to access beyond the file size, and then works fine once you ftruncate() the file to be larger, all with the same address returned from the first mmap() call.)
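The program itself isn't shown, so the following is only a guess at what such a check could look like on Linux, not the original code. The names `read_faults` and `check_oversized_mapping` are made up. It probes an address several pages past the end of the file, since a read within the file's last partial page would not fault.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static sigjmp_buf probe_env;

static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);    /* abandon the faulting read */
}

/* Returns 1 if reading *p raises SIGBUS or SIGSEGV, 0 if the read succeeds. */
int read_faults(const volatile unsigned char *p)
{
    if (sigsetjmp(probe_env, 1) != 0)
        return 1;
    unsigned char v = *p;
    (void)v;
    return 0;
}

/* Returns 0 if the read faults before the file is grown and works afterwards,
 * at the same address in both cases. */
int check_oversized_mapping(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, NULL) != 0 || sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    char path[] = "/tmp/oversize-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, (off_t)page) != 0)               /* file: 1 page */
        return -1;
    unsigned char *base = mmap(NULL, 16 * page, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);     /* mapping: 16 pages */
    if (base == MAP_FAILED)
        return -1;

    int before = read_faults(base + 8 * page);         /* well past end of file */
    if (ftruncate(fd, (off_t)(16 * page)) != 0)
        return -1;
    int after = read_faults(base + 8 * page);          /* now backed by the file */

    munmap(base, 16 * page);
    close(fd);
    return (before == 1 && after == 0) ? 0 : -1;
}
```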

andy answered Oct 23 '22 08:10