I'm playing around with the idea of using the virtual memory system to do transparent data conversion (e.g. int to float) for some numeric data processing I've got. The basic idea is that the library I'm writing mmaps the data file you want, and at the same time mmaps an anonymous region of an appropriate size to hold the converted data; this second pointer is returned to the user.
The anonymous region is protected against both reads and writes, so whenever the user accesses a new page through the pointer, it causes a segfault, which I catch; I then transparently convert data from the mmapped file and fix up the permissions so the access can continue. This part of the whole thing works great so far.
However, I sometimes mmap very large files (hundreds of gigabytes), and with the anonymous memory proxying access to them, you pretty quickly start eating swap space as anonymous pages are dropped to disk. My thought was that if I could explicitly clear the dirty bit on the anonymous pages after writing converted data to them, the OS would just drop them and zero-fill on demand if they were re-accessed.
For this to work, though, I think I'd have to both clear the dirty bit and convince the OS to make pages read-protected again when they're dropped, so I can catch the ensuing segfault and reconvert the data on demand. After doing some research I don't think this is possible without kernel hacking, but I thought I'd ask and see if someone who knows more about the virtual memory system knows a way this might be achieved.
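To make the mechanism concrete, here's a stripped-down sketch of the part that already works (simplified; `convert_page()` stands in for the real int-to-float conversion, and error handling is omitted):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *file_base;   /* mmap()ed source file (ints) */
static char  *proxy_base;  /* anonymous PROT_NONE region handed to the user */
static size_t proxy_size;
static size_t page_size;

/* Stand-in for the real conversion: fill one proxy page with floats
 * converted from the corresponding ints in the mmapped file. */
static void convert_page(char *page)
{
    size_t first = (size_t)(page - proxy_base) / sizeof(float);
    const int *src = (const int *)file_base + first;
    float *dst = (float *)page;
    for (size_t i = 0; i < page_size / sizeof(float); i++)
        dst[i] = (float)src[i];
}

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < proxy_base || addr >= proxy_base + proxy_size)
        abort();  /* a genuine crash, not one of our faults */

    char *page = proxy_base + ((size_t)(addr - proxy_base) & ~(page_size - 1));
    /* mprotect() in a signal handler is not async-signal-safe per
     * POSIX, though in practice it works on Linux. */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
    convert_page(page);
    /* Returning retries the faulting instruction, which now succeeds. */
}

static void install_handler(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    page_size = (size_t)sysconf(_SC_PAGESIZE);
}
```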
Here's an idea (completely untested though): for the converted data, `mmap` and `munmap` individual pages as you need them. Since the pages are backed by anonymous memory, they should be discarded when they are unmapped. Linux will coalesce adjacent mappings into a single VMA, so this might have acceptable overhead.
Of course, there needs to be a mechanism to trigger the unmapping. You could maintain an LRU structure and evict an older page when you need to bring a new one in, thus keeping the size of the mapped region constant.
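For what it's worth, a minimal sketch of the eviction (untested like the rest of this; it uses `MAP_FIXED` to replace the page atomically rather than a separate `munmap()` plus `mmap()`, so there is no window where the address range is unmapped):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Drop one converted page: mapping fresh PROT_NONE anonymous memory
 * over it discards the old contents, and the next access faults again
 * so the page can be re-converted on demand. */
static int evict_page(void *page, size_t page_size)
{
    void *p = mmap(page, page_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE,
                   -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}
```

The LRU structure would call `evict_page()` on its oldest entry whenever a new page needs to come in.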
Extending a suggestion I mentioned in your earlier, related question, I think the following (Linux-specific, definitely not portable) scheme should work quite reliably:
Set up a datagram socket pair using `socketpair(AF_UNIX, SOCK_DGRAM, 0, sv)`, and a signal handler for `SIGSEGV`. (You won't need to worry about `SIGBUS`, even if other processes might truncate the data file.)
The signal handler uses `write()` to send the faulting address (`size_t addr = (size_t)siginfo->si_addr;`) to its end of the socket pair. The signal handler then `read()`s one byte back from the same socket (blocking; this is basically just a reliable `sleep()`, so remember to handle `EINTR`), and returns.
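A minimal sketch of these two steps (untested; `fault_sv` and the function names are placeholders of mine):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

static int fault_sv[2];  /* [0]: signal-handler end, [1]: worker end */

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    size_t  addr = (size_t)info->si_addr;
    char    ack;
    ssize_t n;

    /* Hand the faulting address to the worker thread... */
    if (write(fault_sv[0], &addr, sizeof addr) == (ssize_t)sizeof addr) {
        /* ...and block until it acknowledges (a reliable "sleep"). */
        do {
            n = read(fault_sv[0], &ack, 1);
        } while (n == -1 && errno == EINTR);
        if (n == 1)
            return;  /* mapping fixed; the faulting access is retried */
    }

    /* Socket trouble: restore SIG_DFL so the re-raised signal kills us. */
    struct sigaction dfl = {0};
    dfl.sa_handler = SIG_DFL;
    sigaction(SIGSEGV, &dfl, NULL);
}

static int install_fault_handler(void)
{
    struct sigaction sa = {0};
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fault_sv) == -1)
        return -1;
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```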
Note that even if there are multiple threads faulting at or near the same time, there is no race condition. The signals just get reraised until the mappings are fixed.
If there is any kind of problem with the socket communication, you can use `sigaction()` with `.sa_handler = SIG_DFL` to restore the default `SIGSEGV` handler, so that when the same signal is re-raised the entire process dies as normal.
A separate thread reads the other end of the socket pair for addresses faulted with `SIGSEGV`, does all the mapping and file I/O necessary, and finally writes a zero byte to the same end of the socket pair to let the real signal handler know the mapping should be fixed now.
This is basically the "real" signal handler, without the drawbacks of an actual signal handler. Remember, the same thread will keep re-raising the same signal until the mapping is fixed, so any race conditions between the separate thread and `SIGSEGV` signals are irrelevant.
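The corresponding worker loop might look like this; `fix_mapping()` is a hypothetical placeholder for the mapping, conversion, and file I/O work described below:

```c
#include <stddef.h>
#include <unistd.h>

extern int fault_sv[2];          /* from the handler sketch above */
void fix_mapping(size_t addr);   /* hypothetical: maps + converts */

/* The "real" handler: an ordinary thread, free to use any libc
 * function, take mutexes, and do file I/O.
 * Start it with pthread_create(&tid, NULL, fault_worker, NULL). */
static void *fault_worker(void *unused)
{
    (void)unused;
    size_t addr;
    char   ack = 0;
    for (;;) {
        if (read(fault_sv[1], &addr, sizeof addr) != (ssize_t)sizeof addr)
            continue;        /* EINTR or stray datagram: just retry */
        fix_mapping(addr);
        write(fault_sv[1], &ack, 1);  /* unblock the signal handler */
    }
    return NULL;
}
```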
Have one `PROT_NONE`, `MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE` mapping matching the size of the original data file.
To reduce the cost in actual RAM (with `MAP_NORESERVE` the mapping uses neither RAM nor swap, but for gigabytes of data the page table entries themselves require considerable RAM), you could try using `MAP_HUGETLB` too. It would use huge pages, and therefore significantly fewer page table entries, but I am unsure whether there are issues when normal-page-sized holes are eventually punched into the mapping; you'd probably have to use huge pages all the way.
This is the "full" mapping that your "userspace" will use to access the data.
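That is (with `file_size` being the data file size, rounded up to the page or page-group size):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Reserves address space for the whole file, but no RAM or swap; every
 * access faults until converted data is spliced in over the top. */
static void *map_full(size_t file_size)
{
    return mmap(NULL, file_size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}
```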
Have one `PROT_READ` or `PROT_READ | PROT_WRITE`, `MAP_PRIVATE | MAP_ANONYMOUS` mapping for pristine or dirty (respectively) converted data. If your "userspace" almost always modifies the data, you can treat all converted data as "dirty"; otherwise, you can avoid unnecessary write-back of unmodified data by first mapping the converted data `PROT_READ` only. If it then faults on a write, `mprotect()` it `PROT_READ | PROT_WRITE` and mark it dirty (meaning it needs to be converted and saved back to the file). I'll call these two stages "clean" and "dirty" mappings respectively.
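The clean-to-dirty transition is just an `mprotect()` plus bookkeeping; `mark_dirty()` here is a hypothetical helper for whatever structure tracks which groups need saving:

```c
#include <stddef.h>
#include <sys/mman.h>

void mark_dirty(void *group);  /* hypothetical bookkeeping helper */

/* Write fault on a "clean" page group: make it writable in place, and
 * remember it must be converted and saved before being discarded. */
static void make_dirty(char *group, size_t group_size)
{
    mprotect(group, group_size, PROT_READ | PROT_WRITE);
    mark_dirty(group);
}
```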
When the dedicated thread punches a hole into the "full" mapping for "clean" page(s), it first `mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)`s a new memory area of suitable size, `read()`s the data from the data file into it, converts the data, `mprotect()`s it `PROT_READ` if you separate "clean" and "dirty" mappings, and finally `mremap(newly_mapped, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, new_ptr)`s it over the relevant section of the "full" mapping.
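Putting that sequence together (a sketch assuming the converted records are the same size as the raw ones; `convert_records()` is a hypothetical converter, `pread()` replaces `read()` to avoid a separate `lseek()`, and the caller is the dedicated thread, holding the global mutex discussed next):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

void convert_records(void *buf, size_t size);  /* hypothetical converter */

/* Fault in one "clean" page group at full_base + off. */
static int fault_in_clean(int data_fd, char *full_base, off_t off, size_t size)
{
    void *fresh = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (fresh == MAP_FAILED)
        return -1;

    if (pread(data_fd, fresh, size, off) != (ssize_t)size) {
        munmap(fresh, size);
        return -1;
    }
    convert_records(fresh, size);      /* in-place int -> float, say */
    mprotect(fresh, size, PROT_READ);  /* "clean" until written to */

    /* Atomically splice the converted pages over the PROT_NONE hole. */
    if (mremap(fresh, size, size, MREMAP_MAYMOVE | MREMAP_FIXED,
               full_base + off) == MAP_FAILED) {
        munmap(fresh, size);
        return -1;
    }
    return 0;
}
```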
Note that to avoid accidents, you should use a global `pthread_mutex_t` that is held for the duration of these `mremap()`s and around any `mmap()` calls elsewhere, so the kernel cannot hand the punched hole to the wrong thread. (Otherwise, the kernel might place a small mapping requested by another thread into the temporary hole.)
When discarding "clean" page(s), you call `mmap(NULL, length, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0)` to get a new map of suitable length, then grab the global mutex mentioned above, `mremap()` that new map over the "clean" page(s) (the kernel does an implicit `munmap()`), and unlock the mutex.
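A sketch of the "clean" discard, with the mutex held around the `mremap()` as described:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>

extern pthread_mutex_t map_mutex;  /* the global mutex mentioned above */

/* Discard a "clean" page group at full_base + off: mremap()ing a fresh
 * PROT_NONE map over it implicitly munmap()s the converted pages. */
static int discard_clean(char *full_base, off_t off, size_t size)
{
    void *fresh = mmap(NULL, size, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (fresh == MAP_FAILED)
        return -1;

    pthread_mutex_lock(&map_mutex);
    void *r = mremap(fresh, size, size, MREMAP_MAYMOVE | MREMAP_FIXED,
                     full_base + off);
    pthread_mutex_unlock(&map_mutex);
    return r == MAP_FAILED ? -1 : 0;
}
```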
When discarding "dirty" page(s), you call `mmap(NULL, length, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0)` twice, to get two new maps of suitable length. You then grab the global mutex mentioned above and `mremap()` the dirty data over the first of the new mappings (which was basically only used to find a suitable address to move the dirty data to). Then `mremap()` the second of the new mappings to where the dirty data used to reside, and unlock the mutex.
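And the "dirty" case; the first placeholder map only supplies a destination address, the second re-plugs the hole (cleanup on partial failure omitted for brevity):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>

extern pthread_mutex_t map_mutex;  /* the global mutex mentioned above */

/* Pull a "dirty" page group out of the "full" mapping so it can be
 * converted back and saved; leaves a PROT_NONE hole in its place. */
static void *extract_dirty(char *full_base, off_t off, size_t size)
{
    void *dest = mmap(NULL, size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    void *hole = mmap(NULL, size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (dest == MAP_FAILED || hole == MAP_FAILED)
        return NULL;

    pthread_mutex_lock(&map_mutex);
    /* Move the dirty pages out; their contents stay intact. */
    dest = mremap(full_base + off, size, size,
                  MREMAP_MAYMOVE | MREMAP_FIXED, dest);
    /* Re-plug the hole so other threads fault (and re-read) instead
     * of crashing on an unmapped address. */
    mremap(hole, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, full_base + off);
    pthread_mutex_unlock(&map_mutex);

    return dest;  /* caller converts, write()s, then munmap()s this */
}
```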
Using a separate thread to handle the fault conditions avoids all async-signal-safe function problems: `read()`, `write()`, and `sigaction()` are all async-signal-safe.
You only need one global `pthread_mutex_t` to avoid the case where the kernel hands the recently vacated hole (the area just `mremap()`ped away) to another thread; you can also use it to protect your internal data structures (a pointer chain, if you support multiple concurrent file mappings).
There should be no race conditions (other than when other threads use `mmap()` or `mremap()`, which is handled by the mutex mentioned above). When a "dirty" page or page group is moved away, it becomes inaccessible to other threads before it is converted and saved, and even perfectly concurrent access by another thread is handled correctly: the page will simply be re-read from the file and re-converted. (If that occurs often, you might wish to cache recently saved page groups.)
I do recommend using large page groups, say 2 MiB or more, instead of single pages, to reduce the overhead. The optimal size depends on your application's access patterns, but the huge page size (if supported by your architecture) is a very good starting point.
If your data structures do not align to pages or page groups, you should cache the fully converted first and last records (the ones only partially within the page or page group). It usually makes the conversion back to the storage format much easier.
If you know or can detect typical access patterns within the file, you should probably use `posix_fadvise()` to tell the kernel; `POSIX_FADV_WILLNEED` and `POSIX_FADV_DONTNEED` are the most useful. It helps the kernel avoid keeping unnecessary pages of the actual data file in the page cache.
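For example, once a page group has been read and converted, the raw file pages can be dropped from the page cache:

```c
#include <fcntl.h>
#include <sys/types.h>

/* The raw bytes won't be needed again until this group is re-faulted. */
static void drop_raw_cache(int data_fd, off_t off, off_t len)
{
    posix_fadvise(data_fd, off, len, POSIX_FADV_DONTNEED);
}
```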
Finally, you might consider adding a second special thread for converting and writing dirty records back to disk asynchronously. If you take care that the two threads don't get confused when the first thread wants to re-read a record still being written to disk by the second, there should be no other issues there either; asynchronous writing is likely to increase your throughput with most access patterns, unless you are I/O-bound anyway, or really short on RAM (relatively speaking).
Why use `read()` and `write()` instead of another memory map? Because of the in-kernel overhead of the virtual memory structures each mapping needs.