Consider a program which uses a large number of roughly page-sized memory regions (say 64 kB or so), each of which is rather short-lived. (In my particular case, these are alternate stacks for green threads.)
How would one best allocate these regions, such that their pages can be returned to the kernel once the region isn't in use anymore? The naïve solution would clearly be to simply mmap each of the regions individually, and munmap them again as soon as I'm done with them. I feel this is a bad idea, though, since there are so many of them. I suspect that the VMM may start scaling badly after a while; but even if it doesn't, I'm still interested in the theoretical case.
If I instead just mmap myself a huge anonymous mapping from which I allocate the regions on demand, is there a way to "punch holes" through that mapping for a region that I'm done with? Kind of like madvise(MADV_DONTNEED), but with the difference that the pages should be considered deleted, so that the kernel doesn't actually need to keep their contents anywhere but can just reuse zeroed pages whenever they are faulted again.
I'm using Linux, and in this case I'm not bothered by using Linux-specific calls.
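For reference, here's a minimal sketch of the naïve per-region approach I'm describing; the 64 kB size and the function names are just illustrative:

#include <sys/mman.h>
#include <stddef.h>

#define REGION_SIZE (64 * 1024)   /* roughly page-sized regions, 64 kB here */

/* Allocate one region with its own anonymous mapping. */
static void *region_alloc(void)
{
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Return the region's pages to the kernel; this also destroys the VMA. */
static void region_free(void *p)
{
    munmap(p, REGION_SIZE);
}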
I did a lot of research into this topic (for a different use) at some point. In my case I needed a large hashmap that was very sparsely populated + the ability to zero it every now and then.
mmap solution:
The easiest way to zero a mapping like this, and a portable one (madvise(MADV_DONTNEED) is Linux-specific), is to mmap a new mapping over it.
/* Reserve a large anonymous mapping. */
void *mapping = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
/* ... use the mapping ... */
/* Zero certain pages by mapping fresh anonymous memory over them. */
mmap((char *)mapping + page_aligned_offset, length, PROT_READ | PROT_WRITE,
     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
The last call is performance-wise equivalent to a subsequent munmap plus mmap with MAP_FIXED, but it is thread-safe.
Performance-wise, the problem with this solution is that the pages have to be faulted in again on a subsequent write access, which triggers a page fault and a context switch. This is only efficient if very few pages were faulted in in the first place.
memset solution:
After getting such crap performance when most of the mapping had to be unmapped, I decided to zero the memory manually with memset instead. If roughly over 70% of the pages are already faulted in (and if they are not, they are after the first round of memset), then this is faster than remapping those pages.
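A minimal sketch of that, assuming mapping and mapping_size refer to the big anonymous mapping from above:

#include <string.h>

/* Zero the whole mapping in place. Writing every byte faults in any page
 * that is not resident yet, so this only pays off when most pages are
 * already faulted in. */
static void wipe_with_memset(void *mapping, size_t mapping_size)
{
    memset(mapping, 0, mapping_size);
}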
mincore solution:
My next idea was to memset only those pages that had been faulted in before. This solution is NOT thread-safe. Calling mincore to determine whether a page is faulted in and then selectively memsetting those pages to zero was a significant performance improvement until over 50% of the mapping was faulted in, at which point memsetting the entire mapping became simpler (mincore is a system call and requires a context switch).
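Roughly what that looked like, as a sketch (not thread-safe; mapping and npages are assumed from the setup above, and the function name is illustrative):

#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

/* Zero only the pages that are currently resident. A page faulted in by
 * another thread between mincore() and memset() would be missed, which is
 * why this is not thread-safe. */
static void wipe_resident_pages(char *mapping, size_t npages)
{
    long ps = sysconf(_SC_PAGESIZE);
    unsigned char *vec = malloc(npages);            /* one byte per page */
    size_t i;

    if (vec == NULL || mincore(mapping, npages * ps, vec) == -1) {
        free(vec);
        memset(mapping, 0, npages * ps);            /* fall back: wipe everything */
        return;
    }
    for (i = 0; i < npages; i++)
        if (vec[i] & 1)                             /* low bit set: page is resident */
            memset(mapping + i * ps, 0, ps);
    free(vec);
}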
incore table solution:
My final approach, which I then took, was having my own in-core table (one bit per page) that says whether the page has been used since the last wipe. This is the most efficient way, since in each round you only zero the pages you actually used. It obviously is not thread-safe either and requires you to track in user space which pages have been written to, but if you need this level of performance, it is by far the most efficient approach.
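A sketch of the idea (single-threaded; the page size, table size and helper names are all illustrative):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define NPAGES    1024

static uint8_t used[(NPAGES + 7) / 8];   /* one bit per page, set = written since last wipe */

/* Call this on every write path that dirties a page. */
static void mark_page_used(size_t page)
{
    used[page / 8] |= (uint8_t)(1u << (page % 8));
}

/* Zero only the pages marked used, then reset the table for the next round. */
static void wipe_used_pages(char *mapping)
{
    size_t i;

    for (i = 0; i < NPAGES; i++)
        if (used[i / 8] & (1u << (i % 8)))
            memset(mapping + i * PAGE_SIZE, 0, PAGE_SIZE);
    memset(used, 0, sizeof(used));
}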
I don't see why doing lots of calls to mmap/munmap should be that bad. The lookup performance in the kernel for mappings should be O(log n).
Your only option, as it seems to be implemented in Linux right now, for punching holes in the mappings to do what you want is mprotect(PROT_NONE), and that still fragments the mappings in the kernel, so it's mostly equivalent to mmap/munmap except that something else won't be able to steal that VM range from you. You'd probably want madvise(MADV_REMOVE) to work, or, as it's called in BSD, madvise(MADV_FREE). That is explicitly designed to do exactly what you want: the cheapest way to reclaim pages without fragmenting the mappings. But at least according to the man page on my two flavors of Linux, it's not fully implemented for all kinds of mappings.
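As a sketch of what that could look like: try MADV_REMOVE and fall back to MADV_DONTNEED when the kernel reports it isn't supported for this kind of mapping (historically it only worked on tmpfs/shmem-backed mappings and fails with EOPNOTSUPP otherwise). The function name is just illustrative:

#include <sys/mman.h>
#include <stddef.h>
#include <errno.h>

/* Try to punch a hole in [region, region + len); fall back to
 * MADV_DONTNEED if MADV_REMOVE isn't supported for this mapping. */
static int punch_hole(void *region, size_t len)
{
    if (madvise(region, len, MADV_REMOVE) == 0)
        return 0;
    if (errno == EOPNOTSUPP)
        return madvise(region, len, MADV_DONTNEED);
    return -1;
}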
Disclaimer: I'm mostly familiar with the internals of BSD VM systems, but this should be quite similar on Linux.
As per the discussion in the comments below, surprisingly enough MADV_DONTNEED seems to do the trick:
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>
#include <err.h>
int
main(int argc, char **argv)
{
    int ps = getpagesize();
    struct rusage ru = {0};
    char *map;
    int n = 15;
    int i;

    if ((map = mmap(NULL, ps * n, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)) == MAP_FAILED)
        err(1, "mmap");

    for (i = 0; i < n; i++) {
        map[ps * i] = i + 10;
    }
    printf("unnecessary printf to fault stuff in: %d %ld\n", map[0], ru.ru_minflt);

    /* Unnecessary call to madvise to fault in that part of libc. */
    if (madvise(&map[ps], ps, MADV_NORMAL) == -1)
        err(1, "madvise");
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        err(1, "getrusage");
    printf("after MADV_NORMAL, before touching pages: %d %ld\n", map[0], ru.ru_minflt);

    for (i = 0; i < n; i++) {
        map[ps * i] = i + 10;
    }
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        err(1, "getrusage");
    printf("after MADV_NORMAL, after touching pages: %d %ld\n", map[0], ru.ru_minflt);

    if (madvise(map, ps * n, MADV_DONTNEED) == -1)
        err(1, "madvise");
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        err(1, "getrusage");
    printf("after MADV_DONTNEED, before touching pages: %d %ld\n", map[0], ru.ru_minflt);

    for (i = 0; i < n; i++) {
        map[ps * i] = i + 10;
    }
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        err(1, "getrusage");
    printf("after MADV_DONTNEED, after touching pages: %d %ld\n", map[0], ru.ru_minflt);

    return 0;
}
I'm measuring ru_minflt as a proxy to see how many pages we needed to allocate (this isn't exactly true, but the next sentence makes it more likely). We can see that we get new pages in the third printf because the contents of map[0] are 0.