Can I ask the kernel to populate (fault in) a range of anonymous pages?

In Linux, using C, if I ask for a large amount of memory via malloc or a similar dynamic allocation mechanism, it is likely that most of the pages backing the returned region won't actually be mapped into the address space of my process.

Instead, a page fault is incurred each time I access one of the allocated pages for the first time; the kernel then maps in an "anonymous" page (consisting entirely of zeros) and returns to user space.

For a large region (say 1 GiB) this is a large number of page faults (~262,000 for 4 KiB pages), and each fault incurs a user-to-kernel-to-user transition, which is especially slow on kernels with Spectre and Meltdown mitigations. For some uses, this page-faulting time might dominate the actual work being done on the buffer.

If I know I'm going to use the entire buffer, is there some way to ask the kernel to fault in the whole region ahead of time?

If I were allocating the memory myself using mmap, the way to do this would be MAP_POPULATE - but that doesn't work for regions received from malloc or new.

There is the madvise call, but the options there seem mostly to apply to file-backed regions. For example, the madvise(..., MADV_WILLNEED) call seems promising - from the man page:

MADV_WILLNEED

Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)

The obvious implication is if the region is file-backed, this call might trigger an asynchronous file read-ahead, or perhaps a synchronous additional read-ahead on subsequent faults. From the description, it isn't clear if it will do anything for anonymous pages, and based on my testing, it doesn't.

BeeOnRope asked Oct 27 '25 05:10

1 Answer

It's a bit of a dirty hack, and works best for privileged processes or on systems with a high RLIMIT_MEMLOCK, but... an mlock and munlock pair will achieve the effect you are looking for.

For example, given the following test program:

# compile with, e.g.: cc -O1 -Wall pagefaults.c -o pagefaults

#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>

#define DEFAULT_SIZE        (40 * 1024 * 1024)
#define PG_SIZE     4096

void failcheck(int ret, const char* what) {
    if (ret) {
        err(EXIT_FAILURE, "%s failed", what);
    } else {
        printf("%s OK\n", what);
    }
}

int main(int argc, char **argv) {
    size_t size = (argc == 2 ? atol(argv[1]) : DEFAULT_SIZE);
    char *mem = malloc(size);
    if (!mem) {
        err(EXIT_FAILURE, "malloc");
    }

    if (getenv("DO_MADVISE")) {
        failcheck(madvise(mem, size, MADV_WILLNEED), "madvise");
    }

    if (getenv("DO_MLOCK")) {
        failcheck(mlock(mem, size), "mlock");
        failcheck(munlock(mem, size), "munlock");
    }

    for (volatile char *p = mem; p < mem + size; p += PG_SIZE) {
        *p = 'z';
    }
    printf("size: %6.2f MiB, pages touched: %zu\npointer value : %p\n",
            size / 1024. / 1024., size / PG_SIZE, mem);
}

Running it as root for a 1 GB region and counting pagefaults with perf results in:

$ perf stat ./pagefaults 1000000000
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f2fc2584010

 Performance counter stats for './pagefaults 1000000000':

        352.474676      task-clock (msec)         #    0.999 CPUs utilized          
                 2      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           244,189      page-faults               #    0.693 M/sec                  
       914,276,474      cycles                    #    2.594 GHz                    
       703,359,688      instructions              #    0.77  insn per cycle         
       117,710,381      branches                  #  333.954 M/sec                  
           447,022      branch-misses             #    0.38% of all branches        

       0.352814087 seconds time elapsed

However, if you run prefixed with DO_MLOCK=1, you get:

$ sudo DO_MLOCK=1 perf stat ./pagefaults 1000000000
mlock OK
munlock OK
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f8047f6b010

 Performance counter stats for './pagefaults 1000000000':

        240.236189      task-clock (msec)         #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                49      page-faults               #    0.204 K/sec                  
       623,152,764      cycles                    #    2.594 GHz                    
       959,640,219      instructions              #    1.54  insn per cycle         
       150,713,144      branches                  #  627.354 M/sec                  
           484,400      branch-misses             #    0.32% of all branches        

       0.240538327 seconds time elapsed

Note that the number of page faults has dropped from 244,189 to 49, and there is a 1.46x speedup. The overwhelming majority of the time is still spent in the kernel, so this could probably be a lot faster if it weren't necessary to invoke both mlock and munlock, and possibly also because the semantics of mlock are stronger than what is required here.

For non-privileged processes, you'll probably hit the RLIMIT_MEMLOCK limit if you try to do a large region all at once (on my Ubuntu system it's set at 64 KiB), but you can loop over the region, calling mlock(); munlock() on a chunk at a time.

BeeOnRope answered Oct 29 '25 21:10


