Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to zero pages in Linux

Tags:

c

linux

I need to clear large address ranges (order of 200 pages at a time) in Linux. There are two approaches I tried -

  1. Use memset - The simplest way to clear the address range. Performed a bit slower than method 2.

  2. Use munmap/mmap - I called munmap on the address range then mmap'd the same address again with the same permissions. Since MAP_ANONYMOUS is passed, the pages are cleared.

The second method makes a benchmark run 5-10% faster. The benchmark ofcourse does a lot more than just clearing the pages. If I understand correctly, this is because the operating system has a pool of zero'd pages which it maps to the address range.

But I don't like this way because the munmap and mmap is not atomic. In the sense that another mmap (with NULL as first argument) done simultaneously could render my address range un-usable.

So my question is does Linux provide a system call that can swap out the physical pages for an address range with zero-pages?

I tried to look at the source of glibc (specifically memset) to see if they use any technique to do this efficiently. But I couldn't find anything.

like image 264
Ajay Brahmakshatriya Avatar asked Apr 18 '18 09:04

Ajay Brahmakshatriya


Video Answer


3 Answers

memset() appears to be about an order of magnitude faster than mmap() to get a new zero-filled page, at least on the Solaris 11 server I have access to right now. I strongly suspect that Linux will produce similar results.

I wrote a small benchmark program:

#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <strings.h>

#include <sys/time.h>

#define NUM_BLOCKS ( 512 * 1024 )
#define BLOCKSIZE ( 4 * 1024 )

int main( int argc, char **argv )
{
    int ii;

    char *blocks[ NUM_BLOCKS ];

    hrtime_t start = gethrtime();

    for ( ii = 0; ii < NUM_BLOCKS; ii++ )
    {
        blocks[ ii ] = mmap( NULL, BLOCKSIZE,
            PROT_READ | PROT_WRITE,
            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
        // force the creation of the mapping
        blocks[ ii ][ ii % BLOCKSIZE ] = ii;
    }

    printf( "setup time:    %lf sec\n",
        ( gethrtime() - start ) / 1000000000.0 );

    for ( int jj = 0; jj < 4; jj++ )
    {
        start = gethrtime();

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            blocks[ ii ] = mmap( blocks[ ii ],
                BLOCKSIZE, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
            blocks[ ii ][ ii % BLOCKSIZE ] = 0;
        }

        printf( "mmap() time:   %lf sec\n",
            ( gethrtime() - start ) / 1000000000.0 );
        start = gethrtime();

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            memset( blocks[ ii ], 0, BLOCKSIZE );
        }

        printf( "memset() time: %lf sec\n",
            ( gethrtime() - start ) / 1000000000.0 );
    }

    return( 0 );
}

Note that writing a single byte anywhere in the page is all that's needed to force the creation of the physical page.

I ran it on my Solaris 11 file server (the only POSIX-style system I have running on bare metal right now). I didn't test madvise() on my Solaris system because Solaris, unlike Linux, doesn't guarantee that the mapping will be repopulated with zero-filled pages, only that "the system starts to free the resources".

The results:

setup time:    11.144852 sec
mmap() time:   15.159650 sec
memset() time: 1.817739 sec
mmap() time:   15.029283 sec
memset() time: 1.788925 sec
mmap() time:   15.083473 sec
memset() time: 1.780283 sec
mmap() time:   15.201085 sec
memset() time: 1.771827 sec

memset() is almost an order of magnitude faster. When I get a chance, I'll rerun that benchmark on Linux, but it'll likely have to be on a VM (AWS etc.)

That's not surprising - mmap() is expensive, and the kernel still needs to zero the pages at some time.

Interestingly, commenting out one line

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            blocks[ ii ] = mmap( blocks[ ii ],
                BLOCKSIZE, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
            //blocks[ ii ][ ii % BLOCKSIZE ] = 0;
        }

produces these results:

setup time:    10.962788 sec
mmap() time:   7.524939 sec
memset() time: 10.418480 sec
mmap() time:   7.512086 sec
memset() time: 10.406675 sec
mmap() time:   7.457512 sec
memset() time: 10.296231 sec
mmap() time:   7.420942 sec
memset() time: 10.414861 sec

The burden of forcing the creation of the physical mapping has shifted to the memset() call, leaving only the implicit munmap() in the test loops, where the mappings are destroyed when the MAP_FIXED mmap() call replaces them. Note that the just the munmap() takes about 3-4 times longer than keeping the pages in the address space and memset()'ing them to zeros.

The cost of mmap() isn't really the mmap()/munmap() system call itself, it's that the new page requires a lot of behind-the-scenes CPU cycles to create the actual physical mapping, and that doesn't happen in the mmap() system call itself - it happens afterwards, when the process accesses the memory page.

If you doubt the results, note this LMKL post from Linus Torvalds himself:

...

HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvment.

Downsides to mmap:

  • quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff.
  • ...

Profiling the code using Solaris Studio's collect and analyzer tools produced the following output:

Source File: mm.c

Inclusive        Inclusive        Inclusive         
Total CPU Time   Sync Wait Time   Sync Wait Count   Name
sec.             sec.                               
                                                      1. #include <stdio.h>
                                                      2. #include <sys/mman.h>
                                                      3. #include <string.h>
                                                      4. #include <strings.h>
                                                      5. 
                                                      6. #include <sys/time.h>
                                                      7. 
                                                      8. #define NUM_BLOCKS ( 512 * 1024 )
                                                      9. #define BLOCKSIZE ( 4 * 1024 )
                                                     10. 
                                                     11. int main( int argc, char **argv )
                                                         <Function: main>
 0.011           0.               0                  12. {
                                                     13.     int ii;
                                                     14. 
                                                     15.     char *blocks[ NUM_BLOCKS ];
                                                     16. 
 0.              0.               0                  17.     hrtime_t start = gethrtime();
                                                     18. 
 0.129           0.               0                  19.     for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     20.     {
                                                     21.         blocks[ ii ] = mmap( NULL, BLOCKSIZE,
                                                     22.             PROT_READ | PROT_WRITE,
 3.874           0.               0                  23.             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
                                                     24.         // force the creation of the mapping
 7.928           0.               0                  25.         blocks[ ii ][ ii % BLOCKSIZE ] = ii;
                                                     26.     }
                                                     27. 
                                                     28.     printf( "setup time:    %lf sec\n",
 0.              0.               0                  29.         ( gethrtime() - start ) / 1000000000.0 );
                                                     30. 
 0.              0.               0                  31.     for ( int jj = 0; jj < 4; jj++ )
                                                     32.     {
 0.              0.               0                  33.         start = gethrtime();
                                                     34. 
 0.560           0.               0                  35.         for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     36.         {
                                                     37.             blocks[ ii ] = mmap( blocks[ ii ],
                                                     38.                 BLOCKSIZE, PROT_READ | PROT_WRITE,
33.432           0.               0                  39.                 MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
29.535           0.               0                  40.             blocks[ ii ][ ii % BLOCKSIZE ] = 0;
                                                     41.         }
                                                     42. 
                                                     43.         printf( "mmap() time:   %lf sec\n",
 0.              0.               0                  44.             ( gethrtime() - start ) / 1000000000.0 );
 0.              0.               0                  45.         start = gethrtime();
                                                     46. 
 0.101           0.               0                  47.         for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     48.         {
 7.362           0.               0                  49.             memset( blocks[ ii ], 0, BLOCKSIZE );
                                                     50.         }
                                                     51. 
                                                     52.         printf( "memset() time: %lf sec\n",
 0.              0.               0                  53.             ( gethrtime() - start ) / 1000000000.0 );
                                                     54.     }
                                                     55. 
 0.              0.               0                  56.     return( 0 );
 0.              0.               0                  57. }

                                                    Compile flags:  /opt/SUNWspro/bin/cc -g -m64  mm.c -W0,-xp.XAAjaAFbs71a00k.

Note the large amount of time spent in mmap(), and also in the setting of a single byte in each newly-mapped page.

This is an overview from the analyzer tool. Note the large amount of system time:

Profile overview

The large amount of system time consumed is the time taken to map and unmap the physical pages.

This timeline shows when all that time was consumed:

enter image description here

The light green is system time - that's all in the mmap() loops. You can see that switch over to dark-green user time when the memset() loops run. I've highlighted one of those instances so you can see what's going on at that time.

Updated results from a Linux VM:

setup time:    2.567396 sec
mmap() time:   2.971756 sec
memset() time: 0.654947 sec
mmap() time:   3.149629 sec
memset() time: 0.658858 sec
mmap() time:   2.800389 sec
memset() time: 0.647367 sec
mmap() time:   2.915774 sec
memset() time: 0.646539 sec

This tracks exactly with what I stated in my comment yesterday: FWIW, a quick test I ran showed that a simple, single-threaded call to memset() is somewhere between five and ten times faster than redoing mmap()

I simply do not understand this fascination with mmap(). mmap() is one hugely expensive call, and it's a forced single-threaded operation - there's only one set of physical memory on the machine. mmap() is not only S-L-O-W, it impacts both the entire process address space and the VM system on the entire host.

Using any form of mmap() just to zero out memory pages is counterproductive. First, the pages don't get zeroed for free - something has to memset() them to clear them. It just doesn't make any sense to add tearing down and recreating a memory mapping to that memset() just to clear a page of RAM.

memset() also has the advantage that more than one thread can be clearing memory at any one time. Making changes to memory mappings is a single-threaded process.

like image 91
Andrew Henle Avatar answered Sep 20 '22 00:09

Andrew Henle


madvise(..., MADV_DOTNEED) should be equivalent to munmap/mmap on anonymous mappings on Linux. It's a bit weird because that's not how I understand what the semantics of "don't need" should be, but it does throw away the page(s) on Linux.

$ cat > foo.c
#include <sys/types.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int *foo = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    *foo = 42;
    printf("%d\n", *foo);
    madvise(foo, getpagesize(), MADV_DONTNEED);
    printf("%d\n", *foo);
    return 0;
}
$ cc -o foo foo.c && ./foo
42
0
$ uname -sr
Linux 3.10.0-693.11.6.el7.x86_64

MADV_DONTNEED does not do that on other operating systems so this is definitely not portable. For example:

$ cc -o foo foo.c && ./foo
42
42
$ uname -sr
Darwin 17.5.0

But, you don't need to unmap, you can just overwrite the mapping. As a bonus this is much more portable:

$ cat foo.c
#include <sys/types.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int *foo = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    *foo = 42;
    printf("%d\n", *foo);
    mmap(foo, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
    printf("%d\n", *foo);
    return 0;
}
$ cc -o foo foo.c && ./foo
42
0
$

Also, I'm not actually sure if you benchmarked things properly. Creating and dropping mappings can be quite expensive and I don't think idle zeroing would help that much. Newly mmap:ed pages are not actually mapped until they are used for the first time and on Linux this means written and not read because Linux does silly things with copy-on-write zero pages if the first access to a page is a read instead of a write. So unless you benchmark writes to the newly mmap:ed pages I suspect that neither your previous solution, nor the ones I suggested here will actually be faster than just a dumb memset.

like image 41
Art Avatar answered Sep 22 '22 00:09

Art


Note: this is not an answer,I just needed the formatting feature.


BTW: it is possible that the /dev/zero all-zeros page doesn't even exist, and that the .read() method is implemented as follows (a similar thing happens for dev/null, which just returns the length argument):


struct file_operations {
        ...
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ...
        };

static ssize_t zero_read (struct file *not_needed_here, char __user * buff, size_t len, loff_t * ignored)
{
memset (buff, 0, len);
return len;
}
like image 25
wildplasser Avatar answered Sep 20 '22 00:09

wildplasser