
Why is writing to a buffer filled with 42 way faster than writing to a buffer of zeros?

I would expect writes to a char * buffer to take the same time regardless of the existing contents of the memory¹. Wouldn't you?

However, while narrowing down an inconsistency in a benchmark, I came across a case where this apparently isn't true. A buffer which contains all-zeros behaves much differently, performance-wise, from a buffer filled with 42.

Graphically, this looks like (details below):

[Figure: buffer write time per iteration, zero-filled vs. 42-filled buffer]

Here's the code I used to produce the above³:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <string.h>
#include <time.h>

#ifndef BUF_SIZE
#define BUF_SIZE (1UL << 30)  /* 1 GiB; can be overridden with -DBUF_SIZE=... */
#endif

volatile char *sink;

void process(char *buf, size_t len) {
  clock_t start = clock();
  for (size_t i = 0; i < len; i += 678)
    buf[i] = 'z';
  printf("Processing took %lu μs\n",
      1000000UL * (clock() - start) / CLOCKS_PER_SEC);
  sink = buf;
}

int main(int argc, char** argv) {
  int memset42 = argc > 1 && !strcmp(argv[1], "42");
  for (int i=0; i < 5; i++) {
    char *buf = (char *)malloc(BUF_SIZE);
    if (memset42)
      memset(buf, 42, BUF_SIZE);
    else
      memset(buf,  0, BUF_SIZE);
    process(buf, BUF_SIZE);
  }
  return EXIT_SUCCESS;
}

I compile it on my Linux box like:

 gcc -O2 buffer_weirdness.cpp -o buffer_weirdness

... and when I run the version with a zero buffer, I get:

./buffer_weirdness zero
Processing took   12952 μs
Processing took  403522 μs
Processing took  626859 μs
Processing took  626965 μs
Processing took  627109 μs

Note the first iteration is quick, while the remaining iterations take perhaps 50 times longer.

When the buffer is first filled with 42, processing is always fast:

./buffer_weirdness 42
Processing took   12892 μs
Processing took   13500 μs
Processing took   13482 μs
Processing took   12965 μs
Processing took   13121 μs

The behavior depends on `BUF_SIZE` (1 GB in the example above) - larger sizes are more likely to show the issue - and also on the current host state. If I leave the host alone for a while, the slow iterations take perhaps 60,000 μs rather than 600,000 - so 10x faster, but still ~5x slower than the fast processing time. Eventually the times return to the fully slow behavior.

The behavior also depends at least partly on transparent hugepages: if I disable them², the performance on the slow iterations improves by a factor of about 3x, while the fast iterations are unchanged.
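
(As an aside, THP can also be ruled in or out for just this allocation instead of system-wide, via madvise. The sketch below is untested by me as a benchmark variant - MADV_NOHUGEPAGE is Linux-specific, the advice needs a page-aligned range, and depending on the libc you may need _DEFAULT_SOURCE or _GNU_SOURCE - but it avoids touching /sys:)

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate `size` bytes page-aligned and advise the kernel not to back
   this particular range with transparent huge pages. */
char *alloc_no_thp(size_t size) {
  void *buf = NULL;
  long page = sysconf(_SC_PAGESIZE);
  if (posix_memalign(&buf, (size_t)page, size) != 0)
    return NULL;
  madvise(buf, size, MADV_NOHUGEPAGE);  /* Linux-specific advice flag */
  return buf;
}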

A final note: the total runtime of the whole program is much closer across the variants than the process timings alone would suggest (in fact, the zero-filled, THP-off version is about 2x faster overall than the others, which are roughly the same).

What's going on here?


¹ Outside of some very unusual optimization such as the compiler understanding what value the buffer already contained and eliding writes of the same value, which isn't happening here.

² sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"

³ It's a distilled version of the original benchmark. Yes, I'm leaking the allocations, get over it - it leads to a more concise example. The original example didn't leak. In fact, when you don't leak the allocations, the behavior changes: probably because malloc can just re-use the region for the next allocation, rather than asking the OS for more memory.

asked Jan 06 '17 by BeeOnRope


1 Answer

This seems hard to reproduce, so it's probably compiler/libc specific.

My best guess here:

When you call malloc, memory gets mapped into your process's address space. That does not mean the OS has already taken the necessary pages from its pool of free memory; it has merely added entries to some bookkeeping tables.

Now, when you try to access that memory, your CPU/MMU raises a fault – and the OS catches it and checks whether the address belongs to the category "already in the address space, but not yet actually backed by physical memory". If that is the case, a free page is found and mapped into your process's memory space.
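
You can watch this lazy mapping happen through the minor page fault counter. Here is a small sketch of mine (illustrative only, not the asker's benchmark; the byte reads are there so the fills can't be optimized away):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static long minor_faults(void) {
  struct rusage ru;
  getrusage(RUSAGE_SELF, &ru);
  return ru.ru_minflt;
}

int main(void) {
  const size_t len = 1UL << 30;   /* 1 GiB */
  char *buf = malloc(len);
  if (!buf) return 1;
  int check = 0;

  long before = minor_faults();
  memset(buf, 42, len);           /* first touch: the kernel faults pages in */
  check += buf[12345];
  printf("first fill:  %ld minor faults\n", minor_faults() - before);

  before = minor_faults();
  memset(buf, 0, len);            /* pages already mapped: hardly any faults */
  check += buf[12345];
  printf("second fill: %ld minor faults\n", minor_faults() - before);

  printf("check byte sum: %d\n", check);
  free(buf);
  return 0;
}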

Now, modern OSes often have a built-in option to "zero out" pages prior to (re-)use. If that happens, the memset(,0,) operation becomes unnecessary. In the case of POSIX systems, if you use calloc instead of malloc, the memory gets zeroed out for you.
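
The "pages come pre-zeroed from the kernel" part can be seen in isolation with an anonymous mmap - again just a sketch of mine (Linux; MAP_ANONYMOUS):

#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

int main(void) {
  size_t len = 1UL << 20;    /* 1 MiB */
  unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  assert(p != MAP_FAILED);
  for (size_t i = 0; i < len; i++)
    assert(p[i] == 0);       /* anonymous pages arrive already zeroed */
  munmap(p, len);
  return 0;
}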

In other words, your compiler might have noticed that and completely omitted the memset(,0,) when your OS supports that. That means that the moment you write to the pages in process() is the first moment they get accessed – and that triggers the "on-the-fly page mapping" mechanism of your OS.

The memset(,42,) can of course not be optimized away, so in that case the pages really do get pre-allocated, and you don't see that time spent in the process() function.
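
A quick way to test this hypothesis (a sketch; force_fill is just a name I made up): do the fill through a volatile pointer, which the compiler can neither elide nor fold into calloc. If the explanation above is right, the pages then get touched up front and process() should be fast for the zero case as well.

/* Fill the buffer through a volatile pointer so the compiler has to
   perform every store, regardless of the fill value. Drop this into the
   original program in place of the memset calls. */
static void force_fill(char *buf, size_t len, char value) {
  volatile char *p = buf;
  for (size_t i = 0; i < len; i++)
    p[i] = value;
}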

You should use /usr/bin/time to actually compare the whole execution time with the time spent in process – my suspicion implies that the time saved in process is actually spent in main, possibly within kernel context.
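
For example, GNU time's -v flag also reports the major/minor page fault counts, which should show where the faults end up:

/usr/bin/time -v ./buffer_weirdness zero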

UPDATE: Tested with the excellent Godbolt Compiler Explorer: yes, with -O2 and -O3, modern gcc simply omits the zero memset (or, rather, fuses it into calloc, which is malloc with zeroing):

#include <cstdlib>
#include <cstring>
int main(int argc, char ** argv) {
  char *p = (char*)malloc(10000);
  if(argc>2) {
    memset(p,42,10000);
  } else {
    memset(p,0,10000);
  }
  // read a byte so the compiler does not remove the whole function body,
  // which otherwise has no effect at all
  return (int)p[190];
}

Becomes, for x86_64 on gcc 6.3:

main:
        // store frame state
        push    rbx
        mov     esi, 1
        // put argc in ebx
        mov     ebx, edi
        // Setting up call to calloc (== malloc with internal zeroing)
        mov     edi, 10000
        call    calloc 
        // ebx (==argc) compared to 2 ?
        cmp     ebx, 2
        mov     rcx, rax
        // jump on less/equal to .L2
        jle     .L2
        // if(argc > 2):
        // set up call to memset
        mov     edx, 10000
        mov     esi, 42
        mov     rdi, rax
        call    memset
        mov     rcx, rax
.L2:    //else case
        //notice the distinct lack of memset here!
        // move the value at position rcx (==p)+190 into the "return" register
        movsx   eax, BYTE PTR [rcx+190]
        //restore frame
        pop     rbx
        //return
        ret

By the way, if you replace the return (int)p[190] with a plain return 0,

  }
  return 0;
}

then there's no reason for the compiler to retain the function body at all – its return value is easily determinable at compile time, and it has no side effect. The whole program then compiles to

main:
        xor     eax, eax
        ret

Notice that A xor A is 0 for every A.

answered by Marcus Müller