I would expect writes to a char * buffer to take the same time regardless of the existing contents of the memory [1]. Wouldn't you?
However, while narrowing down an inconsistency in a benchmark, I came across a case where this apparently isn't true: a buffer which contains all zeros behaves much differently, performance-wise, from a buffer filled with 42.
Graphically, this looks like (details below): [figure: per-iteration processing time, zero-filled vs. 42-filled buffer]
Here's the code I used to produce the above [3]:
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <string.h>
#include <time.h>

/* 1 GiB; the behavior depends on this size (see below) */
#define BUF_SIZE (1u << 30)

volatile char *sink;

void process(char *buf, size_t len) {
    clock_t start = clock();
    for (size_t i = 0; i < len; i += 678)
        buf[i] = 'z';
    printf("Processing took %lu μs\n",
           1000000UL * (clock() - start) / CLOCKS_PER_SEC);
    sink = buf;
}

int main(int argc, char** argv) {
    int memset42 = argc > 1 && !strcmp(argv[1], "42");
    for (int i = 0; i < 5; i++) {
        char *buf = (char *)malloc(BUF_SIZE);
        if (memset42)
            memset(buf, 42, BUF_SIZE);
        else
            memset(buf, 0, BUF_SIZE);
        process(buf, BUF_SIZE);
    }
    return EXIT_SUCCESS;
}
I compile it on my Linux box like:
gcc -O2 buffer_weirdness.cpp -o buffer_weirdness
... and when I run the version with a zero buffer, I get:
./buffer_weirdness zero
Processing took 12952 μs
Processing took 403522 μs
Processing took 626859 μs
Processing took 626965 μs
Processing took 627109 μs
Note the first iteration is quick, while the remaining iterations take perhaps 50 times longer.
When the buffer is first filled with 42, processing is always fast:
./buffer_weirdness 42
Processing took 12892 μs
Processing took 13500 μs
Processing took 13482 μs
Processing took 12965 μs
Processing took 13121 μs
The behavior depends on BUF_SIZE (1 GB in the example above): larger sizes are more likely to show the issue. It also depends on the current host state: if I leave the host alone for a while, the slow iterations take perhaps 60,000 μs rather than 600,000, so 10x faster, but still ~5x slower than the fast processing time. Eventually the times return to the fully slow behavior.
The behavior also depends at least partly on transparent hugepages: if I disable them [2], the performance on the slow iterations improves by about 3x, while the fast iterations are unchanged.
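Incidentally, THP can also be ruled in or out per region rather than system-wide, via madvise(2). A minimal sketch of that approach (my own illustration, not part of the original benchmark; assumes Linux with THP support):

#include <stdlib.h>
#include <sys/mman.h>

#define BUF_SIZE (1u << 30)

int main(void) {
    void *buf = NULL;
    /* madvise wants a page-aligned address, hence posix_memalign */
    if (posix_memalign(&buf, 4096, BUF_SIZE) != 0)
        return 1;
    /* ask the kernel not to back this range with transparent huge pages */
    madvise(buf, BUF_SIZE, MADV_NOHUGEPAGE);
    /* ... memset and process() the buffer as in the example above ... */
    return 0;
}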
A final note: the total runtime for the process is much closer across the variants than timing the process routine alone would suggest (in fact, the zero-filled, THP-off version is about 2x faster than the others, which are roughly the same).
What's going on here?
[1] Outside of some very unusual optimization, such as the compiler understanding what value the buffer already contained and eliding writes of the same value, which isn't happening here.
[2] sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
[3] It's a distilled version of the original benchmark. Yes, I'm leaking the allocations, get over it: it leads to a more concise example. The original example didn't leak. In fact, when you don't leak the allocations, the behavior changes, probably because malloc can just re-use the region for the next allocation, rather than asking the OS for more memory.
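For the curious, the non-leaking variant just changes the loop in main() above to something like the sketch below (whether malloc actually reuses the region depends on the allocator and the allocation size):

    for (int i = 0; i < 5; i++) {
        char *buf = (char *)malloc(BUF_SIZE);
        memset(buf, 0, BUF_SIZE);
        process(buf, BUF_SIZE);
        free(buf); /* lets malloc potentially reuse this region next time */
    }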
This seems hard to reproduce, so it's probably compiler/libc specific.
My best guess here:
When you call malloc, you get memory mapped into your process space. That does not mean the OS has already taken the necessary pages from its pool of free memory; it has merely added entries to some tables. Now, when you try to access that memory, your CPU/MMU raises a fault, and the OS can catch it and check whether the address belongs to the category "already in the memory space, but not yet actually allocated to the process". If that is the case, the necessary free memory is found and mapped into your process's memory space.
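This demand-faulting is easy to observe directly. A minimal sketch (my own, not from the question; Linux-specific because of MAP_ANONYMOUS): the first pass over a fresh anonymous mapping takes a page fault per page, while the second pass over the same pages does not, so the first pass is dramatically slower.

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void) {
    size_t len = 1UL << 30; /* 1 GiB */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    for (int pass = 0; pass < 2; pass++) {
        clock_t start = clock();
        for (size_t i = 0; i < len; i += 4096)
            buf[i] = 'z'; /* touch each page once */
        printf("pass %d took %lu μs\n", pass,
               1000000UL * (clock() - start) / CLOCKS_PER_SEC);
    }
    munmap(buf, len);
    return 0;
}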
Now, modern OSes often have a built-in option to zero out pages prior to (re-)use. If the OS does that, the memset(buf, 0, len) operation becomes unnecessary. On POSIX systems, if you use calloc instead of malloc, the memory gets zeroed out for you.
In other words, your compiler might have noticed that and completely omitted the memset(buf, 0, len) when your OS supports this. That means the moment you write to the pages in process() is the first moment they get accessed, and that triggers the "on-the-fly page mapping" mechanism of your OS. The memset(buf, 42, len) can of course not be optimized away, so in that case the pages actually get pre-allocated and you don't see that time spent in the process() function.
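One way to test this hypothesis (my own sketch, not from the original post) is to stop the compiler from recognizing the zeroing call, e.g. by routing it through a volatile function pointer; compiling with -fno-builtin-memset should have a similar effect. If the zero-filled case then behaves like the 42 case, the elided memset was indeed responsible:

#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (1u << 30)

/* volatile function pointer: opaque to the optimizer, so the call
 * below cannot be recognized and fused into calloc */
void *(*volatile memset_ptr)(void *, int, size_t) = memset;

int main(void) {
    char *buf = (char *)malloc(BUF_SIZE);
    memset_ptr(buf, 0, BUF_SIZE); /* really runs, pre-faulting the pages */
    /* ... hand buf to process() as in the question ... */
    return 0;
}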
You should use /usr/bin/time to actually compare the whole execution time with the time spent in process: my suspicion implies that the time saved in process is actually spent in main, possibly within kernel context.
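For example, with GNU time (the -v flag prints, among other things, page-fault counts, which should differ sharply between the two cases if this guess is right):

/usr/bin/time -v ./buffer_weirdness zero
/usr/bin/time -v ./buffer_weirdness 42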
UPDATE: Tested with the excellent Godbolt Compiler Explorer: yes, with -O2 and -O3, modern gcc simply omits the zero-memset (or rather, fuses it into calloc, which is malloc with zeroing):
#include <cstdlib>
#include <cstring>

int main(int argc, char **argv) {
    char *p = (char *)malloc(10000);
    if (argc > 2) {
        memset(p, 42, 10000);
    } else {
        memset(p, 0, 10000);
    }
    // had to add this for the compiler to *not* completely remove the
    // function body, since otherwise it has no effect at all
    return (int)p[190];
}
Becomes, for x86_64 on gcc 6.3:
main:
        // store frame state
        push rbx
        // second calloc argument (element size == 1)
        mov esi, 1
        // put argc in ebx
        mov ebx, edi
        // set up call to calloc (== malloc with internal zeroing)
        mov edi, 10000
        call calloc
        // compare ebx (== argc) with 2
        cmp ebx, 2
        mov rcx, rax
        // jump to .L2 if less or equal
        jle .L2
        // if (argc > 2):
        // set up call to memset
        mov edx, 10000
        mov esi, 42
        mov rdi, rax
        call memset
        mov rcx, rax
.L2:    // else case
        // notice the distinct lack of memset here!
        // move the byte at p + 190 (rcx == p) into the "return" register
        movsx eax, BYTE PTR [rcx+190]
        // restore frame
        pop rbx
        // return
        ret
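(If you'd rather not use the online explorer, something like the following prints comparable assembly locally; example.cpp is a stand-in for whatever file holds the snippet above:)

gcc -O2 -S -masm=intel example.cpp -o -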
By the way, if you replace the return (int)p[190]; with a plain

    return 0;

then there's no reason for the compiler to retain the function body at all – its return value is easily determinable at compile time, and it has no side effect. The whole program then compiles to
main:
xor eax, eax
ret
Notice that A xor A is 0 for every A.