So calloc() works by asking the OS for some virtual memory. The OS is working in cahoots with the MMU, and cleverly responds with a virtual memory address which actually maps to a copy-on-write, read-only page full of zeroes. When a program tries to write anywhere in that page, a page fault occurs (because you cannot write to read-only pages), a copy of the page is created, and your program's virtual memory is mapped to this brand-new copy of those zeroes.
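One way to see this lazy zero-page behaviour for yourself is to watch the process's resident set size before and after touching the memory. A minimal sketch, assuming Linux (where getrusage reports ru_maxrss in KiB) and an allocator that serves a request this large straight from the OS:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Print the peak resident set size so far (Linux reports ru_maxrss in KiB). */
static void print_rss(const char *label)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%-14s max RSS = %ld KiB\n", label, ru.ru_maxrss);
}

int main(void)
{
    size_t size = 256 * 1024 * 1024;   /* 256 MiB */

    print_rss("before calloc:");
    char *buf = calloc(size, 1);       /* reserves address space; pages map to the shared zero page */
    if (!buf) return 1;
    print_rss("after calloc:");        /* RSS barely moves: nothing has been written yet */

    memset(buf, 0xFF, size);           /* first write to each page faults in a private copy */
    print_rss("after writes:");        /* RSS now grows by roughly 256 MiB */

    free(buf);
    return 0;
}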
Now that Meltdown is a thing, OSes have been patched so that it's no longer possible to speculatively execute across the kernel-user boundary. This means that whenever user code calls kernel code, it effectively causes a pipeline stall. Typically, when the pipeline stalls in a loop, it's devastating for performance, since the CPU ends up wasting time waiting for data, whether from cache or main memory.
My questions are:

When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?

If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?

Recently I read somewhere that calloc() does lazy allocation, in that it doesn't allocate any real memory, but just virtual memory. Real memory pages are only allocated as writes to them are performed, and any reads performed on uninitialized memory return 0, just like a sparse file.
Because memory is such a fundamental resource, OS X and iOS both provide several ways to allocate it. For code that uses malloc, remember that being lazy is fine for allocating memory, but do not be lazy about freeing up that memory. To help track down memory leaks in your applications, use the Instruments app.
The calloc function reserves the required virtual address space for the memory but waits until the memory is actually used before initializing it. This approach is much more efficient than using memset, which forces the virtual memory system to map the corresponding pages into physical memory in order to zero-initialize them.
For large memory allocations, where large is anything more than a few virtual memory pages, malloc automatically uses the vm_allocate routine to obtain the requested memory.
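To make that calloc-versus-memset point concrete, here is a rough timing sketch, assuming a POSIX system, enough RAM for a 1 GiB buffer, and an allocator that satisfies an allocation this large with fresh pages from the OS; the exact numbers will vary, but the calloc call itself should be near-instant while the memset pays for every page:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE ((size_t)1 << 30)   /* 1 GiB */

/* Monotonic wall-clock time in seconds. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double t0 = now();
    char *a = calloc(SIZE, 1);        /* lazy: no page is actually zeroed (or even mapped) here */
    double t1 = now();

    char *b = malloc(SIZE);
    if (!a || !b) return 1;
    memset(b, 0, SIZE);               /* eager: faults in and writes every single page */
    double t2 = now();

    printf("calloc:          %.4f s\n", t1 - t0);
    printf("malloc + memset: %.4f s\n", t2 - t1);

    /* Read something back so the compiler can't discard the work above. */
    printf("%d %d\n", a[SIZE - 1], b[SIZE - 1]);
    free(a);
    free(b);
    return 0;
}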
Speculative execution across the kernel/user boundary was never possible; Intel CPUs don't rename the privilege level, i.e. kernel/user transitions always required a full pipeline flush. I think you're misunderstanding Meltdown: it's caused purely by speculative execution in user-space and delayed handling of the privilege checks on TLB hits.
This is universal in CPU design, AFAIK. I'm not aware of any microarchitectures that rename the privilege level or otherwise speculate into kernel code, x86 or otherwise.
The cost added by Meltdown mitigation is that entering the kernel flushes the TLB. (Or on CPUs with TLB process-context ID support, the kernel can use PCIDs to make using separate page-tables for kernel vs. user-space much cheaper).
The kernel entry point (on Linux) becomes a trampoline that swaps page tables and jumps to the real kernel entry point, to avoid exposing the kernel ASLR offset to user-space. But other than that and an extra mov cr3, reg on entry and exit from the kernel (setting a new page table), nothing else is changed.
(Spectre mitigation is tricky, too, and required more changes like retpolines... and might also significantly increase the cost of user->kernel->user. IDK about page fault costs.)
@BeeOnRope reports (see comments and his answer for full details) that without Spectre patches, with just the Meltdown patches applied but "disabled" via the nopti boot option, the cost of a round trip to the kernel on a Skylake CPU (measured with syscall with a bogus RAX, returning -ENOSYS right away) went up from ~100 to ~300 cycles. So that's maybe the cost of the trampoline? And with actual page-table isolation enabled, it went up to ~700 cycles. That's without Spectre mitigation patches at all. (Also, that's the x86-64 syscall entry point, not page-fault. They're likely similar, though.)
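Those figures come from timing a do-nothing system call in a loop. A sketch of the idea, assuming x86-64 Linux, using a bogus syscall number so the kernel returns -ENOSYS right away; this is not the exact harness behind the numbers above, and it reports rdtsc reference cycles rather than core clock cycles:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

#define ITERS 1000000

int main(void)
{
    /* Warm-up call so one-time costs don't land in the timed loop. */
    syscall(9999);

    unsigned long long start = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        /* Bogus syscall number: the kernel entry/exit path runs in full,
           but the call itself just returns -ENOSYS immediately. */
        syscall(9999);
    }
    unsigned long long end = __rdtsc();

    printf("~%llu reference cycles per kernel round trip\n",
           (end - start) / ITERS);
    return 0;
}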
Page fault exceptions:
CPUs don't predict page faults, so they couldn't speculatively execute the handler anyway. Prefetch or decode of the page fault entry point could maybe happen while the pipeline was flushing, but that process wouldn't start until the page-faulting instruction tried to retire. A faulting load/store is marked to take effect on retirement, and doesn't re-steer the front-end; the whole key to Meltdown is the lack of action on a faulting load until it reaches retirement.
Related: When an interrupt occurs, what happens to instructions in the pipeline?
Also: Out-of-order execution vs. speculative execution has some detail about what kind of speculation really causes Meltdown, and how CPUs handle faults.
When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?
Yes, page faults are handled by the kernel's page-fault handler. There's no pure-hardware handling for copy-on-write.
If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?
Yes. The kernel doesn't fault-around for zeroed pages (unlike for file-backed mappings when data is hot in the pagecache). So every new page touched causes a pagefault, even for small 4k normal pages. (Thanks to @BeeOnRope for accurate info on this.) With anonymous hugepages, you'll only pagefault once per 2MiB (x86-64), which is tremendously better.
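If you want to verify the one-fault-per-page behaviour, you can count minor faults around the touching loop. A sketch, assuming Linux, where these zero-page/CoW faults show up as minor faults in getrusage (transparent hugepages, if enabled, will reduce the count dramatically):

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

#define SIZE (64 * 1024 * 1024)   /* 64 MiB */
#define PG_SIZE 4096

/* Minor (soft) page faults taken by this process so far. */
static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    char *mem = calloc(SIZE, 1);
    if (!mem) return 1;

    long before = minor_faults();
    for (volatile char *p = mem; p < mem + SIZE; p += PG_SIZE)
        *p = 0xFF;                /* first touch of each 4k page */
    long after = minor_faults();

    /* Expect the two numbers below to be close, unless hugepages kicked in. */
    printf("pages touched: %d\nminor faults : %ld\n",
           SIZE / PG_SIZE, after - before);
    free(mem);
    return 0;
}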
If you want to avoid per-page costs, allocate with mmap(MAP_POPULATE) to prefault all the pages into the HW page table, on a Linux system. I'm not sure if madvise can prefault pages for you, e.g. madvise(MADV_WILLNEED) on an already-mapped region. But madvise(MADV_HUGEPAGE) will encourage the kernel to use anonymous hugepages (and maybe to defrag physical memory to free up contiguous 2M blocks to enable that, if you don't have it configured to do that without madvise).
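For concreteness, a Linux-only sketch of those two options; MAP_POPULATE and MADV_HUGEPAGE are the real flags, but the rest (size, error handling) is just illustration:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define SIZE (1UL << 30)   /* 1 GiB */

int main(void)
{
    /* Option 1: prefault everything up front, so later writes take no page faults. */
    char *pre = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (pre == MAP_FAILED) { perror("mmap(MAP_POPULATE)"); return 1; }

    /* Option 2: ask for anonymous hugepages, so you fault roughly once
       per 2MiB instead of once per 4k page. */
    char *huge = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (huge == MAP_FAILED) { perror("mmap"); return 1; }
    if (madvise(huge, SIZE, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");   /* THP may be unavailable or disabled */

    /* ... use the buffers ... */

    munmap(pre, SIZE);
    munmap(huge, SIZE);
    return 0;
}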
Related: Two TLB-miss per mmap/access/munmap has some perf results on a Linux kernel with KPTI patches.
Yes, use of calloc()-allocated memory will suffer a performance degradation due to the Meltdown and Spectre patches.

In fact, calloc() isn't special here: malloc(), new and more generally all allocated memory will probably suffer approximately the same performance impact. Both calloc() and malloc() are ultimately backed by pages returned by the OS (although the allocator will re-use them after they are freed). The only real difference is that a smart allocator, when it goes down the path of using new pages from the OS (rather than re-using a previously freed allocation), can in the case of calloc omit the zeroing, because the OS-provided pages are guaranteed to be zero. Other than that the allocator behavior is largely the same, and the OS-level zeroing behavior is the same (there is usually no option to ask the OS for non-zero pages).
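A toy illustration of that decision (hypothetical code, not how any particular allocator is implemented):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Toy sketch of the decision described above. 'recycled' stands for a block
   pulled off an internal free list, which may hold stale data; pass NULL to
   model the path where the allocator falls through to brand-new pages from
   the OS. */
void *toy_calloc(size_t nmemb, size_t size, void *recycled)
{
    size_t bytes = nmemb * size;          /* a real calloc also checks this for overflow */

    if (recycled != NULL) {
        memset(recycled, 0, bytes);       /* reused memory: must be zeroed explicitly */
        return recycled;
    }

    /* Fresh anonymous pages from the kernel are guaranteed zero-filled, so the
       memset can be skipped, and none of the pages is touched (or faulted) yet. */
    return mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

int main(void)
{
    int *fresh = toy_calloc(1024, sizeof(int), NULL);
    if ((void *)fresh == MAP_FAILED) return 1;
    printf("fresh[0] = %d (zero without any memset)\n", fresh[0]);
    return 0;
}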
So the performance impact applies more broadly than you thought, but it is likely smaller than you suggest, since a page fault is already doing a lot of work anyway, so you aren't looking at an order-of-magnitude degradation or anything. See Peter's answer on the reasons the performance impact is likely to be limited. I wrote this answer mostly because the answer to your headline question is still yes, as there is some impact.
To estimate the impact on a malloc-heavy workload, I tried running an allocation- and page-fault-heavy test on a current kernel (4.13.0-39-generic) with the Spectre and Meltdown mitigations, as well as on an older kernel prior to these mitigations.
The test code is very simple:
#include <stdlib.h>
#include <stdio.h>
#define SIZE (40 * 1024 * 1024)
#define PG_SIZE 4096
int main() {
    char *mem = malloc(SIZE);

    /* Touch one byte per page so that every page is faulted in exactly once. */
    for (volatile char *p = mem; p < mem + SIZE; p += PG_SIZE) {
        *p = 'z';
    }

    printf("pages touched: %d\npointer value : %p\n", SIZE / PG_SIZE, mem);
}
The results on the newer kernel were about 3700 cycles per page fault, and on the older kernel without mitigations around 3300 cycles. The overall regression (presumably) due to the mitigations was about 14%. Note that this is on Skylake hardware (i7-6700HQ) where some of the Spectre mitigations are somewhat cheaper, and the kernel supports PCID, which makes the KPTI Meltdown mitigations cheaper. The results might be worse on different hardware.
Oddly, the results on the new kernel with Spectre and Meltdown mitigations disabled at boot (using spectre_v2=off nopti) were much worse than either the new kernel default or the old kernel, coming in at about 5050 cycles per page fault, something like a 35% regression over the same kernel with the mitigations enabled. So something is going really wrong, performance-wise, when the mitigations are disabled.
Here is the full perf stat output for the three runs.
pages touched: 10240
pointer value : 0x7f7d2561e010
Performance counter stats for './pagefaults':
12.980048 task-clock (msec) # 0.976 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
10,286 page-faults # 0.792 M/sec
33,662,397 cycles # 2.593 GHz
27,230,864 instructions # 0.81 insn per cycle
4,535,443 branches # 349.417 M/sec
11,760 branch-misses # 0.26% of all branches
0.013293417 seconds time elapsed
pages touched: 10240
pointer value : 0x7f306ad69010
Performance counter stats for './pagefaults':
14.789615 task-clock (msec) # 0.966 CPUs utilized
8 context-switches # 0.541 K/sec
0 cpu-migrations # 0.000 K/sec
10,288 page-faults # 0.696 M/sec
38,318,595 cycles # 2.591 GHz
28,796,523 instructions # 0.75 insn per cycle
4,693,944 branches # 317.381 M/sec
26,853 branch-misses # 0.57% of all branches
0.015312764 seconds time elapsed
pages touched: 10240
pointer value : 0x7ff079ede010
Performance counter stats for './pagefaults':
16.690621 task-clock (msec) # 0.982 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
10,286 page-faults # 0.616 M/sec
51,964,080 cycles # 3.113 GHz
28,602,441 instructions # 0.55 insn per cycle
4,699,608 branches # 281.572 M/sec
25,064 branch-misses # 0.53% of all branches
0.017001581 seconds time elapsed
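For reference, the per-fault numbers above fall straight out of these counters by dividing cycles by page faults (which attributes every cycle to fault handling, a slight overstatement since the user-space loop itself does a little work):

33,662,397 cycles / 10,286 faults ≈ 3,270 cycles per fault
38,318,595 cycles / 10,288 faults ≈ 3,720 cycles per fault
51,964,080 cycles / 10,286 faults ≈ 5,050 cycles per fault

These match the ~3,300, ~3,700 and ~5,050 cycle figures quoted earlier, so the three runs appear to be the old kernel, the new kernel with mitigations, and the new kernel booted with spectre_v2=off nopti, in that order.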