#include <sys/mman.h>
#define PAGE_SIZE 4096 /* assuming 4 KiB pages */

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        page[0] = 0;
        munmap(page, PAGE_SIZE);
    }
}
I expect to get ~100000 dTLB-store-misses in userspace, one per iteration (and also ~100000 page-faults and dTLB-load-misses on the kernel side). Running the following command, the result is roughly 2x what I expect. I would appreciate it if someone could clarify why this is the case:
perf stat -e dTLB-store-misses:u ./test
Performance counter stats for './test':
200,114 dTLB-store-misses
0.213379649 seconds time elapsed
P.S. I have verified and am certain that the generated code doesn't introduce anything that would justify this result. Also, I do get ~100000 page-faults and dTLB-load-misses:k.
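A single perf invocation can count all three at once; the event names below are the generic perf ones already used above, so treat the exact spelling as an assumption for your setup:

perf stat -e dTLB-store-misses:u -e dTLB-load-misses:k -e page-faults ./test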
I expect to get ~100000 dTLB-store-misses in userspace, one per iteration

I would expect that:

1. page[0] = 0; tries to load the cache line containing page[0], can't find a TLB entry for it, increments dTLB-store-misses, fetches the translation, realises the page is "not present", and generates a page fault.
2. The page fault handler maps the page and invalidates any stale TLB entry for that virtual address (e.g. with INVLPG). It then returns to the instruction that caused the fault so it can be retried.
3. page[0] = 0; executes a second time, tries to load the cache line containing page[0], can't find a TLB entry for it, increments dTLB-store-misses again, fetches the translation, then modifies the cache line.

That's two TLB misses per store, one before the page fault and one after it, which accounts for the ~200000 you measured.

For fun, you could use the MAP_POPULATE flag with mmap() to try to get the kernel to pre-allocate the pages (and avoid the page fault and the first TLB miss).
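A minimal sketch of that variant, assuming 4 KiB pages (MAP_POPULATE is Linux-specific; everything else is the original loop):

#include <sys/mman.h>
#define PAGE_SIZE 4096 /* assuming 4 KiB pages */

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        /* MAP_POPULATE asks the kernel to fault the page in up front,
           so the store below shouldn't take a page fault. */
        int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
        page[0] = 0;
        munmap(page, PAGE_SIZE);
    }
}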
Update 2: I think Brendan's answer is right. I should maybe delete this, but the ocperf.py suggestion is still useful for future readers, I think. And it might explain extra TLB misses on CPUs without Process-Context Identifiers (PCID), with kernels that mitigate Meltdown.

Update: the guess below was wrong. New guess: mmap has to modify your process's page table, so perhaps there's some TLB invalidation just from that. My recommendation to use ocperf.py record to figure out which asm instructions are causing the TLB misses still stands. Even with optimization enabled, the code will store to the stack when pushing/popping a return address for the glibc wrapper function calls.
Perhaps your kernel has kernel/user page-table isolation (KPTI) enabled to mitigate Meltdown, so on return from kernel to user, all TLB entries have been invalidated (by modifying CR3 to point to page tables that don't include the kernel mappings at all).

Look for Kernel/User page tables isolation: enabled in your dmesg output. You can try booting with nopti (or pti=off) as a kernel option to disable it, if you don't mind being vulnerable to Meltdown while testing.
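For example (the exact wording of the boot message varies across kernel versions, so treat this grep pattern as an approximation):

dmesg | grep -i 'page tables isolation'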
Because you're using C, you're making the mmap and munmap system calls through their glibc wrappers, not with inline syscall instructions directly. The ret instruction in each wrapper needs to load the return address from the stack, which can TLB-miss.

The extra store misses probably come from the call instructions pushing a return address, although I'm not sure that's right, because the current stack page should already be in the TLB from the ret of the previous system call.
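One way to test that theory is to make the syscall itself inline, so no wrapper call/ret touches the stack. A hedged sketch for x86-64 Linux only (raw_munmap is a made-up helper name; the ABI used here, syscall number in rax, args in rdi/rsi, rcx/r11 clobbered, is the standard one):

#include <sys/syscall.h> /* SYS_munmap */

/* When the compiler inlines this, no call/ret touches the stack. */
static inline long raw_munmap(void *addr, unsigned long len) {
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"((long)SYS_munmap), "D"(addr), "S"(len)
                 : "rcx", "r11", "memory");
    return ret;
}

If the extra store misses disappear once the wrappers are gone, that would confirm the call/ret explanation.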
You can profile with ocperf.py to get symbolic names for uarch-specific events. Assuming you're on a recent Intel CPU, use ocperf.py record -e mem_inst_retired.stlb_miss_stores,page-faults,dTLB-load-misses to find which instructions cause store misses. (Then use ocperf.py report -Mintel.) If report doesn't make it easy to choose which event to see counts for, record with only a single event.

mem_inst_retired.stlb_miss_stores is a "precise" event, unlike most of the other store-TLB events, so the counts should be attributed to the actual instruction, rather than to some later instruction the way imprecise perf events are. (See Andy Glew's trap vs. exception answer for some details about why some performance counters can't easily be precise; many store events aren't.)