I was trying to benchmark instruction cache misses (by generating some functions largely consisting of nops and calling each other) when I noticed something a bit weird which I've tried to reproduce. The setup for the benchmark is as follows.
I write functions in assembly (x86-64) of differing sizes; the functions consist of a varying number of nop instructions followed by a ret. I benchmark these functions using google benchmark.
With google benchmark you get the time for each function. The functions I tested differ only in the number of nop instructions - the instruction count goes from 2^10 to 2^29 in powers of 2.
Finally, I divide the time measured for each function by the number of instructions in that function and multiply by my CPU's frequency, getting a rough cycles/nop count for each function.
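For concreteness, here's a tiny sketch of that post-processing (my own illustration - the repo presumably does this in a separate script, and the numbers below are placeholders rather than measurements):

#include <cstdio>

// google benchmark reports an average time per iteration; one iteration is one
// call to fun_x, so cycles/nop ~= ns_per_call * clock_GHz / instruction_count.
static double cycles_per_nop(double ns_per_call, double ghz, double n_instructions) {
    return ns_per_call * ghz / n_instructions;
}

int main() {
    // Placeholder inputs: an assumed 4.5 GHz clock and a made-up per-call time.
    std::printf("%.3f cycles/nop\n", cycles_per_nop(2.0e6, 4.5, 1 << 23));
}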

This is what the result was - I've repeated the process a few times and the graph is always similar to this.
I expected the graph to be a flat line, but there is a sudden rise around 2^23. Obviously the processor will optimize the execution (superscalar, prefetching, etc.), but I see no reason why there would be a sudden rise when the number of instructions is around 2^23.
So that is the question - what is causing that sudden rise in cycles/nop? I've not had a chance to run the program on other processors/operating systems. I'm using Ubuntu 22.04 on an 11th gen Intel i5-11400H.
I've put the code on a GitHub repo so it should be easy enough for anyone to run the setup and get the graph. For the sake of completeness, the main code used to run the benchmark is as follows:
#include <benchmark/benchmark.h>
#include <x86intrin.h>
extern "C" void fun_1024();
extern "C" void fun_2048();
extern "C" void fun_4096();
extern "C" void fun_8192();
extern "C" void fun_16384();
extern "C" void fun_32768();
extern "C" void fun_65536();
extern "C" void fun_131072();
extern "C" void fun_262144();
extern "C" void fun_524288();
extern "C" void fun_1048576();
extern "C" void fun_2097152();
extern "C" void fun_4194304();
extern "C" void fun_8388608();
extern "C" void fun_16777216();
extern "C" void fun_33554432();
extern "C" void fun_67108864();
extern "C" void fun_134217728();
extern "C" void fun_268435456();
extern "C" void fun_536870912();
static void (*funs[])(void) = {
fun_1024,
fun_2048,
fun_4096,
fun_8192,
fun_16384,
fun_32768,
fun_65536,
fun_131072,
fun_262144,
fun_524288,
fun_1048576,
fun_2097152,
fun_4194304,
fun_8388608,
fun_16777216,
fun_33554432,
fun_67108864,
fun_134217728,
fun_268435456,
fun_536870912,
};
void BM(benchmark::State &state) {
  for (auto _ : state) {
    funs[state.range(0)]();
  }
}
BENCHMARK(BM)->DenseRange(0, 19, 1);
BENCHMARK_MAIN();
where fun_x is simply a function consisting of x-1 nops followed by a ret.
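In case it helps, here is a rough sketch of what one of those functions could look like (an assumption on my part - I haven't reproduced the repo's actual .s files - using GCC/Clang file-scope asm and the GAS .rept directive):

// fun_1024: 1023 one-byte NOPs followed by a ret, i.e. 1024 instructions total.
asm(R"(
    .text
    .globl fun_1024
fun_1024:
    .rept 1023
    nop
    .endr
    ret
)");

extern "C" void fun_1024();  // callable from the benchmark exactly as above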
Edit:
After Peter's answer I've tried measuring certain perf counters to confirm whether L2 TLB misses are what's causing the slowdown. itlb_misses.miss_causes_a_walk was not available on my system, so I've used frontend_retired.itlb_miss and frontend_retired.stlb_miss, which I think should also be able to indicate whether TLB misses are the root cause.

(the cache-misses line is always zero because of some bug(?) when using perf with the -n flag, so ignore that)
The setup remains the same and I'm using google benchmark. For each event I've plotted the number of samples for each function (available using the -n flag with perf report). Since the functions are run a different number of times and have different sizes, I've divided the number of samples for a function by (number of iterations of the function * number of nop instructions in the function). Somewhat to my surprise this division doesn't make much of a difference to the graph (it looks like google benchmark's variable iteration count already deals with this indirectly).
From the results it seems like TLB misses are not the issue? The frontend_retired.itlb_miss and frontend_retired.stlb_miss samples are barely above zero.
The icache_data.stalls line matches up perfectly with the original graph, which seems to show that cache misses are the cause for the slow-down.
I then measured frontend_retired.l1i_miss and frontend_retired.l2_miss and the L1i counter seems to be about right - the rise is around 2^15 which matches up with its 32 KiB size. I'm not sure why this line then dips but maybe that's an artifact of perf?
The L2 miss counter is again unexpected; I would've expected it not to stay so close to zero. And there seems to be no event to measure instruction fetches that lead to L3 cache misses, or at least I couldn't find one.
So, in essence after this "experiment" my questions can be summarized as follows:
(1) Does this method of measurement - i.e. comparing the number of samples per event per function - even make sense? The results will only make sense if the measurement is correct, and the results do seem a bit weird.
(2) If the measurements are correct and the slowdown is indeed caused by cache misses, I would be a bit surprised, since I would've intuitively expected the prefetching logic to deal with such a sequential access pattern - is there any reason why this prefetching does not happen?
(Also, icache_data.stalls counts the number of cycles where the frontend is stalled waiting for instructions, whereas the frontend_retired events count retired instructions, so it is a bit like comparing apples to oranges - but since they can both qualitatively point towards the actual cause and are of similar scales I've plotted them all together.)
Assuming 1-byte NOPs, 1<<23 = 8 MiB is the amount of address-space covered by Sunny Cove's (Tiger Lake's) L2 TLB using 4K pages (2048 entries x 4 KiB = 8 MiB). That's the prime suspect since it's a cutoff / breakpoint that matches the point where performance falls off in your testing.
(https://www.anandtech.com/show/14664/testing-intel-ice-lake-10nm/2 shows the L1iTLB is smaller than the L1dTLB, but the L2TLB is shared. Tiger Lake's Willow Cove cores are mostly the same as ICL Sunny Cove.)
With that many NOPs plus a caller in a different page and/or any interrupt handlers, you're going to be getting iTLB misses that require page walks. (Perf event itlb_misses.miss_causes_a_walk). I would have thought that hardware TLB prefetch would mostly hide latency, and that the page-walk hardware's internal caching would make contiguous next-page access cheaper than walking all 4 (or 5?) levels, but maybe the cost is still significant.
You could test with 2M largepages. If Linux isn't using them for your .text section, use mmap(MAP_ANONYMOUS|MAP_PRIVATE) to allocate some write+exec pages that you memset to fill with 0x90 (short NOP). Use Linux madvise(MADV_HUGEPAGE) to encourage THP, and/or madvise(MADV_COLLAPSE) to force a one-time synchronous collapse of 4K pages into hugepages. Actually, MADV_COLLAPSE says it works on file-backed mappings, so it might work on your existing functions, without having to mmap and generate code. (https://man7.org/linux/man-pages/man2/madvise.2.html)
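A minimal sketch of that mmap route (my own illustration, assuming Linux; MADV_COLLAPSE needs a recent kernel (6.1+) and recent headers, hence the #ifdef):

// Allocate anonymous RWX memory, fill it with 0x90 (1-byte NOP), end it with
// 0xC3 (ret), and ask the kernel for hugepages. Illustrative only.
#include <sys/mman.h>
#include <cstring>
#include <cstdio>

int main() {
    const size_t len = 8u << 20;   // 8 MiB of generated code
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(p, len, MADV_HUGEPAGE);                   // encourage THP for this range
    std::memset(p, 0x90, len - 1);                    // 0x90 = nop
    static_cast<unsigned char *>(p)[len - 1] = 0xC3;  // 0xC3 = ret
#ifdef MADV_COLLAPSE
    madvise(p, len, MADV_COLLAPSE);                   // synchronous collapse into hugepages (Linux 6.1+)
#endif

    auto fun = reinterpret_cast<void (*)()>(p);
    fun();        // run the NOP sled once as a smoke test; benchmark it like the others
    munmap(p, len);
}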
With your Google::Benchmark setup, the first call acts as a warm-up so most of the timed region isn't paying for (soft) page-faults the first time a page is touched.
Are each of your NOP functions separate, or do their tails overlap? (e.g. the fun_1024: label could be halfway into the body of fun_2048, and so on, so they all end with the same ret). Either way would be a slightly different test as far as TLB hits for the next function, but Benchmark repeats the call enough times that it shouldn't make much difference.
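For illustration, an overlapped layout might look something like this (a hypothetical sketch, not necessarily how your repo lays things out):

// fun_2048 falls through into fun_1024 and they share the final ret, so
// fun_2048 executes 2047 NOPs + ret and fun_1024 executes 1023 NOPs + ret.
asm(R"(
    .text
    .globl fun_2048
    .globl fun_1024
fun_2048:
    .rept 1024
    nop
    .endr
fun_1024:
    .rept 1023
    nop
    .endr
    ret
)");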
For frequency, did you use perf to check the actual core clock during the benchmark? Or did you assume max turbo 4.5GHz? (https://www.intel.com/content/www/us/en/products/sku/213805/intel-core-i511400h-processor-12m-cache-up-to-4-50-ghz/specifications.html)
https://chipsandcheese.com/p/sunny-cove-intels-lost-generation tested instruction-fetch bandwidth with 8-byte and 4-byte NOPs, finding a dropoff at each level of cache. On their Ice Lake Xeon (with 48 MiB L3), bandwidth stayed above 5 bytes per cycle until a loop size of 65536 KiB = 64 MiB which is somewhat larger than its L3.
But at least for their cache/DRAM latency testing they used 2M pages to avoid L2 TLB latency and page walks. Maybe also for their code-fetch testing.
If anything, Tiger Lake client should have better single-core DRAM bandwidth due to lower uncore latency with a small ring bus instead of the mesh on Xeon models. (And maybe better L3 bandwidth). That article mentions that some Tiger Lake systems would use LPDDR4X which has higher DRAM latency, but hopefully HW prefetch can still hide it for sequential reads.
In your test with 1-byte NOPs, you won't see 5/cycle throughput since the uop cache can only handle 18 uops per 32 bytes of machine code (3x 6-uop "lines"). So even for smallish functions you'll be limited by legacy decode (MITE) to 4 IPC.