I am trying to determine time needed to read an element to make sure it's a cache hit or a cache miss. for reading to be in order I use _mm_lfence() function. I got unexpected results and after checking I saw that lfence function's overhead is not deterministic. So I am executing the program that measures this overhead in a loop of for example 100 000 iteration. I get results of more than 1000 clock cycle for one iteration and next time it's 200. What can be a reason of such difference between lfence function overheads and if it is so unreliable how can I judge latency of cache hits and cache misses correctly? I was trying to use same approach as in this post: Memory latency measurement with time stamp counter
the code that gives unreliable results is this:
for(int i=0; i < arr_size; i++){
_mm_mfence();
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
arr[i] = t2-t1;
}
the values in arr vary in different ranges, arr_size is 100 000.
I get results of more than 1000 clock cycle for one iteration and next time it's 200.
Sounds like your CPU ramped up from idle to normal clock speed after the first few iterations.
Remember that RDTSC counts reference cycles (fixed frequency, equal or close to the max non-turbo frequency of the CPU), not core clock cycles. (idle/turbo / whatever). Older CPUs had RDTSC count core clock cycles, but for years now CPU vendors have had fixed RDTSC frequency making it useful for clock_gettime()
, and advertized this fact with the invariant_tsc
CPUID feature bit. See also Get CPU cycle count?
If you really want to use RDTSC instead of performance counters, disable turbo and use a warm-up loop to get your CPU to its max frequency.
There are libraries that let you program the HW performance counters, and set permissions so you can run rdpmc
in user-space. This actually has lower overhead than rdtsc
. See What will be the exact code to get count of last level cache misses on Intel Kaby Lake architecture for a summary of ways to access perf counters in user-space.
I also found a paper about adding user-space rdpmc
support to Linux perf
(PAPI): ftp://ftp.cs.uoregon.edu/pub/malony/ESPT/Papers/espt-paper-1.pdf. IDK if that made it into mainline kernel/perf code or not.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With