I am trying to find a configuration or memory access pattern for Intel's clwb instruction that would not invalidate the cache line. I am testing on an Intel Xeon Gold 5218 processor with NVDIMMs. The Linux version is 5.4.0-3-amd64. I tried using Device-DAX mode and mapping the char device directly into the address space. I also tried adding the non-volatile memory as a new NUMA node and using numactl --membind to bind memory to it. In both cases, when I use clwb on a cached address, the line is evicted. I am observing the eviction with PAPI hardware counters, with the prefetchers disabled.
This is the simple loop that I am testing. The array and the tmp variable are both declared volatile, so the loads are really executed.
    for (int i = 0; i < arr_size; i++) {
        tmp = array[i];
        _mm_clwb(&array[i]);
        _mm_mfence();
        tmp = array[i];
    }
Both reads are giving cache misses.
I was wondering if anyone else has tried to detect whether there is some configuration or memory access pattern that would leave the cache line in the cache?
clwb behaves like clflushopt on SKX (Skylake server) and CSL (Cascade Lake). However, programs that use clwb on these processors will automatically benefit when run on a future processor that supports an optimized implementation of clwb.
clwb retains the cache line on ICL (Ice Lake).
Note that the cpuid leaf 0x7 information from InstLatx64 says that ICL doesn't support clwb, which is incorrect.
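If you would rather query the CPU directly than rely on a published dump, a minimal sketch along the following lines may help. It assumes GCC or Clang on x86 (for the `__get_cpuid_count` helper in `<cpuid.h>`), and the function names `ebx_has_clwb`, `ebx_has_clflushopt` and `clwb_supported` are made up for illustration. Note that it only reports whether clwb is supported, not whether the implementation keeps the line in the cache: SKX and CSL report support and still evict.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper header; assumption: not MSVC */
#endif

/* Feature bits in CPUID.(EAX=07H, ECX=0):EBX. */
#define LEAF7_EBX_CLFLUSHOPT_BIT 23
#define LEAF7_EBX_CLWB_BIT       24

/* Decode the CLFLUSHOPT bit from a leaf 0x7 EBX value. */
static int ebx_has_clflushopt(uint32_t ebx)
{
    return (int)((ebx >> LEAF7_EBX_CLFLUSHOPT_BIT) & 1u);
}

/* Decode the CLWB bit from a leaf 0x7 EBX value. */
static int ebx_has_clwb(uint32_t ebx)
{
    return (int)((ebx >> LEAF7_EBX_CLWB_BIT) & 1u);
}

/*
 * Returns 1 if the running CPU reports clwb, 0 if it does not,
 * and -1 if leaf 0x7 cannot be queried (old CPU or non-x86 build).
 */
static int clwb_supported(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid_count returns 0 when the requested leaf is unavailable. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    return ebx_has_clwb(ebx);
#else
    return -1;
#endif
}
```

Calling `clwb_supported()` once at startup and falling back to clflushopt or clflush when it returns 0 is one way to use this. Whether the line is retained still has to be measured, for example with the hardware counters used in the question.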
clwb is also supported on Zen 2, but I don't know how it works on this microarchitecture.