I am testing the behavior of some intrinsic operations. I was surprised to notice that _mm_mfence() issues load instructions from user space, yet those loads are not counted as L1 data cache hits, misses, or fill-buffer hits. I am using PAPI's native events, such as MEM_INST_RETIRED and MEM_LOAD_RETIRED, to read the performance counters. This piece of code:
for (int i = 0; i < 1000000; i++) {
    _mm_mfence();
}
counts ALL_LOADS: 737030, L1_HIT: 99, L1_MISS: 10, FB_HIT: 25, while without the mfence the overhead of reading the counters alone is roughly ALL_LOADS: 125, L1_HIT: 94, L1_MISS: 11, FB_HIT: 24.
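For reference, here is a minimal sketch of the kind of measurement harness I am describing, using PAPI's named-event API. The event spellings (MEM_INST_RETIRED:ALL_LOADS etc.) are assumptions; the exact native-event names depend on the CPU and PAPI version (check papi_native_avail):

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>
#include <immintrin.h>

/* Assumed event spellings; verify with papi_native_avail on your machine. */
static const char *events[] = {
    "MEM_INST_RETIRED:ALL_LOADS",
    "MEM_LOAD_RETIRED:L1_HIT",
    "MEM_LOAD_RETIRED:L1_MISS",
    "MEM_LOAD_RETIRED:FB_HIT",
};

int main(void) {
    int es = PAPI_NULL;
    long long counts[4];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&es) != PAPI_OK)
        exit(1);
    for (int i = 0; i < 4; i++)
        if (PAPI_add_named_event(es, events[i]) != PAPI_OK)
            exit(1);

    PAPI_start(es);
    for (int i = 0; i < 1000000; i++)
        _mm_mfence();
    PAPI_stop(es, counts);

    for (int i = 0; i < 4; i++)
        printf("%s: %lld\n", events[i], counts[i]);
    return 0;
}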
I checked, and sfence and lfence do not have this effect. I am compiling with -O3. From the compiled output I guess it calls the __builtin_ia32_mfence builtin, but I could not find much about it.
I understand in general what _mm_mfence() does and why we use it; my question is more about how it works. It would be great if anyone could explain this behavior or point me to a related article.
_mm_mfence() compiles to just the mfence instruction, which is not a load or store, architecturally speaking.
One or more of the uops that it decodes to may microarchitecturally run on a load port and get counted as a load, though.
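To confirm that the counted loads come from mfence's own uops rather than anything the compiler generates around the intrinsic, one option is to emit the fence directly with inline asm (a sketch, GCC/Clang syntax):

/* Loop body with no C-level memory access: any loads counted by
 * MEM_INST_RETIRED:ALL_LOADS must come from mfence itself. */
for (int i = 0; i < 1000000; i++)
    __asm__ volatile("mfence" ::: "memory");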
What CPU are you using? If Skylake, I assume you have updated microcode, so mfence costs more than Agner Fog's tables list it as. (And it blocks out-of-order exec of non-memory uops, like lfence; see "Are loads and stores the only instructions that gets reordered?". Apparently some Intel CPUs before Skylake didn't do that for mfence.)
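If you want a rough idea of what mfence costs on your own CPU after the microcode update, a crude timing sketch with __rdtsc() can be compared against Agner Fog's tables. Note that rdtsc counts reference cycles, not core clock cycles, so treat the number as approximate:

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc and _mm_mfence */

int main(void) {
    enum { N = 1000000 };
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < N; i++)
        _mm_mfence();
    unsigned long long t1 = __rdtsc();
    /* Reference cycles per mfence; compare against the throughput
     * numbers in Agner Fog's instruction tables for your uarch. */
    printf("~%.1f ref cycles per mfence\n", (double)(t1 - t0) / N);
    return 0;
}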