Is it possible to measure the number of successful store-forwarding operations using the performance counters on recent Intel x86 chips?
I see events for ld_blocks.store_forward, which measure failed store-forwarding, but it's not clear to me whether the successful case can be measured.
I don't see anything more than you did for SKL, but older uarches may have more details:
For Core2 (what Intel confusingly calls the Core microarchitecture), the optimization manual documents (in B.7 EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE):
B.7.5.2 4K Aliasing and Store Forwarding Block Detection
- Loads Blocked by Overlapping Store Rate: LOAD_BLOCK.OVERLAP_STORE / CPU_CLK_UNHALTED.CORE
4K aliasing and store forwarding block are two different scenarios in which loads are blocked by preceding stores due to different reasons. Both scenarios are detected by the same event: LOAD_BLOCK.OVERLAP_STORE. A high value for “Loads Blocked by Overlapping Store Rate” indicates that either 4K aliasing or store forwarding block may affect performance.
This event may count both stalled and successful store forwarding. (It also counts 4K aliasing, so you would need to avoid that or subtract it.)
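To make the 4K-aliasing caveat concrete, here is a minimal Python sketch of the usual simplified condition: a load and an earlier store touch different addresses whose offsets within a 4 KiB page match. The function name is hypothetical, and the exact address bits the hardware compares may differ between microarchitectures, so treat this as an illustration rather than a description of what the counter checks.

```python
PAGE_OFFSET_MASK = 0xFFF  # low 12 bits: offset within a 4 KiB page


def may_4k_alias(store_addr: int, load_addr: int) -> bool:
    """Simplified check (an assumption, not the exact hardware rule):
    the two addresses differ but share the same 4 KiB page offset."""
    same_offset = (store_addr & PAGE_OFFSET_MASK) == (load_addr & PAGE_OFFSET_MASK)
    return same_offset and store_addr != load_addr
```

Keeping the buffers in a test loop at offsets that fail this check is one way to avoid mixing 4K-aliasing counts into the measurement.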
B.7.5.3 Load Block by Preceding Stores
- Loads Blocked by Unknown Store Address Rate: LOAD_BLOCK.STA / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by Unknown Store Address Rate” indicates that loads are frequently blocked by preceding stores with unknown address and implies performance penalty.
- Loads Blocked by Unknown Store Data Rate: LOAD_BLOCK.STD / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by Unknown Store Data Rate” indicates that loads are frequently blocked by preceding stores with unknown data and implies performance penalty.
These last two counters would appear to count successful store forwarding, but only in cases where the load actually had to wait after detecting the (possible) overlap.
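The three ratios quoted from the manual are all an event count divided by unhalted core cycles. A small Python sketch of that arithmetic follows; the helper name is hypothetical and the counts are made-up placeholders, not measurements.

```python
def event_rate(event_count: int, core_cycles: int) -> float:
    """Return event_count / CPU_CLK_UNHALTED.CORE, as in the manual's ratios."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return event_count / core_cycles


# Placeholder counts for illustration only; substitute your own readings.
core_cycles = 1_000_000
counts = {
    "LOAD_BLOCK.OVERLAP_STORE": 50_000,
    "LOAD_BLOCK.STA": 20_000,
    "LOAD_BLOCK.STD": 5_000,
}

rates = {name: event_rate(count, core_cycles) for name, count in counts.items()}
for name, rate in rates.items():
    print(f"{name} / CPU_CLK_UNHALTED.CORE = {rate:.4f}")
```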
There is no documented event that counts successful store-forwarding operations. However, I have experimentally determined a set of undocumented events for that purpose on Haswell and Broadwell. In particular, any event with event code 0x02 and an odd umask value (such as 0x01) seems to count successful store forwarding very accurately: the counts match what I expect and the standard deviation is practically zero. I think you can use the same events on later (and even earlier) microarchitectures, but again, none of these events are documented.
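If you want to try this with Linux perf, a raw event is normally written as `r` followed by the hex value `(umask << 8) | event`. The Python sketch below only builds that string and a `perf stat` argument list; the helper names and the `./a.out` program are hypothetical, and whether the event really counts store forwarding on your CPU is the undocumented part you would have to verify yourself.

```python
def raw_config(event: int, umask: int) -> int:
    """Pack an 8-bit event code and 8-bit umask as (umask << 8) | event."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in 8 bits")
    return (umask << 8) | event


def perf_raw_event(event: int, umask: int) -> str:
    """Format the packed value as a perf raw event string such as 'r0102'."""
    return f"r{raw_config(event, umask):04x}"


def perf_stat_command(event: int, umask: int, program: list[str]) -> list[str]:
    """Build (but do not run) a perf stat command line for the raw event."""
    return ["perf", "stat", "-e", perf_raw_event(event, umask), "--", *program]


# Event 0x02 with umask 0x01, as described above; './a.out' is a placeholder.
print(" ".join(perf_stat_command(0x02, 0x01, ["./a.out"])))
```

The same event can also be written as `cpu/event=0x02,umask=0x01/` on the perf command line.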