Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect stats of memory-accesses using perf stat
. I am able to collect stats for memory-stores but the count for memory-loads return me a 0 value.
The below is the details for memory-stores :-
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count as can be seen below :-
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event in any way to get proper data ?
The mem-loads
event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3
performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_*
are special and can only be counted by using the p
modifier. That is, you have to specify mem-loads:p
to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_*
is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is used that can keep track of a data load from issue to completion. This is more complicated than simply counting instances of an event (as with normal event-based sampling), and so only some loads are tracked. Loads are randomly chosen, the latency determined for each, and the correct event(s) incremented (latency >4, >8, >16, etc). Due to the nature of the sampling for this event, only a small percentage of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_*
by no means count the total number of loads and it is not designed for that purpose at all.
If you want to to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_*
is the right performance event to use. In fact, that is exactly the purpose of perf-mem
and it achieves its purpose by using this event.
If you want to count the total number of load uops retired, then you should use L1-dcache-loads
, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS
performance event on Intel processors.
On the other hand, mem-stores
and L1-dcache-stores
are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES
, which does count all retired store uops.
So in summary, if you are using perf-stat
, you should (almost) always use L1-dcache-loads
and L1-dcache-stores
to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, only more portable because they also work on AMD processors.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With