There are some questions about this topic, but none has a real answer. The question is: how can I measure L1, L2, L3 (if any) cache misses on macOS?
The problem is not that macOS does not provide, in theory, those values even without any external tool. In Instruments we could use the Counters and go to Recording Options... as in here:

However, there is no L1 cache miss or L2, but a huge list of possible items that could be selected:

So, when measuring L1 and L2 cache misses (or even L3 if there is any), how can I count them?
Which of the list is the "cache misses" I should pay attention to in order to retrieve that magic "cache miss" number?
On Ivy Bridge, Haswell, Broadwell, and Goldmont processors, you can use the following events to count the number of data cache lines that were needed by demand load requests from cacheable1 load instructions that missed the L1, L2, and L3: MEM_LOAD_UOPS_RETIRED.L1_MISS, MEM_LOAD_UOPS_RETIRED.L2_MISS, and MEM_LOAD_UOPS_RETIRED.L3_MISS, respectively. On Skylake and later, the corresponding events are called: MEM_LOAD_RETIRED.L1_MISS, MEM_LOAD_RETIRED.L2_MISS, and MEM_LOAD_RETIRED.L3_MISS. These events only count cache lines that were needed by load instructions that were retired.
On Nehalem and later, you can use the following events to count the number of cache lines that were needed by demand store requests from cacheable store instructions that missed the L1, L2, and L3: L2_RQSTS.ALL_RFO, L2_RQSTS.RFO_MISS, and OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37), respectively. These events count cache lines that were needed by stores instructions that were retired or flushed from the pipeline.
Counting only retired instructions can be more useful that counting accesses from all instructions depending on the scenario. Unfortunately, there are no store events that correspond to MEM_LOAD_UOPS_*. However, there are load events that count both retired and flushed loads. These include L2_RQSTS.ALL_DEMAND_DATA_RD for L1 load misses, L2_RQSTS.DEMAND_DATA_RD_MISS for L2 load misses, and OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37) for L3 load misses. Note that the first two events include also loads from the L1 hardware prefetchers. The L2_RQSTS.DEMAND_DATA_RD_MISS event is only supported on Ivy Bridge and later. On Sandy Bridge, I think it can be calculated by subtracting L2_RQSTS.DEMAND_DATA_RD_HIT from L2_RQSTS.ALL_DEMAND_DATA_RD.
See also: How does Linux perf calculate the cache-references and cache-misses events.
Footnotes:
(1) The IN instruction is counted as a MEM_LOAD_UOPS_RETIRED.L1_MISS event on Haswell (See: What does port-mapped I/O look like on Sandy Bridge). I've also verified empirically that all of the MEM_LOAD_UOPS_RETIRED.L1|2|3|LFB_MISS|HIT events don't count loads from the UC or WC memory types and that they do count loads from the WP, WB, and WT memory types. Note that the manual only mentions that UC loads are excluded and for only some of the events. By the way, MEM_UOPS_RETIRED.ALL_LOADS counts loads from all memory types.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With