I'm trying to analyse an execution on an Intel Haswell CPU (Intel® Core™ i7-4900MQ) with the Top-down Microarchitecture Analysis Method (TMAM), described in Chapters B.1 and B.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. (I adjust the Sandy Bridge formulas described in B.4 to the Haswell Microarchitecture if needed.)
To that end, I measure performance counter events with perf. There are some results I don't understand:
CPU_CLK_UNHALTED.THREAD_P < CYCLE_ACTIVITY.CYCLES_LDM_PENDING

This holds only for a few measurements, but it is still strange. Does the PMU count halted cycles for CYCLE_ACTIVITY.CYCLES_LDM_PENDING?
CYCLE_ACTIVITY.CYCLES_L2_PENDING > CYCLE_ACTIVITY.CYCLES_L1D_PENDING
and CYCLE_ACTIVITY.STALLS_L2_PENDING > CYCLE_ACTIVITY.STALLS_L1D_PENDING

This holds for all measurements. When there is an L1D cache miss, the load is transferred to the L2 cache, right? So a load that missed L2 must also have missed L1D earlier. The L1 instruction cache is not counted here, but with *_L2_PENDING being 100x or even 1000x greater than *_L1D_PENDING, that is probably not the explanation. Are the stalls/cycles somehow measured separately? But then there is this formula:
%L2_Bound = (CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS

Hence CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING is assumed (the result of the formula must be positive). (Another issue with this formula is that it should probably use CYCLES instead of STALLS; however, that wouldn't solve the problem described above.) So how can this be explained?
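For reference, here is the formula as a small Python sketch. The counter values are invented for illustration, not measurements from my machine:

```python
# Sketch of the %L2_Bound formula from Section B.5 of the Intel
# Optimization Manual; the counter values below are made up.

def l2_bound(stalls_l1d_pending, stalls_l2_pending, clocks):
    """%L2_Bound = (STALLS_L1D_PENDING - STALLS_L2_PENDING) / CLOCKS."""
    return (stalls_l1d_pending - stalls_l2_pending) / clocks

# The formula assumes STALLS_L2_PENDING <= STALLS_L1D_PENDING,
# in which case the result is non-negative:
print(l2_bound(8_000_000, 5_000_000, 20_000_000))   # 0.15

# With measurements like mine (STALLS_L2_PENDING much larger),
# the same formula goes negative, which makes no sense as a fraction:
print(l2_bound(8_000_000, 900_000_000, 20_000_000))
```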
edit: My OS: Ubuntu 14.04.3 LTS, kernel: 3.13.0-65-generic x86_64, perf version: 3.13.11-ckt26
I'll start with the second part of the question, i.e., how CYCLE_ACTIVITY.CYCLES_L2_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L1D_PENDING, respectively.
First, note that the formula for %L2_Bound is from Section B.5 of the Intel Optimization Manual. The first paragraph of that section says:
This section covers various performance tuning techniques using performance monitoring events. Some techniques can be adapted in general to other microarchitectures, most of the performance events are specific to Intel microarchitecture code name Sandy Bridge.
My first hunch was that prefetching has something to do with it (see my comment). This paragraph pushed me further in the right direction; these events may represent different things in Sandy Bridge and in Haswell. Here is what they mean on Haswell:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles with pending L1 data cache miss loads.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Cycles with pending L2 miss loads.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Execution stalls due to L1 data cache miss loads.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Number of loads missed L2.
The manual also says the counters for L2 should only be used when hyperthreading is disabled. Now here is what they mean on Sandy Bridge:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Each cycle there was a miss-pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Each cycle there was a MLC-miss pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Each cycle there was a miss-pending demand load this thread and no uops dispatched, increment by 1.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Each cycle there was a MLC-miss pending demand load and no uops dispatched on this thread, increment by 1.
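The SNB semantics quoted above can be modelled as a toy per-cycle counter. This is only a sketch of my reading of those definitions; the cycle states are hypothetical. With demand loads only, an L2 (MLC) miss implies a pending L1D miss in the same cycle, so the L2 counters can never exceed the L1D ones:

```python
# Toy model of the SNB per-cycle counting semantics. Each cycle state is
# (l1d_miss_pending, l2_miss_pending, uops_dispatched); hypothetical values.
# An L2 miss implies the load also missed L1D, so l2_pending is never set
# without l1d_pending.

def count_events(cycle_states):
    cycles_l1d = cycles_l2 = stalls_l1d = stalls_l2 = 0
    for l1d_pending, l2_pending, dispatched in cycle_states:
        if l1d_pending:
            cycles_l1d += 1          # CYCLES_L1D_PENDING
            if not dispatched:
                stalls_l1d += 1      # STALLS_L1D_PENDING
        if l2_pending:
            cycles_l2 += 1           # CYCLES_L2_PENDING
            if not dispatched:
                stalls_l2 += 1       # STALLS_L2_PENDING
    return cycles_l1d, cycles_l2, stalls_l1d, stalls_l2

states = [
    (True, False, True),   # L1D miss pending, uops still dispatching
    (True, True, False),   # L2 miss pending, pipeline stalled
    (True, True, False),
    (False, False, True),  # no miss pending
]
print(count_events(states))  # (3, 2, 2, 2)
```

Under these semantics CYCLES_L2_PENDING <= CYCLES_L1D_PENDING and STALLS_L2_PENDING <= STALLS_L1D_PENDING always hold, which is exactly what the %L2_Bound formula relies on.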
There are three important differences:

CYCLE_ACTIVITY.STALLS_L2_PENDING on HSW counts the number of load misses at L2, but on SNB it counts the number of cycles during which there was at least one demand load miss at L2.

On HSW, CYCLE_ACTIVITY.CYCLES_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING because of the miss-pending loads issued by the L1D prefetcher (and/or the L2 prefetcher(s), depending on whether the prefetcher increments the counter for the same level of cache).

Similarly, although they count different things, CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.STALLS_L1D_PENDING due to prefetching. TLB prefetching and prefetching at other MMU caches may also impact these performance events on HSW.

On the other hand, on SNB, it is guaranteed that CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING, and that's why the %L2_Bound formula is valid on SNB.
Like I said in the comment, disabling HT and/or prefetching may "fix" your problem.
Actually, the Intel specification update document for the Mobile Haswell processors mentions two bugs that affect CYCLES_L2_PENDING:

CYCLES_L2_PENDING on Haswell is supposed to count only for demand loads, but it may count inaccurately in SMT mode.

CYCLES_L2_PENDING may overcount due to requests from the next page prefetcher (NPP).

I think you can minimize the error in CYCLES_L2_PENDING by disabling SMT (either in the BIOS or by putting the other logical core to sleep). In addition, try not to trigger the NPP. This can be achieved by avoiding locations towards the end of a virtual page where the translation of the next page is not already in the TLB hierarchy.
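One way to act on that last suggestion is to keep the hot working set away from the final cache lines of each virtual page. This is only a sketch of the idea; the two-line margin is an arbitrary choice, not a documented requirement:

```python
# Sketch: confine accesses to the start of a page so loads never land in
# the last cache lines, where they could invite the next page prefetcher.
# The 2-cache-line margin is an arbitrary, illustrative choice.
import mmap

PAGE = mmap.PAGESIZE          # typically 4096 on x86-64
LINE = 64                     # cache line size on Haswell
MARGIN = 2 * LINE             # bytes at the end of the page left untouched

buf = mmap.mmap(-1, PAGE)     # anonymous mapping, page-aligned
hot_region = memoryview(buf)[:PAGE - MARGIN]

# Only ever touch hot_region; the final MARGIN bytes stay unused.
hot_region[:8] = b"\x01" * 8
print(len(hot_region))
```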
Related: When L1 misses are a lot different than L2 accesses… TLB related?
Regarding the first part of the question, i.e., how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLE_ACTIVITY.CYCLES_LDM_PENDING: one explanation I can think of is that CYCLE_ACTIVITY.CYCLES_LDM_PENDING also counts for loads issued from (some) other threads (in particular, on the same physical core), not just the halted thread. Erratum HSM146 mentions that CYCLES_LDM_PENDING may count inaccurately when the logical core is not in C0, which explains how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLES_LDM_PENDING. Disabling HT may eliminate this inaccuracy, although the spec update document doesn't provide any workaround.