
Understanding CYCLE_ACTIVITY.* Haswell Performance-Monitoring Events

I'm trying to analyse an execution on an Intel Haswell CPU (Intel® Core™ i7-4900MQ) with the Top-down Microarchitecture Analysis Method (TMAM), described in Chapters B.1 and B.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. (I adjust the Sandy Bridge formulas described in B.4 to the Haswell Microarchitecture if needed.)

To do this, I measure performance counter events with perf. There are some results I don't understand:

  1. CPU_CLK_UNHALTED.THREAD_P < CYCLE_ACTIVITY.CYCLES_LDM_PENDING

This holds for only a few measurements, but it is still odd. Does the PMU count halted cycles for CYCLE_ACTIVITY.CYCLES_LDM_PENDING?

  2. CYCLE_ACTIVITY.CYCLES_L2_PENDING > CYCLE_ACTIVITY.CYCLES_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING > CYCLE_ACTIVITY.STALLS_L1D_PENDING

This applies to all measurements. When there is an L1D cache miss, the load is passed on to the L2 cache, right? So a load that missed L2 must have missed L1 earlier. The L1 instruction cache is not counted here, but with *_L2_PENDING being 100x or even 1000x greater than *_L1D_PENDING, that is probably not the cause. Are the stalls/cycles somehow being measured separately? But then there is this formula:

%L2_Bound = (CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS

Hence CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING is assumed (the result of the formula must be positive). (The other thing with this formula is that it should probably be CYCLES instead of STALLS. However this wouldn't solve the problem described above.) So how can this be explained?
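
To make the problem concrete, here is a minimal Python sketch of the formula. The function name `l2_bound` and the sample counts in the usage lines are made up for illustration; they are not measurements from this machine. The sketch returns `None` instead of a negative fraction when the counters contradict the assumption behind the formula.

```python
def l2_bound(stalls_l1d_pending, stalls_l2_pending, clocks):
    """Fraction of CLOCKS attributed to L2-bound stalls.

    Implements (STALLS_L1D_PENDING - STALLS_L2_PENDING) / CLOCKS.
    Returns None when STALLS_L2_PENDING exceeds STALLS_L1D_PENDING,
    because the formula would then yield a meaningless negative value.
    """
    if clocks <= 0:
        raise ValueError("clocks must be positive")
    diff = stalls_l1d_pending - stalls_l2_pending
    if diff < 0:
        return None
    return diff / clocks


# Hypothetical counts where the formula's assumption holds.
print(l2_bound(300, 100, 1000))
# Hypothetical counts shaped like the observation above (L2 >> L1D).
print(l2_bound(100, 10000, 1000))
```

With counts like the ones I observe, the second case is the one that occurs, so the formula cannot be applied as written.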

edit: My OS: Ubuntu 14.04.3 LTS, kernel: 3.13.0-65-generic x86_64, perf version: 3.13.11-ckt26

lary asked Nov 12 '15 17:11



1 Answer

I'll start with the second part of the question, i.e., how CYCLE_ACTIVITY.CYCLES_L2_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L1D_PENDING, respectively.

First, note that the formula for %L2_Bound is from Section B.5 of the Intel Optimization Manual. The first paragraph of that section says:

This section covers various performance tuning techniques using performance monitoring events. Some techniques can be adapted in general to other microarchitectures, most of the performance events are specific to Intel microarchitecture code name Sandy Bridge.

My first hunch was that prefetching has something to do with it (see my comment). This paragraph pushed me further in the right direction; these events may represent different things in Sandy Bridge and in Haswell. Here is what they mean on Haswell:

CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles with pending L1 data cache miss loads.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Cycles with pending L2 miss loads.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Execution stalls due to L1 data cache miss loads.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Number of loads missed L2.

The manual also says the counters for L2 should only be used when hyperthreading is disabled. Now here is what they mean on Sandy Bridge:

CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Each cycle there was a miss-pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Each cycle there was a MLC-miss pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Each cycle there was a miss-pending demand load this thread and no uops dispatched, increment by 1.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Each cycle there was a MLC-miss pending demand load and no uops dispatched on this thread, increment by 1.

There are three important differences:

  • Some of the Haswell events are only valid when HT is disabled. All SNB events are valid even when HT is enabled.
  • CYCLE_ACTIVITY.STALLS_L2_PENDING on HSW counts the number of load misses at L2, but on SNB, it counts the number of cycles during which there was at least one demand load miss at L2.
  • The HSW events include all accesses, not just demand loads. In contrast, the SNB events only occur for demand loads.

On HSW, CYCLE_ACTIVITY.CYCLES_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING because of the miss-pending loads issued by the L1D prefetcher (and/or the L2 prefetcher(s) depending on whether the prefetcher increments the counter for the same level of cache). Similarly, while they count different things, CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.STALLS_L1D_PENDING due to prefetching. TLB prefetching and prefetching at other MMU caches may also impact these performance events on HSW. On the other hand, on SNB, it is guaranteed that CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING, and that's why the %L2_Bound formula is valid on SNB.
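
As a rough sketch of how to spot this in a set of measurements, the following Python snippet checks the orderings that follow from the Sandy Bridge definitions quoted above (an L2-miss-pending load is also L1D-miss-pending, and a stall cycle with a pending miss is also a cycle with a pending miss). The names `EXPECTED_ORDER` and `violated_orderings` and the sample counts are hypothetical, and the counts would have to be collected separately, e.g. with perf.

```python
# Pairs (smaller, larger): under the Sandy Bridge definitions the first
# event should never exceed the second.
EXPECTED_ORDER = [
    ("CYCLE_ACTIVITY.CYCLES_L2_PENDING", "CYCLE_ACTIVITY.CYCLES_L1D_PENDING"),
    ("CYCLE_ACTIVITY.STALLS_L2_PENDING", "CYCLE_ACTIVITY.STALLS_L1D_PENDING"),
    ("CYCLE_ACTIVITY.STALLS_L1D_PENDING", "CYCLE_ACTIVITY.CYCLES_L1D_PENDING"),
    ("CYCLE_ACTIVITY.STALLS_L2_PENDING", "CYCLE_ACTIVITY.CYCLES_L2_PENDING"),
]


def violated_orderings(counts):
    """Return the (smaller, larger) pairs whose expected ordering is broken.

    `counts` maps event names to raw counter values; pairs with a missing
    event are skipped rather than reported.
    """
    return [
        (small, large)
        for small, large in EXPECTED_ORDER
        if small in counts and large in counts and counts[small] > counts[large]
    ]


# Made-up counts shaped like the asker's observation on Haswell.
sample = {
    "CYCLE_ACTIVITY.CYCLES_L1D_PENDING": 1_000,
    "CYCLE_ACTIVITY.CYCLES_L2_PENDING": 100_000,
}
for small, large in violated_orderings(sample):
    print(f"{small} > {large}: SNB-style formulas are not applicable")
```

Any reported pair indicates that the Sandy Bridge formulas, including %L2_Bound, should not be applied to those counts without adjustment.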

Like I said in the comment, disabling HT and/or prefetching may "fix" your problem.

Actually, the Intel spec update document for the Mobile Haswell processors mentions two bugs that affect CYCLES_L2_PENDING:

  • HSM63: The intended behavior of CYCLES_L2_PENDING on Haswell is to count only for demand loads, but it may count inaccurately in SMT mode.
  • HSM80: CYCLES_L2_PENDING may overcount due to requests from the next page prefetcher.

I think you can minimize the error in CYCLES_L2_PENDING by disabling SMT (either in the BIOS or by putting the other logical core to sleep). In addition, try not to trigger the next page prefetcher (NPP). This can be achieved by avoiding accesses to locations towards the end of a virtual page when the translation of the next page is not already in the TLB hierarchy.
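
One possible way to keep the sibling logical core quiet on Linux is to take it offline through sysfs; whether offlining is equivalent to the "sleep" suggested above for these counters is an assumption on my part. The sketch below only locates the relevant files and does not write to them. The helper names `parse_cpu_list` and `sibling_online_files` are hypothetical; the paths `topology/thread_siblings_list` and `online` are the standard Linux sysfs CPU files.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-1' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_online_files(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the 'online' control files of the SMT siblings of `cpu`.

    Reads topology/thread_siblings_list for the given logical CPU and
    excludes the CPU itself from the result.
    """
    root = Path(sysfs_root)
    siblings_file = root / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    siblings = parse_cpu_list(siblings_file.read_text())
    return [root / f"cpu{s}" / "online" for s in siblings if s != cpu]


# Both list formats appear in sysfs depending on how siblings are numbered.
print(parse_cpu_list("0,4\n"))
print(parse_cpu_list("0-1\n"))
```

Writing `0` to one of the returned files as root takes that logical CPU offline, and writing `1` brings it back; the measured thread should then be pinned to the remaining logical CPU of that core.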

Related: When L1 misses are a lot different than L2 accesses… TLB related?

Regarding the first part of the question, i.e., how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLE_ACTIVITY.CYCLES_LDM_PENDING: one explanation I could think of is that CYCLE_ACTIVITY.CYCLES_LDM_PENDING also occurs for loads issued from (some) other threads (in particular, on the same physical core), not just the halted thread. Erratum HSM146 mentions that CYCLES_LDM_PENDING may count inaccurately when the logical core is not in C0, which explains how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLES_LDM_PENDING. Disabling HT may eliminate this inaccuracy, although the spec update document doesn't provide any workaround.

Hadi Brais answered Oct 31 '22 15:10
