I'm working on a custom implementation on top of the perf_event_open syscall.
The implementation aims to support various PERF_TYPE_HARDWARE, PERF_TYPE_SOFTWARE and PERF_TYPE_HW_CACHE events for specific threads on any core.
In Intel® 64 and IA-32 Architectures Software Developer’s Manual vol 3B I see the following for my testing CPU (Kaby Lake):
To my understanding so far, one can (theoretically) monitor an unlimited number of PERF_TYPE_SOFTWARE events concurrently, but only a limited number of PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events (without multiplexing), since each such event is measured by one of the limited number of counters of the CPU's PMU (as can be seen in the manual above).
So for a quad-core Kaby Lake CPU with HyperThreading enabled I assume that up to 4 PERF_TYPE_HARDWARE/PERF_TYPE_HW_CACHE events can be monitored concurrently (or up to 8 if only 4 threads are used).
Experimenting with the above assumptions I found out that while I can successfully monitor up to 4 PERF_TYPE_HARDWARE events (for 8 threads), this is not the case for PERF_TYPE_HW_CACHE events, where only up to 2 events can be monitored concurrently!
I also tried to use only 4 threads, but the upper limit of concurrently monitored PERF_TYPE_HARDWARE events remains 4. The same happens with HyperThreading disabled!
One could ask: why do you need to avoid multiplexing? First of all, the implementation needs to be as accurate as possible, which means avoiding the potential blind spots of multiplexing. Secondly, when the "upper limit" is exceeded, all event values are 0...
The PERF_TYPE_HW_CACHE events I'm targeting are:
CACHE_LLC_READ(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_READ.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_ACCESS.value << 16),
CACHE_LLC_WRITE(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_WRITE.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_ACCESS.value << 16),
CACHE_LLC_READ_MISS(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_READ.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_MISS.value << 16),
CACHE_LLC_WRITE_MISS(PERF_HW_CACHE_TYPE_ID.PERF_COUNT_HW_CACHE_LL.value | PERF_HW_CACHE_OP_ID.PERF_COUNT_HW_CACHE_OP_WRITE.value << 8 | PERF_HW_CACHE_OP_RESULT_ID.PERF_COUNT_HW_CACHE_RESULT_MISS.value << 16),
All are implemented with the provided formula:
    (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) | (perf_hw_cache_op_result_id << 16)
and are manipulated as a group (the first is the group leader, etc.).
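For reference, the formula above can be sketched in C against the constants from <linux/perf_event.h>. The helper names hw_cache_config and init_llc_attr are hypothetical, and the choice of read_format and exclude_* flags is only an assumption about a typical counting setup, not something required by the formula.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Encode a PERF_TYPE_HW_CACHE config value:
 * cache id in bits 0-7, operation in bits 8-15, result in bits 16-23. */
static uint64_t hw_cache_config(unsigned cache, unsigned op, unsigned result)
{
    assert(cache < PERF_COUNT_HW_CACHE_MAX);
    assert(op < PERF_COUNT_HW_CACHE_OP_MAX);
    assert(result < PERF_COUNT_HW_CACHE_RESULT_MAX);
    return (uint64_t)cache | ((uint64_t)op << 8) | ((uint64_t)result << 16);
}

/* Hypothetical helper: fill the attr for one LLC event of a group.
 * Only the leader starts disabled, so enabling the leader later
 * starts all members of the group together. */
static void init_llc_attr(struct perf_event_attr *attr,
                          unsigned op, unsigned result, int is_leader)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HW_CACHE;
    attr->size = sizeof(*attr);
    attr->config = hw_cache_config(PERF_COUNT_HW_CACHE_LL, op, result);
    attr->read_format = PERF_FORMAT_GROUP; /* read all members via the leader */
    attr->disabled = is_leader ? 1 : 0;
    attr->exclude_kernel = 1;              /* assumption: user-space counts only */
    attr->exclude_hv = 1;
}
```

Each attr would then be passed to perf_event_open, with group_fd set to -1 for the leader and to the leader's file descriptor for the other members.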
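So, my questions are the following:
1. Which registers are used for PERF_TYPE_HARDWARE and which for PERF_TYPE_HW_CACHE events, and where can I find this information?
2. What is the difference between the PERF_TYPE_HARDWARE pre-defined events (such as PERF_COUNT_HW_CACHE_MISSES) and the PERF_TYPE_HW_CACHE events?
3. Is there a way to monitor, without multiplexing, all of the PERF_TYPE_HW_CACHE events?
4. Why does HyperThreading not affect the upper limit of PERF_TYPE_HARDWARE or/and PERF_TYPE_HW_CACHE events?

Thanks in advance!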
PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to two sets of registers involved in performance monitoring. The first set of MSRs is called IA32_PERFEVTSELx, where x ranges from 0 to N-1, N being the total number of general-purpose counters available. PERFEVTSEL is short for "performance event select"; these registers specify the conditions under which an event is counted. The second set of MSRs is called IA32_PMCx, where x ranges over the same values as for PERFEVTSEL. These PMC registers store the counts of performance monitoring events. Each PERFEVTSEL register is paired with a corresponding PMC register.
The mapping happens as follows:
At the initialization of the architecture-specific portion of the kernel, a PMU for measuring hardware-specific events is registered with type PERF_TYPE_RAW. All PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events are mapped to PERF_TYPE_RAW events to identify the PMU, as the following kernel check shows:
    if (type == PERF_TYPE_HARDWARE || type == PERF_TYPE_HW_CACHE)
        type = PERF_TYPE_RAW;
The same architecture-specific initialization is responsible for setting up the addresses of the first (base) register of each of the aforementioned sets of performance monitoring registers:
    .eventsel = MSR_ARCH_PERFMON_EVENTSEL0,
    .perfctr  = MSR_ARCH_PERFMON_PERFCTR0,
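To make the pairing concrete, here is a minimal sketch of how the x-th PERFEVTSEL/PMC pair can be addressed from the two base MSRs, and how a PERFEVTSEL value is composed. It assumes the contiguous Intel layout (base addresses 0x186 and 0xC1); the helper names evtsel_msr, pmc_msr and perfevtsel_value are hypothetical, and only the event-select, unit-mask, USR, OS and EN fields are modelled.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_ARCH_PERFMON_EVENTSEL0 0x186u /* IA32_PERFEVTSEL0 */
#define MSR_ARCH_PERFMON_PERFCTR0  0x0c1u /* IA32_PMC0 */

#define EVTSEL_USR (1ull << 16) /* count while in user mode (CPL > 0) */
#define EVTSEL_OS  (1ull << 17) /* count while in kernel mode (CPL == 0) */
#define EVTSEL_EN  (1ull << 22) /* enable the paired counter */

/* Address of the x-th event-select MSR and of its paired counter MSR. */
static uint32_t evtsel_msr(unsigned x) { return MSR_ARCH_PERFMON_EVENTSEL0 + x; }
static uint32_t pmc_msr(unsigned x)    { return MSR_ARCH_PERFMON_PERFCTR0 + x; }

/* Compose a PERFEVTSEL value: event select in bits 0-7, unit mask in bits 8-15. */
static uint64_t perfevtsel_value(uint8_t event, uint8_t umask, int usr, int os)
{
    uint64_t v = (uint64_t)event | ((uint64_t)umask << 8) | EVTSEL_EN;

    /* With neither privilege level selected the counter would never advance. */
    assert(usr || os);
    if (usr)
        v |= EVTSEL_USR;
    if (os)
        v |= EVTSEL_OS;
    return v;
}
```

For example, the architectural "LLC Misses" event (event select 0x2E, unit mask 0x41) counted in user mode only would be programmed as perfevtsel_value(0x2E, 0x41, 1, 0).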
The event_init function specific to the identified PMU is responsible for setting up and "reserving" the two sets of performance monitoring registers, as well as checking event constraints. The reservation looks like this:
    for (i = 0; i < x86_pmu.num_counters; i++) {
        if (!reserve_perfctr_nmi(x86_pmu_event_addr(i)))
            goto perfctr_fail;
    }
    for (i = 0; i < x86_pmu.num_counters; i++) {
        if (!reserve_evntsel_nmi(x86_pmu_config_addr(i)))
            goto eventsel_fail;
    }
The value num_counters is the number of general-purpose counters, as reported by the CPUID instruction.
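As an illustration of where that number comes from, the sketch below decodes the architectural performance monitoring leaf (CPUID leaf 0x0A) from raw EAX and EDX values. The struct and function names are hypothetical, and the register values are expected to be obtained separately (for example with a CPUID intrinsic); the decoder itself only does the bit extraction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields of CPUID leaf 0x0A. */
struct pmu_caps {
    unsigned version;            /* EAX[7:0]   architectural perfmon version */
    unsigned num_gp_counters;    /* EAX[15:8]  GP counters per logical CPU */
    unsigned gp_width;           /* EAX[23:16] bit width of the GP counters */
    unsigned num_fixed_counters; /* EDX[4:0]   fixed-function counters */
    unsigned fixed_width;        /* EDX[12:5]  bit width of the fixed counters */
};

static struct pmu_caps decode_cpuid_0a(uint32_t eax, uint32_t edx)
{
    struct pmu_caps c;

    c.version = eax & 0xff;
    c.num_gp_counters = (eax >> 8) & 0xff;
    c.gp_width = (eax >> 16) & 0xff;
    /* The fixed-counter fields are only defined for version 2 and later. */
    c.num_fixed_counters = c.version > 1 ? (edx & 0x1f) : 0;
    c.fixed_width = c.version > 1 ? ((edx >> 5) & 0xff) : 0;
    return c;
}

/* Largest value a general-purpose counter can hold before it wraps. */
static uint64_t gp_counter_max(const struct pmu_caps *c)
{
    assert(c->gp_width >= 1 && c->gp_width <= 64);
    return c->gp_width == 64 ? UINT64_MAX : (1ull << c->gp_width) - 1;
}
```

With the illustrative inputs EAX = 0x07300404 and EDX = 0x00000603 (values chosen to match the 4 general-purpose and 3 fixed counters discussed here, not read from a real machine), the decoder yields version 4, 4 general-purpose counters and 3 fixed counters, all 48 bits wide.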
In addition to this, there are a couple of extra registers that monitor offcore events (e.g. the LLC-specific events).
In later versions of architectural performance monitoring, some of the hardware events are measured with the help of fixed-purpose registers. These are the fixed-purpose registers:
    #define MSR_ARCH_PERFMON_FIXED_CTR0 0x309
    #define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
    #define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
The PERF_TYPE_HARDWARE pre-defined events are all architectural performance monitoring events. They are called architectural because the behavior of each such event is expected to be consistent on all processors that support it. All of the PERF_TYPE_HW_CACHE events are non-architectural, which means they are model-specific and may vary from one family of processors to another.
For the Intel Kaby Lake machine that I have, a total of 20 PERF_TYPE_HW_CACHE events are pre-defined. The event constraints ensure that the 3 available fixed-function counters are mapped to 3 PERF_TYPE_HARDWARE architectural events. Only one event can be measured on each of the fixed-function counters, so we can leave them out of this analysis. The other constraint is that only two events targeting the LLC can be measured at the same time, since there are only two OFFCORE RESPONSE registers. Also, the nmi-watchdog may pin an event to one of the general-purpose counters. If the nmi-watchdog is disabled, we are left with 4 general-purpose counters.
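A program can check the watchdog state before deciding how many events to put into a group. The sketch below reads a file in the format of /proc/sys/kernel/nmi_watchdog (a single 0 or 1) and derives a conservative counter budget. Both helper names are hypothetical, and reserving one counter whenever the watchdog is on or its state is unknown is a cautious assumption rather than a kernel guarantee.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the watchdog is enabled, 0 if it is disabled,
 * and -1 if the file cannot be read or parsed. */
static int nmi_watchdog_enabled(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value < 0 ? -1 : (value != 0);
}

/* Conservative number of general-purpose counters left for our events.
 * An unknown watchdog state (-1) is treated like "enabled". */
static unsigned usable_gp_counters(unsigned total, int watchdog)
{
    assert(total > 0);
    return watchdog == 0 ? total : total - 1;
}
```

A typical call would be usable_gp_counters(4, nmi_watchdog_enabled("/proc/sys/kernel/nmi_watchdog")), with 4 taken from the counter count discussed above.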
Given the constraints involved and the limited number of counters available, there is no way to avoid multiplexing if all 20 hardware cache events are measured at the same time. Some workarounds to measure all the events without incurring multiplexing and its errors are:
3.1. Group the PERF_TYPE_HW_CACHE events into groups of 4, such that all 4 events of a group can be scheduled on the 4 general-purpose counters at the same time. Make sure there are no more than 2 LLC events in a group. Run the same profile once per group and obtain the counts for each group separately.
3.2. If all the PERF_TYPE_HW_CACHE events are to be monitored at the same time, some of the multiplexing error can be reduced by decreasing the value of perf_event_mux_interval_ms. It can be configured via the sysfs entry /sys/devices/cpu/perf_event_mux_interval_ms. This value cannot be lowered beyond a certain point.
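The grouping described in 3.1 can be sketched as a simple first-fit assignment. This is only one possible way to build the groups: the function name assign_groups, the MAX_EVENTS bound and the is_llc flag array are hypothetical, and the limits of 4 events and 2 LLC events per group are taken from the constraints described above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS        64 /* arbitrary bound for this sketch */
#define MAX_GROUP_SIZE    4  /* general-purpose counters available */
#define MAX_LLC_PER_GROUP 2  /* OFFCORE RESPONSE registers available */

/* is_llc[i] != 0 marks event i as an LLC event.
 * Stores the group index of every event in group_of[] and
 * returns the number of groups, i.e. the number of profiling runs. */
static size_t assign_groups(const int *is_llc, size_t n, size_t *group_of)
{
    size_t size[MAX_EVENTS] = {0}; /* events already placed in each group */
    size_t llc[MAX_EVENTS] = {0};  /* LLC events already placed in each group */
    size_t groups = 0;

    assert(n <= MAX_EVENTS);
    for (size_t i = 0; i < n; i++) {
        size_t g = 0;

        /* First fit: skip groups that have no free counter or, for an
         * LLC event, no free offcore register. */
        while (g < groups &&
               (size[g] == MAX_GROUP_SIZE ||
                (is_llc[i] && llc[g] == MAX_LLC_PER_GROUP)))
            g++;
        if (g == groups)
            groups++; /* nothing fits, open a new group */
        size[g]++;
        if (is_llc[i])
            llc[g]++;
        group_of[i] = g;
    }
    return groups;
}
```

Each resulting group is then opened as its own perf event group and measured in a separate run of the same workload, so no two groups ever compete for counters.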
The general-purpose counters are detected with the CPUID instruction, and the number of such counters is set up in the architecture initialization portion of the kernel startup via the early_initcall function. Once the initialization is done, the kernel understands that only 4 counters are available, and any later change in hyperthreading capabilities does not make any difference.