
Can the Intel performance monitor counters be used to measure memory bandwidth?

Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level).

BeeOnRope asked Dec 02 '17 21:12


2 Answers

Yes(ish), indirectly. You can use the relationship between counters (including the time stamp counter) to infer other numbers. For example, if you sample a 1-second interval and see N last-level (L3) cache misses, you can be fairly confident you are consuming about N*CacheLineSize bytes per second of memory bandwidth.

It gets a bit stickier to relate that number accurately to program activity, as those misses might also reflect CPU prefetching, interrupt activity, etc.
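As a rough illustration of the N*CacheLineSize arithmetic above, here is a minimal Python sketch. It assumes you already have two snapshots of an L3-miss counter and their timestamps from whatever tool you use; the function name and the 64-byte line size are assumptions of mine, not something the counters report. The result also includes whatever prefetch or interrupt traffic got counted.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines, typical for current x86 parts


def dram_bandwidth_bytes_per_s(misses_start, misses_end, t_start_s, t_end_s,
                               line_bytes=CACHE_LINE_BYTES):
    """Estimate bandwidth from two snapshots of a last-level cache miss counter.

    Each miss is assumed to move exactly one cache line to or from DRAM.
    """
    elapsed = t_end_s - t_start_s
    if elapsed <= 0:
        raise ValueError("sampling interval must be positive")
    return (misses_end - misses_start) * line_bytes / elapsed


# Example: 50 million misses observed over a 1-second window.
bw = dram_bandwidth_bytes_per_s(0, 50_000_000, 0.0, 1.0)
print(f"{bw / 1e9:.2f} GB/s")  # prints "3.20 GB/s"
```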

There is also a morass of ‘this CPU doesn’t count (MMX, SSE, AVX, ...) unless this config bit is in this state’ caveats, so rolling your own is cumbersome.

mevets answered Sep 24 '22 12:09


Yes, this is possible, although it is not necessarily as straightforward as programming the usual PMU counters.

One approach is to use the programmable memory controller counters, which are accessed via PCI configuration space. A good place to start is Intel's own implementation in pcm-memory, at pcm-memory.cpp. This app shows the per-socket or per-memory-controller throughput, which is suitable for some uses. Note that these counters see the combined traffic of all cores, not per-core traffic: on a quiet machine you can assume most of the bandwidth is associated with the process under test, and if you want to monitor at the socket level it's exactly what you want.

The other alternative is to use careful programming of the "offcore response" counters. These, as far as I know, relate to traffic between the L2 (the last core-private cache) and the rest of the system. You can filter by the result of the offcore response, so you can use a combination of the various "L3 miss" events and multiply by the cache line size to get read and write bandwidth. The events are quite fine-grained, so you can further break the traffic down by what caused the access in the first place: instruction fetch, demand data requests, prefetching, etc.
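To make the "combine the L3 miss events and multiply by the line size" step concrete, here is a hedged Python sketch. The event labels are hypothetical placeholders, not real event names for any tool or CPU, and which events count as reads versus writes is an assumption you would need to check against the event list for your part.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def bandwidth_breakdown(counts, elapsed_s, read_events, write_events,
                        line_bytes=CACHE_LINE_BYTES):
    """Turn per-event L3-miss counts into bytes/second.

    counts: dict mapping an event label to its count over the interval.
    read_events / write_events: labels to sum into read and write totals.
    Returns (per_event_bw, read_bw, write_bw), all in bytes per second.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    per_event = {name: n * line_bytes / elapsed_s for name, n in counts.items()}
    read_bw = sum(per_event.get(name, 0.0) for name in read_events)
    write_bw = sum(per_event.get(name, 0.0) for name in write_events)
    return per_event, read_bw, write_bw


# Hypothetical labels and counts, collected over a 2-second window.
counts = {"demand_data_rd": 1_000_000, "prefetch_rd": 3_000_000, "writeback": 2_000_000}
per_event, read_bw, write_bw = bandwidth_breakdown(
    counts, 2.0,
    read_events=("demand_data_rd", "prefetch_rd"),
    write_events=("writeback",),
)
print(f"read {read_bw / 1e6:.0f} MB/s, write {write_bw / 1e6:.0f} MB/s")
```

Keeping the per-event figures alongside the totals preserves the breakdown by cause (demand, prefetch, and so on) that these events allow.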

The offcore response counters generally lag behind in support by tools like perf and likwid, but at least recent versions seem to have reasonable support, even for client parts like Skylake (SKL).

BeeOnRope answered Sep 24 '22 12:09