Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level).
To measure the memory bandwidth for a function, I wrote a simple benchmark. For each function, I access a large array of memory and compute the bandwidth by dividing the amount of memory accessed by the run time. For example, if a function takes 120 milliseconds to access 1 GB of memory, I calculate the bandwidth to be 8.33 GB/s.
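A minimal sketch of that benchmark in C (my reconstruction, not the original poster's code; the buffer size, the 64-byte line assumption, and the use of wall-clock time are my choices):

```c
/* Touch a buffer much larger than the last-level cache, one byte per
 * assumed 64-byte cache line, and report bytes-accessed / seconds. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1ULL << 30)   /* 1 GiB, far larger than any LLC */

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    if (!buf) return 1;
    memset(buf, 1, BUF_SIZE);    /* fault all pages in before timing */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_SIZE; i += 64)   /* one access per line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* Matches the arithmetic above: 1 GB in 0.120 s -> 8.33 GB/s.
     * Actual DRAM traffic can be ~2x, since each dirtied line is
     * fetched and later written back. */
    printf("%.2f GB/s\n", BUF_SIZE / secs / 1e9);
    free(buf);
    return 0;
}
```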
Performance counters monitor, count, or measure hardware events, letting us see usage patterns from a high-level view. On modern CPUs they are implemented as dedicated hardware registers that the operating system exposes, so anyone with the proper permissions can read them.
They can provide a wealth of information about how the hardware is being used by software. Many processors now support events that measure, precisely and with very limited overhead, the traffic between a core and the memory subsystem, making it possible to compute average load latency and bus bandwidth utilization.
The CPU performance counters count events such as retired instructions and clock ticks. Reading a counter before and after a C function runs and storing the difference in a global variable gives that function's run time in ticks.
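As a hedged sketch of that timing scheme (assuming GCC or Clang on x86; all names are mine, not from the post):

```c
/* Time a C function in TSC ticks and keep the result in a global,
 * as described above. A real harness would serialize around the
 * reads (e.g. with CPUID/LFENCE) and convert ticks to seconds. */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

uint64_t g_func_ticks;   /* the "global variable" holding the result */

static void function_under_test(void) { /* ... work ... */ }

void measure(void)
{
    uint64_t start = __rdtsc();
    function_under_test();
    g_func_ticks = __rdtsc() - start;
}
```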
Yes(ish), indirectly. You can use the relationship between counters (including the time stamp counter) to infer other numbers. For example, if you sample a 1-second interval and observe N last-level (L3) cache misses, you can be fairly confident you are consuming N * CacheLineSize bytes per second (a concrete sketch of this estimate follows after this answer).
It gets a bit stickier to relate this accurately to program activity, since those misses might reflect CPU prefetching, interrupt activity, and so on.
There is also a morass of "this CPU doesn't count (MMX, SSE, AVX, ...) unless this config bit is in this state" caveats, so rolling your own is cumbersome.
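A minimal sketch of the estimate above, using Linux's perf_event_open interface (my choice, not from the answer). The generic PERF_COUNT_HW_CACHE_MISSES event usually, but not always, maps to LLC misses, and the 64-byte line size is an assumption:

```c
/* Count LLC misses for one second on one CPU, then convert the count
 * to bytes/second by multiplying by the cache line size. Needs perf
 * privileges (root, or a permissive kernel.perf_event_paranoid). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                     int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    /* Generic event that usually (not always) maps to LLC misses. */
    attr.config = PERF_COUNT_HW_CACHE_MISSES;

    int fd = perf_event_open(&attr, -1, 0, -1, 0);  /* CPU 0, any pid */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    sleep(1);                                       /* 1-second sample */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("~%.2f MB/s inferred DRAM traffic (N * 64 bytes)\n",
           misses * 64 / 1e6);
    close(fd);
    return 0;
}
```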
Yes, this is possible, although it is not necessarily as straightforward as programming the usual PMU counters.
One approach is to use the programmable memory controller counters, which are accessed via PCI configuration space. A good place to start is Intel's own implementation in pcm-memory (pcm-memory.cpp). This tool shows per-socket or per-memory-controller throughput, which is suitable for some uses. In particular, this bandwidth is shared among all cores, so on a quiet machine you can assume most of it belongs to the process under test; and if you want to monitor at the socket level, it is exactly what you want.
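For what it's worth, the simplest way to apply this approach is to run the pcm-memory tool itself alongside your benchmark: it periodically prints read and write throughput per channel and per socket (binary names and output details vary across PCM versions).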
The other alternative is careful programming of the "offcore response" counters. These, as far as I know, measure traffic between the L2 (the last core-private cache) and the rest of the system. You can filter by the result of the offcore response, so you can use a combination of the various "L3 miss" events and multiply by the cache line size to get read and write bandwidth. The events are quite fine-grained, so you can further break traffic down by what caused the access in the first place: instruction fetch, demand data requests, prefetching, and so on.
The offcore response counters generally lag behind in support by tools like perf and likwid, but at least recent versions seem to have reasonable support, even for client parts like Skylake (SKL).
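For the raw mechanics, here is a hedged sketch of programming OFFCORE_RESPONSE_0 through perf_event_open (reusing the wrapper from the earlier sketch): the event/umask pair 0xB7/0x01 is standard on recent Intel big cores, and perf passes the response-filter MSR value through config1 (its offcore_rsp format field). The filter encoding itself is microarchitecture-specific, so it is left as a placeholder you must fill in for your CPU:

```c
#include <string.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* Wrapper as defined in the earlier perf_event_open example. */
long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                     int cpu, int group_fd, unsigned long flags);

/* PLACEHOLDER: the OFFCORE_RSP encoding for "demand data reads that
 * missed L3" is microarchitecture-specific; take the real value from
 * the Intel SDM or from `perf list`'s offcore_response.* events. */
#define OFFCORE_RSP_L3_MISS_READS 0x0ULL

int open_offcore_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x01B7;                     /* event 0xB7, umask 0x01 */
    attr.config1 = OFFCORE_RSP_L3_MISS_READS; /* response filter MSR    */
    return (int)perf_event_open(&attr, -1, 0, -1, 0); /* CPU 0, any pid */
}
```

As before, multiplying the resulting count by the cache line size turns misses into an estimated bandwidth.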