Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level).
To measure the memory bandwidth for a function, I wrote a simple benchmark. For each function, I access a large array of memory and compute the bandwidth by dividing the amount of memory accessed by the run time. For example, if a function takes 120 milliseconds to access 1 GB of memory, I calculate the bandwidth to be 8.33 GB/s.
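A minimal sketch of that benchmark in C (my reconstruction, not the original poster's code; the buffer size, the 64-byte line assumption, and the use of wall-clock time are my choices):

```c
/* Touch a buffer much larger than the last-level cache, one byte per
 * assumed 64-byte cache line, and report bytes-accessed / seconds. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1ULL << 30)   /* 1 GiB, far larger than any LLC */

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    if (!buf) return 1;
    memset(buf, 1, BUF_SIZE);    /* fault all pages in before timing */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_SIZE; i += 64)   /* one access per line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* Matches the arithmetic above: 1 GB in 0.120 s -> 8.33 GB/s.
     * Actual DRAM traffic can be ~2x, since each dirtied line is
     * fetched and later written back. */
    printf("%.2f GB/s\n", BUF_SIZE / secs / 1e9);
    free(buf);
    return 0;
}
```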
Performance counters monitor, count, or measure hardware events, letting us see usage patterns from a high-level view. On modern CPUs they are implemented as dedicated hardware registers that the operating system exposes, so anyone with the proper permissions can read them.
They can provide a wealth of information about how the hardware is being used by software. Many processors now support events that measure, precisely and with very limited overhead, the traffic between a core and the memory subsystem, making it possible to compute average load latency and bus bandwidth utilization.
The CPU performance counters count events such as retired instructions and clock ticks. Reading a counter before and after a C function runs and storing the difference in a global variable gives that function's run time in ticks.
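As a hedged sketch of that timing scheme (assuming GCC or Clang on x86; all names are mine, not from the post):

```c
/* Time a C function in TSC ticks and keep the result in a global,
 * as described above. A real harness would serialize around the
 * reads (e.g. with CPUID/LFENCE) and convert ticks to seconds. */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

uint64_t g_func_ticks;   /* the "global variable" holding the result */

static void function_under_test(void) { /* ... work ... */ }

void measure(void)
{
    uint64_t start = __rdtsc();
    function_under_test();
    g_func_ticks = __rdtsc() - start;
}
```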
Yes(ish), indirectly. You can use the relationship between counters (including the time stamp counter) to infer other numbers. For example, if you sample a 1-second interval and observe N last-level (L3) cache misses, you can be fairly confident you are consuming N * CacheLineSize bytes per second (a concrete sketch of this estimate follows after this answer).
It gets a bit stickier to relate this accurately to program activity, since those misses might reflect CPU prefetching, interrupt activity, and so on.
There is also a morass of "this CPU doesn't count (MMX, SSE, AVX, ...) unless this config bit is in this state" caveats, so rolling your own is cumbersome.
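A minimal sketch of the estimate above, using Linux's perf_event_open interface (my choice, not from the answer). The generic PERF_COUNT_HW_CACHE_MISSES event usually, but not always, maps to LLC misses, and the 64-byte line size is an assumption:

```c
/* Count LLC misses for one second on one CPU, then convert the count
 * to bytes/second by multiplying by the cache line size. Needs perf
 * privileges (root, or a permissive kernel.perf_event_paranoid). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                     int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    /* Generic event that usually (not always) maps to LLC misses. */
    attr.config = PERF_COUNT_HW_CACHE_MISSES;

    int fd = perf_event_open(&attr, -1, 0, -1, 0);  /* CPU 0, any pid */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    sleep(1);                                       /* 1-second sample */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("~%.2f MB/s inferred DRAM traffic (N * 64 bytes)\n",
           misses * 64 / 1e6);
    close(fd);
    return 0;
}
```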
Yes, this is possible, although it is not necessarily as straightforward as programming the usual PMU counters.
One approach is to use the programmable memory controller counters, which are accessed via PCI configuration space. A good place to start is Intel's own implementation in pcm-memory (pcm-memory.cpp). This tool shows per-socket or per-memory-controller throughput, which is suitable for some uses. In particular, this bandwidth is shared among all cores, so on a quiet machine you can assume most of it belongs to the process under test; and if you want to monitor at the socket level, it is exactly what you want.
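For what it's worth, the simplest way to apply this approach is to run the pcm-memory tool itself alongside your benchmark: it periodically prints read and write throughput per channel and per socket (binary names and output details vary across PCM versions).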
The other alternative is careful programming of the "offcore response" counters. These, as far as I know, measure traffic between the L2 (the last core-private cache) and the rest of the system. You can filter by the result of the offcore response, so you can use a combination of the various "L3 miss" events and multiply by the cache line size to get read and write bandwidth. The events are quite fine-grained, so you can further break traffic down by what caused the access in the first place: instruction fetch, demand data requests, prefetching, and so on.
The offcore response counters generally lag behind in support by tools like perf and likwid, but at least recent versions seem to have reasonable support, even for client parts like Skylake (SKL).
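For the raw mechanics, here is a hedged sketch of programming OFFCORE_RESPONSE_0 through perf_event_open (reusing the wrapper from the earlier sketch): the event/umask pair 0xB7/0x01 is standard on recent Intel big cores, and perf passes the response-filter MSR value through config1 (its offcore_rsp format field). The filter encoding itself is microarchitecture-specific, so it is left as a placeholder you must fill in for your CPU:

```c
#include <string.h>
#include <sys/types.h>
#include <linux/perf_event.h>

/* Wrapper as defined in the earlier perf_event_open example. */
long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                     int cpu, int group_fd, unsigned long flags);

/* PLACEHOLDER: the OFFCORE_RSP encoding for "demand data reads that
 * missed L3" is microarchitecture-specific; take the real value from
 * the Intel SDM or from `perf list`'s offcore_response.* events. */
#define OFFCORE_RSP_L3_MISS_READS 0x0ULL

int open_offcore_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x01B7;                     /* event 0xB7, umask 0x01 */
    attr.config1 = OFFCORE_RSP_L3_MISS_READS; /* response filter MSR    */
    return (int)perf_event_open(&attr, -1, 0, -1, 0); /* CPU 0, any pid */
}
```

As before, multiplying the resulting count by the cache line size turns misses into an estimated bandwidth.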