I'm looking for a Linux utility that allows profiling the cache eviction in my program. Specifically, I'm interested in finding what causes certain cache line(s) to be repeatedly evicted from L2 cache.
Any suggestions?
You have several options at your disposal, some of which are free. Below I'll mostly talk about profiling L2 misses, not necessarily L2 evictions, since those are more or less the same thing: lines get evicted from the L2 because another line is being brought in, and another line is being brought in usually due to an L2 miss [1].
First, I'd try out cachegrind. It basically runs your binary under a type of lightweight virtual machine, which allows it to intercept all memory accesses and model their effect on the caches. It can pinpoint exactly where cache misses occur, which accesses are responsible for evictions, and so on.
It is important to note that cachegrind doesn't actually tell you what's going on with the hardware caches but rather what happens in its cache model. Since the L1 and L2 are simple enough on Intel x86, the cachegrind model should be accurate, except in unusual cases.
Cachegrind can only simulate two cache levels, but modern Intel chips have three or sometimes four. That shouldn't be a problem if you are trying to evaluate L2 misses, though. By default cachegrind sets its L1 cache to the detected values of the local L1 cache, and its last-level ("LL") cache to the detected values of the LLC. In your case, you'll want to override that latter decision to reflect the L2 cache, not the LLC. You can find the details in the manual, but this should be correct for Intel chips up to and including Broadwell:
--LL=262144,8,64
For Skylake client/Kaby Lake and friends you'd want:
--LL=262144,4,64
For Skylake-X server you'll want to look up the new values because the L2 changed.
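Putting it together, a run might look like this (./your_program is a stand-in for your binary, and the --LL values assume the Broadwell-like 256 KiB, 8-way L2 from above):

valgrind --tool=cachegrind --LL=262144,8,64 ./your_program
cg_annotate cachegrind.out.&lt;pid&gt;

cg_annotate then annotates your source line by line, and with the override above the last-level ("LL") miss columns correspond to your L2.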
The primary downside of this approach is that you can't be 100% sure that the cache model is an accurate reflection of reality (e.g., it doesn't model things like prefetching or virtual-physical paging). Another downside is that running a process under cachegrind is probably an order of magnitude slower than running it native, but for an investigation outside of "production" this probably isn't an issue.
You can use perf, the default, included and free profiling tool, to learn what's actually going on with your real hardware.
In particular, you can use perf record combined with perf report or perf annotate to determine where in your program misses are occurring. You can start with something like this:
perf record -e mem_load_retired.l2_miss <your process>
This periodically records where L2 misses occur. You can display the result with perf report, which lets you explore the results interactively. There are lots of other options, such as --call-graph to record the full call graph, which may be useful.
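As a sketch, a run that also captures call stacks might look like this (./your_program is a placeholder; the event name is the one used above and is CPU-specific, so check perf list if it isn't available on your machine):

perf record -e mem_load_retired.l2_miss --call-graph dwarf ./your_program
perf report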
The perf record approach tells you where in your code something is happening, but it doesn't help you determine what memory was being accessed when the misses occurred. That often doesn't matter: the location in the code usually makes it obvious what memory is being accessed. Sometimes, however, that's not the case: you have some code that might access a large region of memory and you want to know the address to figure out why misses are occurring.
In that case you can use perf mem, which records both the location in the code of the miss and the address of the miss. This tool isn't as polished as the others, but the source is at least available so you could always make some improvements. I cover this option in some detail in another answer.
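A minimal sketch of that workflow, again with a placeholder binary name, is:

perf mem record ./your_program
perf mem report

The report includes the data addresses sampled for each access, which is exactly the extra information plain perf record doesn't give you.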
The primary disadvantage of perf is that it is less straightforward to use than something like cachegrind. The behavior and available events depend on your hardware and kernel version, and sometimes things like stack traces don't work, etc. You have to be relatively comfortable with the command line to make good use of this tool.
Finally, there is Intel VTune. It uses the same underlying performance counters as perf, but offers GUI-based exploration and is perhaps easier to jump into than perf. It takes more of a top-down approach: telling you where the problems are and allowing you to drill down, whereas perf is more about "here's the raw data, figure out what's wrong".
It provides specific analyses, like the Memory Access Analysis, which might be appropriate for your problem. The main downside is that it is a paid product, unless you qualify to use it for free. It may be somewhat easier to use than perf, but it's still not exactly easy, and there is a lot of magic going on under the hood, so if something goes wrong it may be hard to debug.
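If you do try it, collections can also be driven from the command line and the results opened in the GUI; something along these lines should be close, though the exact analysis-type name (memory-access here) is an assumption that may vary between VTune versions:

vtune -collect memory-access -- ./your_program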
[1] In some scenarios this might not be true. The main one I can think of is if prefetching into L2 causes most lines to arrive before they are missed. In that case, the number of L2 replacements might be much higher than the number of L2 misses. This is the kind of thing that cachegrind won't be able to help you with, but perf can: you can compare the number of L2 lines brought in or replaced to the number of L2 misses and see if they are close. If they aren't, you'll have to play around with other counters to see if prefetching is the cause.
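A quick way to make that comparison (the event names here are assumptions for a typical recent Intel core; run perf list to find the exact spellings on your machine) is something like:

perf stat -e l2_lines_in.all,mem_load_retired.l2_miss ./your_program

If the lines-in count is much larger than the miss count, prefetching is probably responsible for most of the L2 fill traffic.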