I have a critical path which executes in one thread, pinned to a single core.
I am interested in identifying where cache misses are occurring. After looking around it seems valgrind's cachegrind tool would help me. However I have some questions regarding the tool's capabilities in this scenario:
Question 1 is the most important.
Any help with command line arguments is most appreciated.
To use this tool, you must specify --tool=cachegrind on the Valgrind command line. Cachegrind is a tool for doing cache simulations and annotating your source line-by-line with the number of cache misses. In particular, it records: L1 instruction cache reads and misses;
But by contrast with normal Valgrind use, you probably do want to turn optimisation on, since you should profile your program as it will be normally run. The two steps are: Run your program with valgrind --tool=cachegrind in front of the normal command line invocation. When the program finishes, Cachegrind will print summary cache statistics.
The Valgrind profiling tools are cachegrind and callgrind. The cachegrind tool simulates the L1/L2 caches and counts cache misses/hits. The callgrind tool counts function calls and the CPU instructions executed within each call and builds a function callgraph.
Use the -I / --include option to tell Valgrind where to look for source files if the filenames found from the debugging information aren't specific enough. Beware that cg_annotate can take some time to digest large cachegrind.out.<pid> files, e.g. 30 seconds or more. Also beware that auto-annotation can produce a lot of output if your program is large!
cachegrind can output both global and local information concerning cache misses, and annotate at the line level (if the original program was compiled with debug information). For instance, the following code:
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
    size_t n = (argc == 2) ? atoi(argv[1]) : 100;
    double *v = malloc(sizeof(double) * n);
    for (size_t i = 0; i < n; i++)
        v[i] = i;
    double s = 0;
    for (size_t i = 0; i < n; ++i)
        s += v[i] * v[n - 1 - i];
    printf("%f\n", s); /* s is a double, so %f, not %ld */
    free(v);
    return 0;
}
compiled with gcc a.c -O2 -g -o a
and run with valgrind --tool=cachegrind ./a 10000000
outputs:
==11551== Cachegrind, a cache and branch-prediction profiler
==11551== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==11551== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11551== Command: ./a 10000000
==11551==
--11551-- warning: L3 cache found, using its data for the LL simulation.
80003072
==11551==
==11551== I refs: 150,166,282
==11551== I1 misses: 876
==11551== LLi misses: 870
==11551== I1 miss rate: 0.00%
==11551== LLi miss rate: 0.00%
==11551==
==11551== D refs: 30,055,919 (20,041,763 rd + 10,014,156 wr)
==11551== D1 misses: 3,752,224 ( 2,501,671 rd + 1,250,553 wr)
==11551== LLd misses: 3,654,291 ( 2,403,770 rd + 1,250,521 wr)
==11551== D1 miss rate: 12.4% ( 12.4% + 12.4% )
==11551== LLd miss rate: 12.1% ( 11.9% + 12.4% )
==11551==
==11551== LL refs: 3,753,100 ( 2,502,547 rd + 1,250,553 wr)
==11551== LL misses: 3,655,161 ( 2,404,640 rd + 1,250,521 wr)
==11551== LL miss rate: 2.0% ( 1.4% + 12.4% )
The I1 miss rate tells us there were virtually no instruction cache misses.
The D1 miss rate tells us there were a lot of L1 data cache misses.
The LL miss rate tells us there were some last-level cache misses.
To get a more accurate view of the miss locations, we can run kcachegrind cachegrind.out.11551, select the "L1 Data Read Miss" event, and navigate through the annotated application code.
This should answer 1). I think the answer is no to 2), 3) and 4). It's yes for 5) if you compiled with debug info (without it, you'll get the global information, but not the per-line information). As for 6), I'd say valgrind usually provides a very decent first approximation; going to perf is obviously more accurate!