
Using valgrind to measure cache misses [closed]

I have a critical path which executes in one thread, pinned to a single core.

I am interested in identifying where cache misses are occurring. After looking around it seems valgrind's cachegrind tool would help me. However I have some questions regarding the tool's capabilities in this scenario:

  1. How specific are the locations of cache misses provided? Does it output the variable name?
  2. Can I profile just one thread?
  3. Is it possible to profile specific parts of the code?
  4. Do all the capabilities for measuring cache misses apply equally to TLB misses?
  5. Can I use cachegrind with my release/optimised code?
  6. I understand valgrind uses a virtual machine to sample. How accurate is this approach?

Question 1 is the most important.

Any help with command-line arguments is much appreciated.

user997112 asked Sep 12 '15 18:09



1 Answer

cachegrind can output both global and local information about cache misses, and can annotate at the line level (if the program was compiled with debug information). For instance, the following code:

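#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char**argv) {
  /* Number of elements: first argument, or 100 by default. */
  size_t n = (argc == 2 ) ? strtoul(argv[1], NULL, 10) : 100;
  double* v = malloc(sizeof(double) * n);
  if (v == NULL)
    return 1;

  /* First pass: sequential writes over the whole array. */
  for(size_t i = 0; i < n ; i++)
    v[i] = (double)i;

  /* Second pass: one forward and one backward read stream. */
  double s = 0;
  for(size_t i = 0; i < n ; ++i)
    s += v[i] * v[n - 1 - i];

  /* s is a double, so it needs %f. The run below was made with %ld,
     which is undefined behaviour, so the number it printed is not the sum. */
  printf("%f\n", s);
  free(v);
  return 0;
}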

compiled with gcc a.c -O2 -g -o a and run with valgrind --tool=cachegrind ./a 10000000 outputs:

==11551== Cachegrind, a cache and branch-prediction profiler
==11551== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==11551== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==11551== Command: ./a 10000000
==11551== 
--11551-- warning: L3 cache found, using its data for the LL simulation.
80003072
==11551== 
==11551== I   refs:      150,166,282
==11551== I1  misses:            876
==11551== LLi misses:            870
==11551== I1  miss rate:        0.00%
==11551== LLi miss rate:        0.00%
==11551== 
==11551== D   refs:       30,055,919  (20,041,763 rd   + 10,014,156 wr)
==11551== D1  misses:      3,752,224  ( 2,501,671 rd   +  1,250,553 wr)
==11551== LLd misses:      3,654,291  ( 2,403,770 rd   +  1,250,521 wr)
==11551== D1  miss rate:        12.4% (      12.4%     +       12.4%  )
==11551== LLd miss rate:        12.1% (      11.9%     +       12.4%  )
==11551== 
==11551== LL refs:         3,753,100  ( 2,502,547 rd   +  1,250,553 wr)
==11551== LL misses:       3,655,161  ( 2,404,640 rd   +  1,250,521 wr)
==11551== LL miss rate:          2.0% (       1.4%     +       12.4%  )

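The I1 miss rate tells us there were almost no instruction cache misses (876 out of roughly 150 million instruction references).

The D1 miss rate tells us there were a lot of L1 data cache misses (12.4% of data references).

The LL miss rate tells us there were some last-level cache misses; nearly every D1 miss (3,654,291 out of 3,752,224) also missed in the last-level cache.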

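To get a more precise view of where the misses occur, we can run kcachegrind cachegrind.out.11551 (the file is named after the PID of the profiled run), select the L1 Data Read Miss event type and navigate through the application code. The misses are reported per source line, not per variable.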

This should answer 1). I think the answer is no to 2), 3) and 4). It's yes for 5) if you compiled with debug info (without it, you'll get the global info, but not the per-line info). As for 6), I'd say valgrind usually provides a very decent first approximation. Going to perf is obviously more accurate!

serge-sans-paille answered Sep 21 '22 18:09