Different cache miss count for a same program between multiple runs

I am using Cachegrind to retrieve the number of cache misses of a static program compiled without libc (just a _start that calls my main function and an exit syscall in asm). The program is fully deterministic: the instructions and the memory references do not change from one run to another. The cache is fully associative with LRU as the replacement policy.

However, I noticed that the number of misses changes sometimes. More specifically, the number of misses is always the same until I go to a different directory:

 % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./adpcm        
 ...
 ==31352== I   refs:      216,145,010
 ...
 ==31352== D   refs:      130,481,003  (95,186,001 rd   + 35,295,002 wr)
 ==31352== D1  misses:        240,004  (   150,000 rd   +     90,004 wr)
 ==31352== LLd misses:             31  (        11 rd   +         20 wr)

If I execute the same command again and again, I keep getting the same results. But if I run this program from a different directory:

 % cd ..
 % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./malardalen2/adpcm
 ...
 ==31531== I   refs:      216,145,010
 ...
 ==31531== D   refs:      130,481,003  (95,186,001 rd   + 35,295,002 wr)
 ==31531== D1  misses:        250,004  (   160,000 rd   +     90,004 wr)
 ==31531== LLd misses:             31  (        11 rd   +         20 wr)

And running from yet another directory gives yet another result.

I've also done some experiments with a Pin tool, and with that one I don't even need to change the directory to get different values. But the set of possible values seems very limited and is exactly the same as with Cachegrind.

My question is: what could be the sources of such differences?

My first hypothesis is that my program is not aligned the same way in memory, so some variables that were stored in the same cache line in one run no longer are in another. That would also explain the limited number of combinations. But I thought that Cachegrind (and Pin) used virtual addresses, and I'd assume that the OS (Linux) always hands out the same virtual addresses. Any other idea?

Edit: As you can guess from the LLd misses, the program only touches 31 different cache lines. Also, the cache can only hold 8 lines. So even on real hardware, the difference can't be explained by the cache being already populated from a previous run (at most, only 8 lines could stay in the L1).

Edit 2: Cachegrind's report is not based on actual cache misses (given by performance counters) but is the result of a simulation. Basically, it simulates the behavior of a cache in order to count the number of misses. Since a miss only has timing consequences, that's totally fine, and it allows changing the cache properties (size, associativity).

Edit 3: The hardware I am using is an Intel Core i7 on a Linux 3.2 x86_64. The compile flags are -static and for some programs -nostdlib (IIRC, I'm not at home right now).

asked Mar 24 '23 by Maxime Chéramy


2 Answers

Linux implements the "Address space layout randomization" technique (http://en.wikipedia.org/wiki/Address_space_layout_randomization) for security reasons. You can deactivate this behavior like this:

echo -n "0" > /proc/sys/kernel/randomize_va_space

You can test that through this example:

#include <stdint.h>
#include <stdio.h>

int main(void) {
   char a;
   /* print the low 32 bits of the stack variable's address;
      %u on a raw pointer is undefined behavior, so cast first */
   printf("%u\n", (unsigned)(uintptr_t)&a);
   return 0;
}

After disabling randomization, you should always get the same value printed.

Before:

 % ./a.out
4006500239
 % ./a.out
819175583
 % ./a.out
2443759599
 % ./a.out
2432498159

After:

 % ./a.out
4294960207
 % ./a.out
4294960207
 % ./a.out
4294960207
 % ./a.out
4294960207

That also explains the different number of cache misses: two variables that were in the same line in one run can end up in two different lines in the next.

Edit: Apparently this does not entirely solve the problem, but I think it was one of the reasons. I'll give the bounty to anyone who can help me resolve this issue.

answered Apr 06 '23 by Maxime Chéramy


It seems this is a known behavior in Valgrind:

I used the example above that prints a stack variable's address, and I also disabled the layout randomization.

I ran the executable twice getting the same results in both runs:

==15016== D   refs:       40,649  (28,565 rd   + 12,084 wr)
==15016== D1  misses:     11,465  ( 8,412 rd   +  3,053 wr)
==15016== LLd misses:      1,516  ( 1,052 rd   +    464 wr)
==15016== D1  miss rate:    28.2% (  29.4%     +   25.2%  )
==15016== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

villar@localhost ~ $ cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./a.out 

==15019== D   refs:       40,649  (28,565 rd   + 12,084 wr)
==15019== D1  misses:     11,465  ( 8,412 rd   +  3,053 wr)
==15019== LLd misses:      1,516  ( 1,052 rd   +    464 wr)
==15019== D1  miss rate:    28.2% (  29.4%     +   25.2%  )
==15019== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

According to the cachegrind documentation (http://www.cs.washington.edu/education/courses/cse326/05wi/valgrind-doc/cg_main.html)

Another thing worth noting is that results are very sensitive. Changing the size of the valgrind.so file, the size of the program being profiled, or even the length of its name can perturb the results. Variations will be small, but don't expect perfectly repeatable results if your program changes at all. While these factors mean you shouldn't trust the results to be super-accurate, hopefully they should be close enough to be useful.

After reading this, I changed the file name and got the following:

villar@localhost ~ $ mv a.out a.out2345345345
villar@localhost ~ $ cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./a.out2345345345 

==15022== D   refs:       40,652  (28,567 rd   + 12,085 wr)
==15022== D1  misses:     10,737  ( 8,201 rd   +  2,536 wr)
==15022== LLd misses:      1,517  ( 1,054 rd   +    463 wr)
==15022== D1  miss rate:    26.4% (  28.7%     +   20.9%  )
==15022== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

Changing the name back to "a.out" gave me exactly the same result as before.

Notice that changing the file name, or the path to it, changes the base of the stack! This may well be the cause, considering what Evgeny said in a prior comment:

When you change current working directory, you also change corresponding environment variable (and its length). Since a copy of all environment variables is usually stored just above the stack, you get different allocation for stack variables and different number of cache misses. (And shell could change some other variables besides "PWD").

EDIT: Documentation also says:

Program start-up/shut-down calls a lot of functions that aren't interesting and just complicate the output. Would be nice to exclude these somehow.

The simulated cache also tracks the program's start-up and shut-down code, which may be the source of the variations.

answered Apr 07 '23 by Emilcasvi