I am doing some experiments with Cachegrind, Callgrind and gem5. I noticed that a number of accesses are counted as reads by Cachegrind, as writes by Callgrind, and as both reads and writes by gem5.
Let's take a very simple example:
int main() {
    int i, l = 0;    /* counter incremented 100,000 times in total */
    for (i = 0; i < 1000; i++) {
        l++;
        l++;
        l++;
        l++;
        l++;
        l++;
        l++;
        l++;
        l++;
        l++;
        /* ... l++; repeated 100 times in total ... */
    }
}
I compile with:
gcc ex.c --static -o ex
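To inspect the generated assembly yourself (assuming gcc's default -O0 here, so that l is kept on the stack rather than in a register), the same source can also be compiled with -S:
gcc -S ex.c -o ex.s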
According to the generated assembly, the instruction
addl $1, -8(%rbp)
is executed 100,000 times. Since it is both a read and a write, I was expecting 100k reads and 100k writes. However, Cachegrind counts these accesses only as reads and Callgrind only as writes.
% valgrind --tool=cachegrind --I1=512,8,64 --D1=512,8,64
--L2=16384,8,64 ./ex
==15356== Cachegrind, a cache and branch-prediction profiler
==15356== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==15356== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==15356== Command: ./ex
==15356==
--15356-- warning: L3 cache found, using its data for the LL simulation.
==15356==
==15356== I refs: 111,535
==15356== I1 misses: 475
==15356== LLi misses: 280
==15356== I1 miss rate: 0.42%
==15356== LLi miss rate: 0.25%
==15356==
==15356== D refs: 104,894 (103,791 rd + 1,103 wr)
==15356== D1 misses: 557 ( 414 rd + 143 wr)
==15356== LLd misses: 172 ( 89 rd + 83 wr)
==15356== D1 miss rate: 0.5% ( 0.3% + 12.9% )
==15356== LLd miss rate: 0.1% ( 0.0% + 7.5% )
==15356==
==15356== LL refs: 1,032 ( 889 rd + 143 wr)
==15356== LL misses: 452 ( 369 rd + 83 wr)
==15356== LL miss rate: 0.2% ( 0.1% + 7.5% )
-
% valgrind --tool=callgrind --I1=512,8,64 --D1=512,8,64
--L2=16384,8,64 ./ex
==15376== Callgrind, a call-graph generating cache profiler
==15376== Copyright (C) 2002-2012, and GNU GPL'd, by Josef Weidendorfer et al.
==15376== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==15376== Command: ./ex
==15376==
--15376-- warning: L3 cache found, using its data for the LL simulation.
==15376== For interactive control, run 'callgrind_control -h'.
==15376==
==15376== Events : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
==15376== Collected : 111532 2777 102117 474 406 151 279 87 85
==15376==
==15376== I refs: 111,532
==15376== I1 misses: 474
==15376== LLi misses: 279
==15376== I1 miss rate: 0.42%
==15376== LLi miss rate: 0.25%
==15376==
==15376== D refs: 104,894 (2,777 rd + 102,117 wr)
==15376== D1 misses: 557 ( 406 rd + 151 wr)
==15376== LLd misses: 172 ( 87 rd + 85 wr)
==15376== D1 miss rate: 0.5% ( 14.6% + 0.1% )
==15376== LLd miss rate: 0.1% ( 3.1% + 0.0% )
==15376==
==15376== LL refs: 1,031 ( 880 rd + 151 wr)
==15376== LL misses: 451 ( 366 rd + 85 wr)
==15376== LL miss rate: 0.2% ( 0.3% + 0.0% )
Could someone give me a reasonable explanation? Would I be correct to consider that there are in fact ~100k reads and ~100k writes (i.e., 2 cache accesses per addl)?
Callgrind records the count of instructions, not the actual time spent in a function. If you have a program where the bottleneck is file I/O, the costs associated with reading and writing files won't show up in the profile, as those are not CPU-intensive tasks.
Callgrind is a profiling tool that records the call history among functions in a program's run as a call-graph. By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls.
To use Callgrind, you must specify --tool=callgrind on the Valgrind command line or use the supplied script callgrind. Callgrind's cache simulation is based on the Cachegrind tool of the Valgrind package.
Valgrind is a framework for debugging and profiling Linux programs; it can automatically detect many memory-management and threading bugs, making programs more stable and robust.
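For example, a typical run is profiled and then annotated with the supplied callgrind_annotate script (callgrind.out.<pid> is Callgrind's default output file name; replace <pid> with the process id printed for your run):
% valgrind --tool=callgrind ./ex
% callgrind_annotate callgrind.out.<pid>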
From the Cachegrind manual, section 5.7.1 ("Cache Simulation Specifics"):
Instructions that modify a memory location (e.g. inc and dec) are counted as doing just a read, i.e. a single data reference. This may seem strange, but since the write can never cause a miss (the read guarantees the block is in the cache) it's not very interesting.
Thus it measures not the number of times the data cache is accessed, but the number of times a data cache miss could occur.
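To make that counting rule concrete, here is a minimal C sketch of the attribution policy described above. This is not Cachegrind's actual implementation; the enum and the counter variables are made up for illustration, with Dr/Dw mirroring the event names in the output above.
#include <stdio.h>

/* Access kinds: a plain load, a plain store, and a read-modify-write
 * such as "addl $1, -8(%rbp)". */
enum access_kind { LOAD, STORE, MODIFY };

static long Dr = 0, Dw = 0;   /* simulated data-read / data-write counters */

/* Cachegrind-style attribution: a MODIFY counts as a single data read,
 * because the store part can never miss once the load has pulled the
 * line into the cache. */
static void count_access(enum access_kind a)
{
    if (a == STORE)
        Dw++;
    else              /* LOAD or MODIFY */
        Dr++;
}

int main(void)
{
    /* 1000 loop iterations x 100 increments, as in the example program */
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 100; j++)
            count_access(MODIFY);

    printf("Dr = %ld, Dw = %ld\n", Dr, Dw);   /* prints Dr = 100000, Dw = 0 */
    return 0;
}
Under this policy the 100,000 addl instructions all land in the read column, which matches the ~103k rd / ~1k wr split Cachegrind reports above.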
It would seem that Callgrind's cache-simulation logic differs from Cachegrind's. I would expect Callgrind to produce the same results as Cachegrind, so maybe this is a bug?
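One consistency check, based on the numbers above: both tools report the same D refs total, 104,894, but Cachegrind splits it as 103,791 rd + 1,103 wr while Callgrind splits it as 2,777 rd + 102,117 wr. So each addl appears to be counted exactly once by both tools (roughly the 100,000 modifies plus a few thousand other data accesses); they simply attribute the modify to different columns, Cachegrind folding it into the read count and Callgrind apparently folding it into the write count. If gem5 counts each addl as one read plus one write, its totals would be expected to come out roughly 100,000 higher than either Valgrind tool's.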