While trying to optimize a code, I'm a bit puzzled by differences in profiles produced by kcachegrdind
and gprof
. Specifically, if I use gprof (compiling with the -pg
switch, etc), I have this:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
89.62 3.71 3.71 204626 0.02 0.02 objR<true>::R_impl(std::vector<coords_t, std::allocator<coords_t> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&) const
5.56 3.94 0.23 18018180 0.00 0.00 W2(coords_t const&, coords_t const&)
3.87 4.10 0.16 200202 0.00 0.00 build_matrix(std::vector<coords_t, std::allocator<coords_t> > const&)
0.24 4.11 0.01 400406 0.00 0.00 std::vector<double, std::allocator<double> >::vector(std::vector<double, std::allocator<double> > const&)
0.24 4.12 0.01 100000 0.00 0.00 Wrat(std::vector<coords_t, std::allocator<coords_t> > const&, std::vector<coords_t, std::allocator<coords_t> > const&)
0.24 4.13 0.01 9 1.11 1.11 std::vector<short, std::allocator<short> >* std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<std::vector<short, std::alloca
Which seems to suggest that I need not bother looking anywhere but ::R_impl(...)
At the same time, if I compile without the -pg
switch and run valgrind --tool=callgrind ./a.out
instead, I have something rather different: here's a screenshot of the kcachegrind
output
If I interpret this correctly, it seems to suggest that ::R_impl(...)
only takes about 50% of time, while the other half is spent in linear algebra (Wrat(...)
, eigenvalues
and the underlying lapack calls ) which was way down below in the gprof
profile.
I understand that gprof
and cachegrind
use different techniques, and I'd not bother if their results were somewhat different. But here, it looks very different, and I'm at loss as to how to interpret those. Any ideas or suggestions?
You are looking at the wrong column. You have to look at the second column in kcachegrind output, the one named "self". This is the time spent by the particular subroutine only without considering its children. The first column has the cumulative time (it is equal to 100% of machine time for the main) and it is not that informative (in my opinion).
Note that from the output of kcachegrind you can see that the total time of the process is 53.64 second while the time spent in the subroutine "R_impl" is 46.72 second which is 87% of the total time. So gprof and kcachegrind agree almost perfectly.
gprof
is an instrumented profiler, callgrind
is a sampling profiler. With an instrumented profiler you get overhead for every function entry and exit, which can skew the profile, particularly if you have relatively small functions which are called many times. Sampling profilers tend to be more accurate - they slow the overall program execution slightly, but this tends to have the same relative effect on all functions.
Try the free 30 day evaluation of Zoom from RotateRight - I suspect it will give you a profile which agrees more with callgrind
than with gprof
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With