Though I can intuitively get most of the results, I'm having hard time fully understanding the output of the perf report
command especially for what concerns the call graph, so I wrote a stupid test to solve this issue of mine once for all.
I compiled what follows with:
gcc -Wall -pedantic -lm perf-test.c -o perf-test
No aggressive optimizations to avoid inlining and such.
#include <math.h> #define N 10000000UL #define USELESSNESS(n) \ do { \ unsigned long i; \ double x = 42; \ for (i = 0; i < (n); i++) x = sin(x); \ } while (0) void baz() { USELESSNESS(N); } void bar() { USELESSNESS(2 * N); baz(); } void foo() { USELESSNESS(3 * N); bar(); baz(); } int main() { foo(); return 0; }
perf record ./perf-test perf report
With these I get:
94,44% perf-test libm-2.19.so [.] __sin_sse2 2,09% perf-test perf-test [.] sin@plt 1,24% perf-test perf-test [.] foo 0,85% perf-test perf-test [.] baz 0,83% perf-test perf-test [.] bar
Which sounds reasonable since the heavy work is actually performed by __sin_sse2
and sin@plt
is probably just a wrapper, while the overhead of my functions take into account just the loop, overall: 3*N
iterations for foo
, 2*N
for the other two.
perf record -g ./perf-test perf report -G perf report
Now the overhead columns that I get are two: Children
(the output is sorted by this one by default) and Self
(the same overhead of the flat profile).
Here is where I start feeling I miss something: regardless of the fact that I use -G
or not I'm unable to explain the hierarchy in terms of "x calls y" or "y is called by x", for example:
without -G
("y is called by x"):
- 94,34% 94,06% perf-test libm-2.19.so [.] __sin_sse2 - __sin_sse2 + 43,67% foo + 41,45% main + 14,88% bar - 37,73% 0,00% perf-test perf-test [.] main main __libc_start_main - 23,41% 1,35% perf-test perf-test [.] foo foo main __libc_start_main - 6,43% 0,83% perf-test perf-test [.] bar bar foo main __libc_start_main - 0,98% 0,98% perf-test perf-test [.] baz - baz + 54,71% foo + 45,29% bar
__sin_sse2
is called by main
(indirectly?), foo
and bar
but not by baz
?baz
) and sometimes not (e.g., the last instance of bar
)?with -G
("x calls y"):
- 94,34% 94,06% perf-test libm-2.19.so [.] __sin_sse2 + __sin_sse2 + __libc_start_main + main - 37,73% 0,00% perf-test perf-test [.] main - main + 62,05% foo + 35,73% __sin_sse2 2,23% sin@plt - 23,41% 1,35% perf-test perf-test [.] foo - foo + 64,40% __sin_sse2 + 29,18% bar + 3,98% sin@plt 2,44% baz __libc_start_main main foo
__sin_sse2
?main
calls foo
and that's ok, but why if it calls __sin_sse2
and sin@plt
(indirectly?) it does not also call bar
and baz
?__libc_start_main
and main
appear under foo
? And why foo
appears twice?Suspect is that there are two levels of this hierarchy, in which the second actually represents the "x calls y"/"y is called by x" semantics, but I'm tired to guess so I'm asking here. And the documentation doesn't seem to help.
Sorry for the long post but I hope that all this context may help or act as a reference for someone else too.
There is builtin perf. data parser and printer in perf tool of linux tools with subcommand "script". perf-script - Read perf. data (created by perf record) and display trace output This command reads the input file and displays the trace recorded.
perf report is able to auto-detect whether a perf. data file contains branch stacks and it will automatically switch to the branch view mode, unless --no-branch-stack is used. --branch-history Add the addresses of sampled taken branches to the callstack. This allows to examine the path the program took to each sample.
Sampling with perf record Overhead: the percentage of overall samples collected in the corresponding function.
Perf works on the Model Specific Registers of your CPU for measurements like cycles or branch-misses or so. A special Part called PMU(Performance Measurement Unit) is counting all kinds of events.
Alright, well, let's ignore temporarily the difference between caller and callee call-graphs, mostly because when I compare the results between these two options on my machine, I only see effects inside the kernel.kallsyms
DSO for reasons I don't understand -- relatively new to this myself.
I found that for your example, it's a little easier to read the whole tree. So, using --stdio
, let's look at the whole tree for __sin_sse2
:
# Overhead Command Shared Object Symbol # ........ ......... ................. ...................... # 94.72% perf-test libm-2.19.so [.] __sin_sse2 | --- __sin_sse2 | |--44.20%-- foo | | | --100.00%-- main | __libc_start_main | _start | 0x0 | |--27.95%-- baz | | | |--51.78%-- bar | | foo | | main | | __libc_start_main | | _start | | 0x0 | | | --48.22%-- foo | main | __libc_start_main | _start | 0x0 | --27.84%-- bar | --100.00%-- foo main __libc_start_main _start 0x0
So, the way I read this is: 44% of the time, sin
is called from foo
; 27% of the time it's called from baz
, and 27% from bar.
The documentation for -g is instructive:
-g [type,min[,limit],order[,key]], --call-graph Display call chains using type, min percent threshold, optional print limit and order. type can be either: · flat: single column, linear exposure of call chains. · graph: use a graph tree, displaying absolute overhead rates. · fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object. order can be either: - callee: callee based call graph. - caller: inverted caller based call graph. key can be: - function: compare on functions - address: compare on individual code addresses Default: fractal,0.5,callee,function.
The important piece here is that the default is fractal, and in fractal mode, each branch is a new object.
So, you can see that 50% of the time that baz
is called, it's called from bar
, and the other 50% it's called from foo
.
This isn't always the most useful measure, so it's instructive to look at the results using -g graph
:
94.72% perf-test libm-2.19.so [.] __sin_sse2 | --- __sin_sse2 | |--41.87%-- foo | | | --41.48%-- main | __libc_start_main | _start | 0x0 | |--26.48%-- baz | | | |--13.50%-- bar | | foo | | main | | __libc_start_main | | _start | | 0x0 | | | --12.57%-- foo | main | __libc_start_main | _start | 0x0 | --26.38%-- bar | --26.17%-- foo main __libc_start_main _start 0x0
This changes to using absolute percentages, where each percentage of time is reported for that call chain: So foo->bar
is 26% of the total ticks (which in turn calls baz
), and foo->baz
(direct) is 12% of the total ticks.
I still have no idea why I don't see any differences between callee and caller graphs though, from the perspective of __sin_sse2
.
One thing I did change from your command line is how the callgraphs were gathered. Linux perf by default uses the frame pointer method of reconstructing callstacks. This can be a problem when the compiler uses -fomit-frame-pointer
as a default. So I used
perf record --call-graph dwarf ./perf-test
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With