Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What causes failure to unwind in a DWARF perf call stack?

In backtraces produced by perf record --call-graph dwarf and printed by perf script, I am consistently getting wrong addresses for maybe 5% of call stacks, i.e. unwinding fails. One example is

my_bin 770395 705462.825887:    3560360 cycles: 
        7f0398b9b7e2 __vsnprintf_internal+0x12 (/usr/lib/x86_64-linux-gnu/libc-2.32.so)
        7ffcdb6fbfdf [unknown] ([stack])

my_bin 770395 705462.827040:    3447195 cycles: 
        7f0398ba1624 __GI__IO_default_xsputn+0x104 (inlined)
        7ffcdb6fb7af [unknown] ([stack])

and it was produced from this code (compiled with g++ -O3 -g -fno-omit-frame-pointer my_bin.cpp -o my_bin):

#include <cstdio>
#include <iostream>

int __attribute__ ((noinline)) libc_string(int x) {
    char buf[64] = {0};
    // Some nonsense workload using libc
    int res = 0;
    for (int i = 0; i < x; ++i) {
        res += snprintf(buf, 64, "%d %d %d Can I bring my friends to tea?", (i%10), (i%3)*10, i+2);
        res = res % 128;
    }
    return res;
}

int main() {
    int result = libc_string(20000000);
    std::cout << result << "\n";
}

I'm pretty sure that my program should not have executable code in the stack, so these addresses seem wrong. It's not only one program, but most programs that I've profiled which have about 5% wrong call stacks. These call stacks mostly just have two stack frames, with the innermost one sometimes in libraries like Eigen (even though they generally have correct calls stacks) and sometimes in the C++ STL or libc. I've seen the unwinding end up in unknown, [stack], [heap], anon, //anon, libstdc++.so.6.0.28, or <my_bin>.

I've seen this on Ubuntu 18.04, 20.04 and 20.10.

This happens only with DWARF unwinding. How can this be fixed?

like image 226
nnnmmm Avatar asked Jun 18 '21 14:06

nnnmmm


1 Answers

Which other types of unwinding did you test?

In examples I disabled kernel ASLR feature with setarch x86_64 -R for more stable addresses and smaller perf.data files.

Usage of perf record option -e cycles:u command may help, as it will not include kernel samples.

I have reproduction of similar dwarf unwind issue for files generated with default perf record event ('-e cycles:u'; libc-2.31 from libc6-prof package was used) for __GI__IO_default_xsputn (inlined) function:

env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu setarch `uname -m` -R perf record --call-graph dwarf -o perf.data.dwarf.u -e cycles:u ./my_bin
perf script -i perf.data.dwarf.u |less

Incorrect sample:

my_bin 28100 760107.271010:     461418 cycles:u: 
            7ffff7c74f06 __GI__IO_default_xsputn+0x106 (inlined)
            7ffff7c59c6c __vfprintf_internal+0xf4c (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)

Correct sample:

my_bin 28100 760107.257283:     267268 cycles:u: 
            7ffff7c74eff __GI__IO_default_xsputn+0xff (inlined)
            7ffff7c59c6c __vfprintf_internal+0xf4c (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
            7ffff7c6e9f6 __vsnprintf_internal+0xb6 (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
            7ffff7d14a2c ___snprintf_chk+0x9c (inlined)
            555555555314 libc_string+0xb4 (/home/user/so/my_bin)
            555555555314 libc_string+0xb4 (/home/user/so/my_bin)
            555555555111 main+0x11 (/home/user/so/my_bin)
            7ffff7c040fa __libc_start_main+0x10a (/usr/lib/libc6-prof/x86_64-linux-gnu/libc-2.31.so)
            55555555519d _start+0x2d (/home/user/so/my_bin)

For correct unwind I have many different offsets in __GI__IO_default_xsputn+ (the number after +):

perf script -i perf.data.dwarf.u ||grep vsnprintf_internal -B3 |grep _GI__IO_default_xsputn|cut -d + -f 2|sort | uniq -c
...
    208 0x0 (inlined)
     45 0x101 (inlined)
      2 0x105 (inlined)
     91 0x10 (inlined)
    294 0x110 (inlined)
      2 0x117 (inlined)
      2 0x11d (inlined)
    326 0x121 (inlined)

But I have no +0x106 address with correct unwind. And all incorrect unwinds have +0x106 address. Let's check with gdb (it is easier after ASLR disabling; +262 is +0x106):

env LD_LIBRARY_PATH=/lib/libc6-prof/x86_64-linux-gnu setarch `uname -m` -R  gdb -q ./my_bin
(gdb) start
(gdb) x/i 0x7ffff7c74f06
   0x7ffff7c74f06 <__GI__IO_default_xsputn+262>:    retq  
(gdb) disassemble __GI__IO_default_xsputn
Dump of assembler code for function __GI__IO_default_xsputn:
...
   0x00007ffff7c74f05 <+261>:   pop    %rbp
   0x00007ffff7c74f06 <+262>:   retq   

Unwind issue seems to be connected with inlined(?) functions sampled at retq instruction or after pop %rbp?

like image 179
osgx Avatar answered Oct 24 '22 04:10

osgx