Suppose my binaries are running at a customer site where I cannot enable core dump generation using ulimit -c. How do engineers debug segmentation faults in such real-world scenarios? Is there any other method of debugging or identifying crashes without core dumps being generated?
Check shell limits: usually it is the limit on stack size that causes this kind of problem. To check memory limits, use the ulimit command in bash or ksh, or the limit command in csh or tcsh. Try setting the stack size higher, and then re-run your program to see if the segfault goes away.
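As a rough illustration, the same limit can also be inspected and raised from inside the program via getrlimit()/setrlimit(); this is only a sketch and assumes the hard limit permits the increase:

#include <sys/resource.h>
#include <cstdio>

int main()
{
    struct rlimit rl;

    // Query the current stack size limit (the equivalent of "ulimit -s").
    // RLIM_INFINITY shows up as a very large number.
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        std::printf("stack soft limit: %llu, hard limit: %llu\n",
                    (unsigned long long)rl.rlim_cur,
                    (unsigned long long)rl.rlim_max);

    // Raise the soft limit to the hard limit; only a privileged process
    // can raise the hard limit itself.
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
        std::perror("setrlimit");

    return 0;
}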
Core dump (segmentation fault) in C/C++: a segmentation fault is a specific kind of error caused by accessing memory that “does not belong to you.” When a piece of code tries to read or write a read-only location in memory, or a block of memory that has already been freed, the program crashes and a core dump may be produced.
You have to check that no_prod is < 1024 before writing to it; otherwise you'll write to unallocated memory, which is what gives you a segmentation fault. Once no_prod reaches 1024, you have to abort the program (I assume you haven't worked with dynamic allocation yet).
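A minimal sketch of that bounds check, assuming a fixed-size array of 1024 entries indexed by no_prod (the names come from the answer above and are purely illustrative):

#include <cstdio>
#include <cstdlib>

int main()
{
    int prod[1024];      // fixed-size storage, no dynamic allocation
    int no_prod = 0;
    int value;

    while (std::scanf("%d", &value) == 1)
    {
        if (no_prod >= 1024)
        {
            // Writing prod[1024] or beyond would corrupt memory and can
            // trigger a segmentation fault, so stop before it happens.
            std::fprintf(stderr, "array full, aborting\n");
            std::abort();
        }
        prod[no_prod++] = value;
    }
    return 0;
}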
Let’s say you run nano and get a “Segmentation fault” error. That is a situation where a core dump file could be produced, but it is not by default.
Most segmentation faults occur in lower-level languages like C (the most commonly used and most fundamental language on both Linux and UNIX), because C gives developers full control over memory allocation and usage.
Along with halting the program or process, a core file or core dump will often be generated; it is an important tool for debugging the program and finding the cause of the segfault, because it captures specific information about the process that was running when the segmentation fault occurred.
If the core file isn’t produced, check if the user has write permission on the directory and if the filesystem has enough space to store the core dump file.
In the past, I had to deal with this kind of restriction on several occasions. A segmentation fault or, more generally, abnormal process termination had to be investigated with the caveat that a core dump was not available.
For Linux, our platform of choice for this walkthrough, a few reasons come to mind: core dump generation may be disabled altogether (via limits.conf or ulimit), or the target location for the dump (configured via /proc/sys/kernel/core_pattern) does not exist or is inaccessible due to filesystem permissions or SELinux. For all of those, the net result is the same: there's no (valid) core dump to use for analysis. Fortunately, a workaround exists for post-mortem debugging that has the potential to save the day, but given its inherent limitations, your mileage may vary from case to case.
The following sample contains a classic use-after-free memory error:
#include <iostream>
#include <string>

struct Test
{
    const std::string &m_value;   // reference to a string owned elsewhere

    Test(const std::string &value):
        m_value(value)
    {
    }

    void print()
    {
        std::cout << m_value << std::endl;
    }
};

int main()
{
    std::string *value = new std::string("this is a test");
    Test test(*value);
    delete value;      // m_value now dangles
    test.print();      // use-after-free: reads through the dangling reference
    return 0;
}
After delete value, the std::string reference Test::m_value points to inaccessible memory. Therefore, running it results in a segmentation fault:
$ ./a.out
Segmentation fault
When a process terminates due to an access violation, the Linux kernel creates a log entry accessible via dmesg and, depending on the system's configuration, the syslog (usually /var/log/messages). The example (compiled with -O0) creates the following entry:
$ dmesg | grep segfault
[80440.957955] a.out[7098]: segfault at ffffffffffffffe8 ip 00007f9f2c2b56a3 sp 00007ffc3e75bc48 error 5 in libstdc++.so.6.0.19[7f9f2c220000+e9000]
The corresponding Linux kernel source from arch/x86/mm/fault.c:
printk("%s%s[%d]: segfault at %lx ip %px sp %px error %lx",
loglvl, tsk->comm, task_pid_nr(tsk), address,
(void *)regs->ip, (void *)regs->sp, error_code);
The error (error_code) reveals what the trigger was. It's a CPU-specific bit set (x86). In our case, the value 5 (101 in binary) indicates that the page represented by the faulting address 0xffffffffffffffe8 was mapped but inaccessible due to page protection, and a read was attempted.
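For reference, here is a small, illustrative decoder for that error code; the bit layout follows the x86 page-fault error code, and the snippet is only a sketch for interpreting dmesg output, not production code:

#include <cstdint>
#include <cstdio>

// Decode the x86 page-fault error code from the kernel's segfault log line
// ("error 5" in our example). Bit layout:
//   bit 0 (P)    : 0 = page not present, 1 = protection violation
//   bit 1 (W/R)  : 0 = read access,      1 = write access
//   bit 2 (U/S)  : 0 = kernel mode,      1 = user mode
//   bit 3 (RSVD) : reserved bit violation in a page-table entry
//   bit 4 (I/D)  : fault caused by an instruction fetch
void decode_pf_error(std::uint64_t error)
{
    std::printf("cause      : %s\n", (error & 1)  ? "protection violation" : "page not present");
    std::printf("access     : %s\n", (error & 2)  ? "write" : "read");
    std::printf("mode       : %s\n", (error & 4)  ? "user" : "kernel");
    std::printf("reserved   : %s\n", (error & 8)  ? "yes" : "no");
    std::printf("instr fetch: %s\n", (error & 16) ? "yes" : "no");
}

int main()
{
    decode_pf_error(0x5);   // the value from the dmesg entry above:
                            // protection violation, read, user mode
    return 0;
}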
The log message identifies the module that executed the faulting instruction: libstdc++.so.6.0.19. The sample was compiled without optimization, so the call to std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) was not inlined:
400bef: e8 4c fd ff ff callq 400940 <_ZStlsIcSt11char_traitsIcESaIcEERSt13basic_ostreamIT_T0_ES7_RKSbIS4_S5_T1_E@plt>
The STL performs the read access. Knowing those basics, how can we identify where the segmentation fault occurred exactly? The log entry features two essential addresses we need for doing so:
ip 00007f9f2c2b56a3 [...] error 5 in
^^^^^^^^^^^^^^^^
libstdc++.so.6.0.19[7f9f2c220000+e9000]
^^^^^^^^^^^^
The first is the instruction pointer (rip) at the time of the access violation, the second is the address the .text section of the library is mapped to. By subtracting the .text base address from rip, we get the relative address of the instruction in the library and can disassemble the implementation using objdump (you can simply search for the offset):
0x7f9f2c2b56a3-0x7f9f2c220000=0x956a3
$ objdump --demangle -d /usr/lib64/libstdc++.so.6
[...]
00000000000956a0 <std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, s
td::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<ch
ar>, std::allocator<char> > const&)@@GLIBCXX_3.4>:
956a0: 48 8b 36 mov (%rsi),%rsi
956a3: 48 8b 56 e8 mov -0x18(%rsi),%rdx
^^^^^
956a7: e9 24 4e fc ff jmpq 5a4d0 <std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)@plt>
956ac: 0f 1f 40 00 nopl 0x0(%rax)
[...]
Is that the correct instruction? We can consult GDB to confirm our analysis:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b686a3 in std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /lib64/libstdc++.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-323.el7_9.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64
(gdb) disass
Dump of assembler code for function _ZStlsIcSt11char_traitsIcESaIcEERSt13basic_ostreamIT_T0_ES7_RKSbIS4_S5_T1_E:
0x00007ffff7b686a0 <+0>: mov (%rsi),%rsi
=> 0x00007ffff7b686a3 <+3>: mov -0x18(%rsi),%rdx
0x00007ffff7b686a7 <+7>: jmpq 0x7ffff7b2d4d0 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
End of assembler dump.
GDB shows the very same instruction. We can also use a debugging session to verify the read address:
(gdb) print /x $rsi-0x18
$2 = 0xffffffffffffffe8
This value matches the read address in the log entry.
So, despite the absence of a core dump, the kernel output enables us to identify the exact location of the segmentation fault. In many scenarios, though, that is far from being enough. For one thing, we're missing the list of calls that got us to that point - the call stack or stack trace.
Without a dump in the backpack, you have two options to get hold of the callers: you can start your process using catchsegv (a glibc utility) or you can implement your own signal handler.
catchsegv serves as a wrapper, generates the stack trace, and also dumps register values and the memory map:
$ catchsegv ./a.out
*** Segmentation fault
Register dump:
RAX: 0000000002158040 RBX: 0000000002158040 RCX: 0000000002158000
[...]
Backtrace:
/lib64/libstdc++.so.6(_ZStlsIcSt11char_traitsIcESaIcEERSt13basic_ostreamIT_T0_ES7_RKSbIS4_S5_T1_E+0x3)[0x7f1794fd36a3]
??:?(_ZN4Test5printEv)[0x400bf4]
??:?(main)[0x400b2d]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f179467a555]
??:?(_start)[0x4009e9]
Memory map:
00400000-00401000 r-xp 00000000 08:02 50331747 /home/user/a.out
[...]
7f1794f3e000-7f1795027000 r-xp 00000000 08:02 33600977 /usr/lib64/libstdc++.so.6.0.19
7f1795027000-7f1795227000 ---p 000e9000 08:02 33600977 /usr/lib64/libstdc++.so.6.0.19
7f1795227000-7f179522f000 r--p 000e9000 08:02 33600977 /usr/lib64/libstdc++.so.6.0.19
7f179522f000-7f1795231000 rw-p 000f1000 08:02 33600977 /usr/lib64/libstdc++.so.6.0.19
[...]
How does catchsegv work? It essentially injects a signal handler using LD_PRELOAD and the library libSegFault.so. If your application already happens to install a signal handler for SIGSEGV and you intend to take advantage of libSegFault.so, your signal handler needs to forward the signal to the original handler (as returned in the old-action output parameter of sigaction(SIGSEGV, NULL, &old_action)).
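A sketch of that forwarding idea, assuming the previous handler was installed before ours (for example, by libSegFault.so via LD_PRELOAD); the dispatch logic is simplified and the names are illustrative:

#include <signal.h>

// Previously installed disposition (e.g., libSegFault.so's handler when the
// program was started via catchsegv/LD_PRELOAD).
static struct sigaction g_previous;

static void chaining_handler(int sig, siginfo_t *info, void *context)
{
    // ... application-specific logging would go here ...

    // Forward to the original handler so its stack trace is not lost.
    if (g_previous.sa_flags & SA_SIGINFO)
    {
        if (g_previous.sa_sigaction)
            g_previous.sa_sigaction(sig, info, context);
    }
    else if (g_previous.sa_handler != SIG_DFL && g_previous.sa_handler != SIG_IGN)
    {
        g_previous.sa_handler(sig);
    }

    // If nothing above terminated the process, fall back to the default
    // action so the kernel still kills the process (and logs the fault).
    signal(sig, SIG_DFL);
    raise(sig);
}

int main()
{
    struct sigaction sa = {};
    sa.sa_sigaction = chaining_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    // The third argument receives the handler that was active before ours.
    sigaction(SIGSEGV, &sa, &g_previous);

    volatile int *p = nullptr;
    *p = 1;   // deliberately fault to exercise the chain
    return 0;
}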
The second option is to implement the stack trace functionality yourself using a custom signal handler and backtrace(). This allows you to customize the output location and the output itself.
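A minimal sketch of such a handler, assuming glibc's backtrace()/backtrace_symbols_fd() from <execinfo.h> (output goes straight to a file descriptor to avoid malloc() inside the handler; async-signal-safety is otherwise glossed over):

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_backtrace_handler(int sig)
{
    void *frames[64];
    int count = backtrace(frames, 64);

    const char msg[] = "*** Segmentation fault - backtrace:\n";
    (void)write(STDERR_FILENO, msg, sizeof(msg) - 1);

    // Prints one "<module>(<symbol>+<offset>) [<address>]" line per frame,
    // comparable to the Backtrace section of catchsegv.
    backtrace_symbols_fd(frames, count, STDERR_FILENO);

    // Restore the default action and re-raise so the process still dies
    // with SIGSEGV (and produces a core dump where that is enabled).
    signal(sig, SIG_DFL);
    raise(sig);
}

int main()
{
    struct sigaction sa = {};
    sa.sa_handler = segv_backtrace_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, nullptr);

    volatile int *p = nullptr;
    *p = 42;   // deliberately fault to demonstrate the handler
    return 0;
}

For the executable's own non-exported functions, linking with -rdynamic helps backtrace_symbols_fd print symbol names instead of bare addresses.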
Based on the catchsegv output above, we can essentially do the same we did before (0x7f1794fd36a3-0x7f1794f3e000=0x956a3). This time around, we can go back to the callers to dig deeper. The second frame is represented by the following line:
??:?(_ZN4Test5printEv)[0x400bf4]
0x400bf4 is the return address of the call Test::print() makes into the stream operator, i.e., the address the callee returns to; it's located in the executable. We can visualize the call site as follows:
$ objdump --demangle -d ./a.out
[...]
400bea: bf a0 20 60 00 mov $0x6020a0,%edi
400bef: e8 4c fd ff ff callq 400940 <std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std:
:char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::basic_string<char, std::char_trai
ts<char>, std::allocator<char> > const&)@plt>
400bf4: be 70 09 40 00 mov $0x400970,%esi
^^^^^^
400bf9: 48 89 c7 mov %rax,%rdi
400bfc: e8 5f fd ff ff callq 400960 <std::ostream::operator<<(std::ostream& (*)(std::ostream&))@plt>
[...]
Note that the output of objdump matches the address in this instance because we run it against the executable, which has a default base address of 0x400000 on x86_64 - objdump takes that into account. With address space layout randomization (ASLR) enabled (compiled with -fpie, linked with -pie), the base address has to be taken into account as outlined before.
Going back further involves the same steps:
??:?(main)[0x400b2d]
$ objdump --demangle -d ./a.out
[...]
400b1c: e8 af fd ff ff callq 4008d0 <operator delete(void*)@plt>
400b21: 48 8d 45 d0 lea -0x30(%rbp),%rax
400b25: 48 89 c7 mov %rax,%rdi
400b28: e8 a7 00 00 00 callq 400bd4 <Test::print()>
400b2d: b8 00 00 00 00 mov $0x0,%eax
^^^^^^
400b32: eb 2a jmp 400b5e <main+0xb1>
[...]
Until now, we've been manually translating the absolute address to a relative address. Instead, the base address of the module can be passed to objdump via --adjust-vma=<base-address>. That way, the value of rip or a caller's address can be used directly.
We've come a long way without a dump. For debugging to be effective, another critical puzzle piece is absent, however: debug symbols. Without them, it can be difficult to map the assembly to the corresponding source code. Compiling the sample with -O3 and without debug information illustrates the problem:
[98161.650474] a.out[13185]: segfault at ffffffffffffffe8 ip 0000000000400a4b sp 00007ffc9e738270 error 5 in a.out[400000+1000]
As a consequence of inlining, the log entry now points to our executable as the trigger. Using objdump gets us to the following:
400a3e: e8 dd fe ff ff callq 400920 <operator delete(void*)@plt>
400a43: 48 8b 33 mov (%rbx),%rsi
400a46: bf a0 20 60 00 mov $0x6020a0,%edi
400a4b: 48 8b 56 e8 mov -0x18(%rsi),%rdx
^^^^^^
400a4f: e8 4c ff ff ff callq 4009a0 <std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)@plt>
400a54: 48 89 c5 mov %rax,%rbp
400a57: 48 8b 00 mov (%rax),%rax
Part of the stream implementation was inlined, making it harder to identify the associated source code. Without symbols, you have to use exported symbols, calls (like operator delete(void*)), and the surrounding instructions (mov $0x6020a0 loads the address of std::cout: 00000000006020a0 <std::cout@@GLIBCXX_3.4>) for orientation.
With debug symbols (-g), more context is available by calling objdump with --source:
400a43: 48 8b 33 mov (%rbx),%rsi
operator<<(basic_ostream<_CharT, _Traits>& __os,
const basic_string<_CharT, _Traits, _Alloc>& __str)
{
// _GLIBCXX_RESOLVE_LIB_DEFECTS
// 586. string inserter not a formatted function
return __ostream_insert(__os, __str.data(), __str.size());
400a46: bf a0 20 60 00 mov $0x6020a0,%edi
400a4b: 48 8b 56 e8 mov -0x18(%rsi),%rdx
^^^^^^
400a4f: e8 4c ff ff ff callq 4009a0 <std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)@plt>
400a54: 48 89 c5 mov %rax,%rbp
That worked as expected. In the real world, debug symbols are not embedded in the binaries - they are managed in separate debuginfo packages. In those circumstances, objdump ignores debug symbols even if they are installed. To address this limitation, symbols have to be re-added to the affected binary. The following procedure creates detached symbols and re-adds them using eu-unstrip from elfutils, to the benefit of objdump:
# compile with debug info
g++ segv.cxx -O3 -g
# create detached debug info
objcopy --only-keep-debug a.out a.out.debug
# remove debug info from executable
strip -g a.out
# re-add debug info to executable
eu-unstrip ./a.out ./a.out.debug -o ./a.out-debuginfo
# objdump with executable containing debug info
objdump --demangle -d ./a.out-debuginfo --source
Thus far, we've been using objdump because it's usually available, even on production systems. Can we just use GDB instead? Yes, by executing gdb with the module of interest. I use 0x400a4b as in the previous objdump invocation:
$ gdb ./a.out
[...]
(gdb) disass 0x400a4b
Dump of assembler code for function main():
[...]
0x0000000000400a43 <+67>: mov (%rbx),%rsi
0x0000000000400a46 <+70>: mov $0x6020a0,%edi
0x0000000000400a4b <+75>: mov -0x18(%rsi),%rdx
0x0000000000400a4f <+79>: callq 0x4009a0 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt>
0x0000000000400a54 <+84>: mov %rax,%rbp
In contrast to objdump, GDB can deal with external symbol information without a hitch. disass /m corresponds to objdump --source:
(gdb) disass /m 0x400a4b
Dump of assembler code for function main():
[...]
21 Test test(*value);
22 delete value;
0x0000000000400a25 <+37>: test %rbx,%rbx
0x0000000000400a28 <+40>: je 0x400a43 <main()+67>
0x0000000000400a3b <+59>: mov %rbx,%rdi
0x0000000000400a3e <+62>: callq 0x400920 <_ZdlPv@plt>
23 test.print();
24 return 0;
25 }
0x0000000000400a88 <+136>: add $0x18,%rsp
[...]
End of assembler dump.
In case of an optimized binary, GDB might skip instructions in this mode if the source code cannot be mapped unambiguously. Our instruction at 0x400a4b is not listed. objdump never skips instructions and might skip the source context instead - an approach that I prefer for debugging at this level. This does not mean that GDB is not useful for this task; it's just something to be aware of.
Termination reason, registers, memory map, and stack trace. It's all there without even a trace of a core dump. While definitely useful (I fixed quite a few crashes that way), you have to keep in mind that you're still missing valuable information by going that route, most notably the stack and heap as well as per-thread data (thread metadata, registers, stack).
So, whatever the scenario may be, you should seriously consider enabling core dump generation and ensure that dumps can be generated successfully if push comes to shove. Debugging in itself is complex enough; debugging without information you could technically have needlessly increases complexity and turnaround time and, more importantly, significantly lowers the probability that the root cause can be found and addressed in a timely manner.
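If the shell-level ulimit -c route is off the table, as in the original question, one option worth evaluating is having the process raise its own core file size limit at startup; a sketch, assuming the hard limit (and the core_pattern configuration) actually allows dumps to be written:

#include <sys/resource.h>
#include <cstdio>

// Raise the soft RLIMIT_CORE limit to the hard limit so that a crash of this
// process can produce a core dump even if the invoking shell disabled it.
void enable_core_dumps()
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
    {
        std::perror("getrlimit(RLIMIT_CORE)");
        return;
    }

    rl.rlim_cur = rl.rlim_max;   // an unprivileged process may raise its
                                 // soft limit up to the hard limit
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        std::perror("setrlimit(RLIMIT_CORE)");
}

int main()
{
    enable_core_dumps();
    // ... application code ...
    return 0;
}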