
SIGSEGV when entering a function

What can cause a segmentation fault when just entering a function?

The function entered looks like:

21:  void eesu3(Matrix & iQ)
22:  {

where Matrix is a struct. When running with GDB the backtrace produces:

(gdb) backtrace 
#0  eesu3 (iQ=...) at /home/.../eesu3.cc:22
#1  ...

GDB does not say what iQ is. The ... are literally there. What could cause this?

GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

Program built with -O3 -g

The caller looks like this:

Matrix q;
// do some stuff with q
eesu3(q);

Nothing special here

I reran the program with valgrind:

valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --track-fds=yes <prgname>

Output:

==2240== Warning: client switching stacks?  SP change: 0x7fef7ef68 --> 0x7fe5e3000
==2240==          to suppress, use: --max-stackframe=10076008 or greater
==2240== Invalid write of size 8
==2240==    at 0x14C765B: eesu3( Matrix &) (eesu3.cc:22)
...
==2240==  Address 0x7fe5e3fd8 is on thread 1's stack
==2240== 
==2240== Can't extend stack to 0x7fe5e2420 during signal delivery for thread 1:
==2240==   no stack segment
==2240== 
==2240== Process terminating with default action of signal 11 (SIGSEGV)
==2240==  Access not within mapped region at address 0x7FE5E2420
==2240==    at 0x14C765B: eesu3( Matrix&) (eesu3.cc:22)
==2240==  If you believe this happened as a result of a stack
==2240==  overflow in your program's main thread (unlikely but
==2240==  possible), you can try to increase the size of the
==2240==  main thread stack using the --main-stacksize= flag.
==2240==  The main thread stack size used in this run was 8388608.

It looks like a corrupted stack.

    Dump of assembler code for function eesu3( Matrix & ):
   0x00000000014c7640 <+0>: push   %rbp
   0x00000000014c7641 <+1>: mov    %rsp,%rbp
   0x00000000014c7644 <+4>: push   %r15
   0x00000000014c7646 <+6>: push   %r14
   0x00000000014c7648 <+8>: push   %r13
   0x00000000014c764a <+10>:    push   %r12
   0x00000000014c764c <+12>:    push   %rbx
   0x00000000014c764d <+13>:    and    $0xfffffffffffff000,%rsp
   0x00000000014c7654 <+20>:    sub    $0x99b000,%rsp
=> 0x00000000014c765b <+27>:    mov    %rdi,0xfd8(%rsp)

To be clear: Matrix's data lives on the heap. The struct basically holds a pointer to the data and is small: 32 bytes (just checked).

Now, I rebuilt the program with different optimization options:

-O0: the error does not show.

-O1: the error does show.

-O3: the error does show.

--update

-O3 -fno-inline -fno-inline-functions: the error does not show.

That explains it: too many functions inlined into eesu3 led to excessive stack usage.

The problem was a stack overflow.

ritter asked May 08 '12 13:05


1 Answer

What can cause a segmentation fault when just entering a function?

The most frequent cause is stack exhaustion. Run (gdb) disas at the crash point. If the instruction that crashed is the first read or write to a stack location after %rsp has been decremented, then stack exhaustion is almost certainly the cause.

The solution usually involves creating threads with larger stacks, moving some large variables from the stack to the heap, or both.

Another possible cause: if Matrix contains a very large array, you can't put it on the stack: the kernel will not extend stack beyond current by more than 128K (or so; I don't remember the exact value). If Matrix is bigger than that limit, you can't put it on the stack.

Update:

   0x00000000014c7654 <+20>:    sub    $0x99b000,%rsp
=> 0x00000000014c765b <+27>:    mov    %rdi,0xfd8(%rsp)

This disassembly confirms the diagnosis.

In addition, you are reserving 0x99b000 bytes on the stack (that's almost 10MB, more than the 8388608-byte main thread stack reported by valgrind). There must be some humongous objects you are trying to allocate on the stack in the eesu3 routine. Don't do that.

What do you mean by "the kernel will not extend stack beyond current by more than"?

When you extend the stack (decrement %rsp) by, e.g., 1MB and then try to touch that stack location, the memory will not be accessible (the kernel grows the stack on demand). This generates a hardware trap and transfers control to the kernel. When the kernel decides what to do, it looks at:

  1. Current %rsp
  2. Memory location that the application tried to access
  3. Stack limit for the current thread

If the faulting address is below the current %rsp, but within 128K (or some other constant of similar magnitude), the kernel simply extends the stack (provided the extension does not exceed the stack limit).

If the faulting address is more than 128K below current %rsp (as appears to be the case here), you get SIGSEGV.

This all works nicely for most programs: even if they use a lot of stack in a recursive procedure, they usually extend the stack in small chunks. But an equivalent program that tried to reserve all that stack in a single routine would crash.

Anyway, run (gdb) info locals at the crash point and see which locals might be requiring 10MB of stack. Then move them to the heap.

Update 2:

No locals

Ah, the program has probably not made it far enough into eesu3 for there to be locals.

when building with -O0 the error disappears. GCC bug?

It could be a GCC bug, but more likely it's just that GCC is inlining a lot of other routines into eesu3, and each of the inlined routines needs its own N KBs of stack. Does the problem disappear if you build the source containing eesu3 with -fno-inline?

Unfortunately, triaging such behavior and figuring out appropriate workarounds, or fixing GCC, requires compiler expertise. You could start by compiling with -fdump-tree-all and looking at the generated <source>.*t.* files. These contain textual dumps of GCC's internal representation at various stages of the compilation process. You may be able to understand enough of it to make further progress.

Employed Russian answered Sep 20 '22 03:09