
Tips on debugging segmentation faults when no leaks are found

I wrote a C-based application that appears to run fine, except on very large datasets as input.

With large input, I get a segmentation fault in the final steps of the run.

I ran the binary (with the test input) with valgrind:

valgrind --tool=memcheck --leak-check=yes /foo/bar/baz inputDataset > outputAnalysis

This job normally takes a few hours, but with valgrind it took seven days.

Unfortunately, at this point, I don't know how to read the results I am getting from this run.

I get a lot of these warnings:

...
==4074== Conditional jump or move depends on uninitialised value(s)                                                                                                                  
==4074==    at 0x435900: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x439CC5: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x400BF2: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x402086: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x402A0F: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x41684F: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x4001B8: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x7FEFFFF57: ???                                                                                                                                                      
==4074==  Uninitialised value was created                                                                                                                                            
==4074==    at 0x461D3A: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x43F926: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x416B9B: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x416725: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x4001B8: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x7FEFFFF57: ???
...

There are no parts of code hinted at, no names of variables, etc. What can I do with this information?

At the end, I finally get the following error but, as with smaller datasets that do not crash, valgrind finds no leaks:

...
==4074== Process terminating with default action of signal 11 (SIGSEGV)                                                                                                              
==4074==  Access not within mapped region at address 0x7158E7F7                                                                                                                      
==4074==    at 0x7158E7F7: ???                                                                                                                                                       
==4074==    by 0x4020B8: ??? (in /foo/bar/baz)                                                                                   
==4074==    by 0x6322203A22656D6E: ???                                                                                                                                               
==4074==    by 0x306C675F6E557267: ???                                                                                                                                               
==4074==    by 0x202C22373232302F: ???                                                                                                                                               
==4074==    by 0x6D616E656C696621: ???                                                                                                                                               
==4074==    by 0x72686322203A2264: ???                                                                                                                                               
==4074==    by 0x3030306C675F6E54: ???                                                                                                                                               
==4074==    by 0x346469702E373231: ???                                                                                                                                               
==4074==    by 0x646469662E34372F: ???                                                                                                                                               
==4074==    by 0x722E64616568656B: ???                                                                                                                                               
==4074==    by 0x63656D6F6C756764: ???                                                                                                                                               
==4074==  If you believe this happened as a result of a stack                                                                                                                        
==4074==  overflow in your program's main thread (unlikely but                                                                                                                       
==4074==  possible), you can try to increase the size of the                                                                                                                         
==4074==  main thread stack using the --main-stacksize= flag.                                                                                                                        
==4074==  The main thread stack size used in this run was 10485760.                                                                                                                  
==4074==                                                                                                                                                                             
==4074== HEAP SUMMARY:                                                                                                                                                               
==4074==     in use at exit: 0 bytes in 0 blocks                                                                                                                                     
==4074==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated                                                                                                                    
==4074==                                                                                                                                                                             
==4074== All heap blocks were freed -- no leaks are possible                                                                                                                         
==4074==                                                                                                                                                                             
==4074== For counts of detected and suppressed errors, rerun with: -v                                                                                                                
==4074== ERROR SUMMARY: 1603141870 errors from 86 contexts (suppressed: 0 from 0)
Segmentation fault

Everything I allocate space for gets an equivalent free statement, after which I set pointers to NULL.

At this point, how can I best debug this application, to determine what else is causing the segmentation fault?


22 Dec 2011 - Edit

I compiled a debug-version of my binary, called debug-binary, using the following compilation flags:

-D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE=1 -DUSE_ZLIB -g -O0 -Wformat -Wall -pedantic -std=gnu99

When I run it with valgrind, I don't get much more information:

valgrind -v --tool=memcheck --leak-check=yes --error-limit=no --track-origins=yes debug-binary input > output

Here's a snippet of output:

==25116== 2 errors in context 14 of 14:                                                                                                                                                                                                      
==25116== Invalid read of size 4                                                                                                                                                                                                             
==25116==    at 0x4045E8: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x40682F: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x404F0C: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x401FA4: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x402016: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x403B27: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x40295E: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x31A021D993: (below main) (in /lib64/libc-2.5.so)                                                                                                                                                                           
==25116==  Address 0x539f188 is 24 bytes inside a block of size 48 free'd                                                                                                                                                                    
==25116==    at 0x4A05D21: free (vg_replace_malloc.c:325)                                                                                                                                                                                    
==25116==    by 0x401F6B: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x402016: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x403B27: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x40295E: ??? (in /foo/bar/debug-binary)                                                                                                                                 
==25116==    by 0x31A021D993: (below main) (in /lib64/libc-2.5.so) 

Is this an issue with my binary, or with a system library (libc) that my application is dependent upon?

I also don't know what to do about interpreting the ??? entries. Is there another compilation flag I need to get valgrind to provide more information?

Alex Reynolds asked Dec 19 '11 22:12


3 Answers

Valgrind basically says there are no notable heap management issues. The segfault more likely comes from a simpler programming error.

If it were me, I would

  • compile it with gcc -g,
  • enable core dump files (ulimit -c unlimited),
  • run the program normally and let it fault,
  • then use gdb to examine the core file and look at what it was doing when it faulted:

    gdb (programfile) (corefile)
    bt
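If changing the shell's limit is awkward (for example, when the binary is launched by a job scheduler), roughly the same effect can probably be had from inside the program. The following is only a minimal sketch, assuming a POSIX system; enable_core_dumps is a hypothetical helper name, not something from the question's code. It raises the soft core-file size limit to the hard limit, which is as far as an unprivileged process can go.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper (not from the original program): raise the soft
 * core-file size limit to the hard limit so that a crash can leave a
 * core file behind. Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* The soft limit may be raised up to, but not beyond, the hard limit. */
    lim.rlim_cur = lim.rlim_max;

    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Calling this once near the top of main(), before the long run starts, should leave the process eligible to dump core; the core file can then be opened with gdb and bt as described above. Where the core file ends up depends on system configuration.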

wallyk answered Oct 25 '22 01:10


I don't believe valgrind is able to find all errors where you've written past the end of a variable on the stack (without overflowing the stack itself). So, you may want to try gcc's -fstack-protector-all option.

You should also try mudflap, with -fmudflap (single-threaded) or -fmudflapth (multi-threaded).

Both mudflap and stack protector should be much faster than valgrind.
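The kind of bug meant here is a write past the end of a local array that still lands inside the same stack frame, typically from an unbounded string copy. As an illustration only, here is a minimal sketch of a bounded copy that reports truncation instead of overrunning; copy_bounded is a hypothetical helper and is not taken from the question's code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, which holds cap bytes.
 * Never writes more than cap bytes (including the terminating NUL).
 * Returns 0 if the whole string fit, -1 if it was truncated. */
int copy_bounded(char *dst, size_t cap, const char *src)
{
    /* snprintf returns the length the full string would have had. */
    int n = snprintf(dst, cap, "%s", src);

    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

Replacing strcpy or sprintf into fixed-size locals with something like this, and checking the return value, turns a silent stack overrun into an error the program can report.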

In addition, it looks like you don't have debug symbols, which makes backtraces hard to read. Add -ggdb. You probably also want to enable core-file generation (try ulimit -c unlimited). This way, you can debug the process post-crash by using gdb program core.

As @wallyk indicates, your segfault may actually be something fairly easy to find. For example, maybe you're dereferencing NULL, and gdb can point you to the exact line (or at least close to it, if you compile with optimization rather than -O0). This would make sense, for example, if you're just running out of memory for your larger datasets, so that malloc returns NULL, and you forgot to check for that somewhere.
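One way to rule out the unchecked-malloc case is to route every allocation through a wrapper that fails loudly. This is a minimal sketch under that assumption; checked_malloc and XMALLOC are hypothetical names, not part of the question's code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: allocate size bytes, or report where the
 * request came from and abort if the allocation fails. */
void *checked_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);

    /* malloc(0) may legitimately return NULL, so only treat NULL as
     * a failure when a non-zero size was requested. */
    if (p == NULL && size != 0) {
        fprintf(stderr, "%s:%d: malloc(%zu) failed\n", file, line, size);
        abort();
    }
    return p;
}

/* Records the calling file and line automatically. */
#define XMALLOC(size) checked_malloc((size), __FILE__, __LINE__)
```

Because abort() raises SIGABRT, a failed allocation would also produce a core file when core dumps are enabled, so the same gdb workflow applies.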

Lastly, if nothing else makes sense, there is always the possibility of hardware issues. But those would be expected to be fairly random, e.g., different values getting corrupted on different runs. If you try a different machine and it happens there too, it's extremely unlikely to be a hardware issue.

derobert answered Oct 25 '22 02:10


The "Conditional jump or move depends on uninitialised value" is a serious bug you need to fix. It indicates that the behaviour of your program is affected by the contents of an uninitialised variable (including an uninitialised memory region returned by malloc()).
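To illustrate the malloc() case: a buffer that is branched on before every element has been written triggers exactly this warning. The following is a minimal sketch, with hypothetical names (new_counts, count_nonzero) that are not from the question's code, showing one way to guarantee initialised contents by allocating with calloc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a table of n counters, all zero.
 * malloc(n * sizeof(unsigned)) would return memory with indeterminate
 * contents; calloc zero-fills it. Returns NULL on failure. */
unsigned *new_counts(size_t n)
{
    return calloc(n, sizeof(unsigned));
}

/* Counts the non-zero entries. The comparison below is a conditional
 * jump that depends on counts[i], so every element must have been
 * initialised before this is called. */
size_t count_nonzero(const unsigned *counts, size_t n)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        if (counts[i] != 0)
            hits++;
    }
    return hits;
}
```

Whether zero is the right initial value depends on the data; the point is only that every byte the program later tests must have been written first.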

To get readable backtraces from valgrind you need to compile with -g.

caf answered Oct 25 '22 02:10