
Tools for Isolating a Stack Smashing Bug

To put it mildly, I have a small memory issue and am running out of tools and ideas to isolate the cause.

I have a highly multi-threaded (pthreads) C/C++ program that has developed a stack smashing issue under optimized compiles with GCC after 4.4.4 and prior to 4.7.1.

The symptom is that during the creation of one of the threads I get a full stack smash: not just %RIP, but all parent frames and most of the registers are 0x00 or other nonsense addresses. Which thread triggers the issue is seemingly random; however, judging by log messages it seems to be isolated to the same hunk of code, and seems to come at a semi-repeatable point in the creation of the new thread.

This has made it very hard to trap and isolate the offending code more narrowly than to a single compilation unit of many thousand lines, since print()s within the offending file have so far proved unreliable in trying to narrow down the active section.

The thread-creation code that kicks off the thread that eventually smashes the stack is:

 
extern "C"
{
static ThreadReturnVal ThreadAPI WriterThread(void *act)
{
   Recorder       *rec = reinterpret_cast  (act);
   xuint64        writebytes;
   LoggerHandle m_logger = XXGetLogger("WriterThread");

   if (SetThreadAffinity(rec->m_cpu_mask))
   { ... }
   SetThreadPrio((xint32)rec->m_thread_priority);

   while (true)
   {
     ... poll a ring buffer ... Hard Spin 100% use on a single core, this is that sort of crazy code. 
   }
}

I have tried a debug build, but the symptom is only present in optimized builds, -O2 or higher. I have tried Valgrind's memcheck and DRD, but both fail to find any issue before the stack is blown away (and it takes about 12 hours to reach the failure).
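For reference, invocations of those tools look roughly like this (the binary name and arguments are placeholders, not the actual program):

  # memcheck, with origin tracking for uninitialized-value reports
  valgrind --tool=memcheck --track-origins=yes ./recorder <args>
  # DRD, Valgrind's thread error detector
  valgrind --tool=drd ./recorder <args>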

A compile with -O2 -Wstack-protector sees nothing wrong; however, a build with -fstack-protector-all does protect me from the bug, yet emits no errors.
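The protected build looks roughly like this (target and source names are placeholders; -g and -fno-omit-frame-pointer are only there to keep post-mortem backtraces usable):

  # -fstack-protector-all puts a canary in every frame; -g keeps symbols.
  g++ -O2 -g -fstack-protector-all -fno-omit-frame-pointer -pthread \
      -o recorder recorder.cpp ...

If the protector is masking the overrun rather than catching it (it reorders locals and moves buffers up against the canary), that would explain getting protection without any report.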

Electric Fence also traps, but only after the stack is already gone.

Question: What other tools or techniques would be useful in narrowing down the offending section?

Many thanks, --Bill

asked Oct 03 '12 by Bill N.

2 Answers

A couple of options for approaching this sort of problem:

You could try setting a hardware breakpoint on a stack address before the corruption occurs and hope the debugger breaks early enough in the corruption to provide a vaguely useful debugging state. The tricky part here is choosing the right stack address; depending on how random the 'choice' of offending thread is, this might not be practical. But from one of your comments it sounds like it is often the newly created thread that gets smashed, so this might be doable. Try to break during thread creation, grab the thread's stack location, offset by some wild guess, set the hardware BP, and continue. Based on whether you break too early, too late, or not at all, adjust your offset, rinse, and repeat. This is basically advanced guess-and-check, and can be heavily hindered or outright impractical if the corruption pattern is too random, but it is surprising how often this can lead to a semi-legible stack and successful debugging efforts.
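For concreteness, that loop in GDB looks roughly like this; the 0x200 offset is a wild guess to tune, the thread number depends on the run, and expect some spurious hits from legitimate stack writes:

  (gdb) break pthread_create
  (gdb) run
  (gdb) finish                        # return from pthread_create
  (gdb) info threads                  # locate the newly created thread
  (gdb) thread 2                      # switch to it (the number varies)
  (gdb) print/x $sp                   # grab its current stack pointer
  (gdb) watch *(long *)($sp - 0x200)  # hardware watchpoint at sp minus a guess
  (gdb) continue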

Another option would be to start collecting crash dumps. Try to look for patterns across the crash dumps that might help bring you closer to the source of the corruption. Perhaps you'll get lucky and one of them will capture a crash 'faster'/'closer to the source'.
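On Linux the collection side looks roughly like this (paths and names are examples):

  # Enable core dumps and give them distinguishable names
  # (%e = executable name, %p = pid).
  ulimit -c unlimited
  echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
  # After a crash, load a dump and compare backtraces across failures:
  gdb ./recorder /tmp/core.recorder.12345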

Unfortunately, both of these techniques are more art than science; they're non-deterministic, rely on a healthy dose of luck, etc. (at least in my experience... that being said, there are people out there who can do amazing things with crash dumps, but it takes a lot of time to get to that level of skill).

One more side note: as others have pointed out, uninitialized memory is a very typical source of debug vs release differences, and could easily be your problem here. However, another possibility to keep in mind is timing differences. The order that threads get scheduled in, and for how long, is often dramatically different in debug vs release, and can easily lead to synchronization bugs being masked in one but not the other. These differences can be just due to execution speed differences, but I think some runtimes intentionally mess with thread scheduling in a debug environment.
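To make the uninitialized-memory point concrete, here is a contrived sketch (not from the code in question): a debug build may happen to read a zeroed stack slot where an optimized build leaves garbage, so the bug only surfaces under -O2.

  // Contrived sketch: 'limit' is never initialized. A debug build may happen
  // to see a zeroed stack slot, while -O2 leaves whatever garbage was last in
  // that register or slot, so only the optimized build walks off the array.
  #include <cstdio>

  static int sum(const int *data)
  {
      int limit;                        // BUG: uninitialized
      int total = 0;
      for (int i = 0; i < limit; ++i)   // undefined behavior: reads garbage
          total += data[i];
      return total;
  }

  int main()
  {
      int data[4] = {1, 2, 3, 4};
      std::printf("%d\n", sum(data));
      return 0;
  }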

answered Nov 19 '22 by WeirdlyCheezy


You can use a static analysis tool to check for some subtle errors; maybe one of the errors it finds will be the cause of your bug. You can find some information on these tools here.
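For example, assuming cppcheck and clang's scan-build are available (paths are placeholders):

  # cppcheck with all checks enabled over the source tree:
  cppcheck --enable=all src/
  # or wrap the existing build in clang's static analyzer:
  scan-build make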

answered Nov 19 '22 by Synxis