Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the best way of finding a heap corruption that only occurs under a performance test?

The software I work (written in C++) on has a heap corruption problem at the moment. Our perf test team keep getting WER faults when the number of users logged on to the box reaches a certain threshhold but the dumps they've given me just show corruptions in inoncent areas (like when std::string frees it's underlying memory for example).

I've tried using Appverifier and this did throw up a number of issues which I've now fixed. However I'm now in the situation where the testers can load up the machine as much as possible with Appverifier and have a clean run but still get heap corruption when running without Appverifier (I guess since they can get more users on etc without). This has meant I've been unable to get a dump which actually shows the problem.

Does anyone have any other ideas for useful techniques or technologies I can use? I've done as much analysis as I can on the heap corruption dumps without appverifier but I can't see any common themes. No threads doing anything intersting at the same time as the crash, and the thread which crashes is innocent which makes me think the corruption occured some time before.

like image 365
Benj Avatar asked Aug 04 '11 12:08

Benj


2 Answers

The best tool is Appverifier in combination with gFlags but there are many other solutions that may help.

For example, you could specify a heap check every 16 malloc, realloc, free, and _msize operations with the following code:

#include <crtdbg.h>
int main( )
{
int tmp;

// Get the current bits
tmp = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);

// Clear the upper 16 bits and OR in the desired freqency
tmp = (tmp & 0x0000FFFF) | _CRTDBG_CHECK_EVERY_16_DF;

// Set the new bits
_CrtSetDbgFlag(tmp);
}
like image 170
cprogrammer Avatar answered Oct 14 '22 02:10

cprogrammer


You have my sympathies: a very difficult problem to track down.

As you say normally these occur some time prior to the crash, generally as the result of a misbehaving write (e.g. writing to deleted memory, running off the end of an array, exceeding the allocated memory in a memcpy, etc).

In the past (on Linux, I gather you're on Windows) I've used heap-checking tools (valgrind, purify, intel inspector) but as you've observed these often affect the performance and thus obscure the bug. (You don't say whether its a multi-threaded app, or processing a variable dataset such as incoming messages).

I have also overloaded the new and delete operators to detect double deletes, but this is quite a specific situation.

If none of the available tools help, then you're on you're own and its going to be a long debugging process. The best advice for this I can offer is to work on reducing the test scenario which will reproduce it. Then attempt to reduce the amount of code being exercised, i.e. stubbing out parts of functionality. Eventually you'll zero-in on the problem, but I've seen very good guys spend 6 weeks or more tracking these down on a large application (~1.5 million LOC).

All the best.

like image 31
Dave Avatar answered Oct 14 '22 02:10

Dave