possible heap corruption (win 32, native c++)

I'm working with a single-threaded native C++ application. It has a bug that is very hard to trigger and that I cannot reproduce locally. I enabled full page heap and debug information in the release executable, and obtained dumps from a client (who has to use the application for many days before the bug appears).

What the client reports: the application hangs and never recovers. It has to be killed from the task manager. What I see from the dumps: the application is stuck in an infinite loop.

The loop comes from walking a doubly linked list that has become cyclic. There are signs of memory corruption: many data members have strange values, such as enum members with no matching enumerator or values under 0000FFFF, and the linked list itself is reported as having 300 million+ elements, which is not normal.

The only other information I can get from the dumps is that a socket read operation failed with 0 data read. This causes the walking of the (now cyclic) list.

I have several dumps all hanging in the same infinite loop. I've tried to get the allocation stack trace, but !heap -p -a gives me "ReadMemory error for address eeddccee Use `!address eeddccee' to check validity of the address." for all addresses I try.

Currently I'm working through the level 4 (/W4) compiler warnings, although I don't know which of them could be related to this. I have a bunch of C4100, C4511 and C4512 warnings that I don't know how to fix, so I'm mostly fixing no-brainers like C4244. DebugDiag did not find anything, except to report "This thread is not fully resolved and may or may not be a problem. Further analysis of these threads may be required." for the single thread.

From what I see, my options are fixing more warnings, re-reading the code until something jumps out at me, or learning something new from here.

Is this really a memory corruption? Why does it hang in the same structure every time? How can I find the cause?

Adrian P. asked Mar 23 '26 12:03


1 Answer

Fixing the warnings is a good idea - it may help you feel better and will certainly reduce confusion in the build - but it's unlikely to resolve the present issue, so it may be better left as an out-of-band task for the future.

A socket read that returns 0 bytes may imply the socket got closed down. Perhaps you have a timing problem here, where the socket closedown logic results in concurrent access to some shared data structure that is not properly locked. Take a good look at the socket code to make sure locking is correct and watertight. Make sure that all possible error codes are handled correctly in your sockets API calls (Winsock, presumably?). You can be sure that even the slightest window for concurrent access to a container, or any "that can't happen" error path, will eventually be hit in your production environment. I know you said the app is single-threaded, but Windows has a funny habit of giving you extra threads that you did not start up yourself, for example if you are using DLL services that themselves kick off new threads.
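To make the "handle every return value" point concrete, here is a minimal sketch of classifying the result of a socket read. The names `ReadOutcome`, `classify_read` and `describe` are made up for illustration and are not from the question's code. It assumes a stream (TCP) socket, where a return value of 0 from `recv` means the peer closed the connection gracefully rather than a failure, and it leaves the lookup of the error code (for example via `WSAGetLastError()` on Winsock) to the caller so the logic stays portable.

```cpp
// Hypothetical outcome categories for one socket read call.
enum class ReadOutcome { Data, PeerClosed, WouldBlock, Error };

// result:      the value returned by the read call (e.g. recv()).
// would_block: true if the call failed and the error code said that a
//              non-blocking socket simply had no data yet
//              (WSAEWOULDBLOCK on Winsock). Ignored when result >= 0.
ReadOutcome classify_read(int result, bool would_block) {
    if (result > 0) return ReadOutcome::Data;        // bytes were received
    if (result == 0) return ReadOutcome::PeerClosed; // orderly shutdown by peer
    return would_block ? ReadOutcome::WouldBlock     // retry later
                       : ReadOutcome::Error;         // real failure
}

// Short text for log lines; every enumerator is covered explicitly.
const char* describe(ReadOutcome outcome) {
    switch (outcome) {
        case ReadOutcome::Data:       return "data received";
        case ReadOutcome::PeerClosed: return "peer closed connection";
        case ReadOutcome::WouldBlock: return "no data available yet";
        case ReadOutcome::Error:      return "socket error";
    }
    return "unknown outcome";
}
```

Giving the zero-byte case its own branch keeps the closedown path separate from the error path, which makes it easier to see which one runs just before the list walk.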

It's hard when you cannot get good production diagnostics, but if you can narrow the problem down to a particular area, try to isolate the failing code in a unit test application that mimics real-life usage, and stress the heck out of it on your desktop. I have had intermittent bugs like this that took hours to reproduce even under heavy load in a specialized test app. Running in this mode (release build, of course) in the debugger may expose the issue more quickly than you would think.
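One thing such a stress test could do is validate the list after every operation, so that corruption is caught when it happens instead of days later as a hang. The sketch below is only an illustration: the `Node` layout, `ListStatus` and `check_list` are hypothetical, since the real list type is not shown in the question. It walks the list with an upper bound on the number of steps and checks that every forward link is mirrored by the matching back link.

```cpp
#include <cstddef>

// Hypothetical node layout; the application's real list type is not shown.
struct Node {
    Node* prev;
    Node* next;
    int payload;
};

enum class ListStatus { Ok, BadHead, BrokenBackLink, TooLong };

// Walks forward from head. Fails if the head has a predecessor, if some
// node's successor does not point back at it, or if the walk takes more
// than max_nodes steps (a sign of a cycle or a bogus element count).
ListStatus check_list(const Node* head, std::size_t max_nodes) {
    if (head == nullptr) return ListStatus::Ok;  // empty list is valid
    if (head->prev != nullptr) return ListStatus::BadHead;

    std::size_t steps = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (++steps > max_nodes) return ListStatus::TooLong;
        if (n->next != nullptr && n->next->prev != n) {
            return ListStatus::BrokenBackLink;
        }
    }
    return ListStatus::Ok;
}
```

Calling `check_list(head, expected_count)` after each insert and remove in the test app, and breaking into the debugger on anything other than `ListStatus::Ok`, narrows the search to the operation that damaged the list. Because the walk is bounded, the check itself cannot hang on a cyclic list.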

Another option may be to install the Process Dumper on the failing machine and instruct it to dump a full memory image (debuggable like a standard WinDbg DMP file) on access violation and process exit. This may provide better information than postmortem debugging of a minidump. If your client is cooperative, they can trigger the dump when the problem next occurs. This is the closest you can get to a live debug without being on the machine or having remote access to it.

You may want to consider generating extra diagnostics in the socket closedown logic as well to verify whether or not this is the proximate cause of the error condition.
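Since dumps are the main source of information here, one way to add such diagnostics is to keep a small trace buffer in memory, so the most recent events are visible in the next dump without any log files. This is a rough sketch under stated assumptions: `TraceEntry`, `TraceRing` and the event codes are invented for illustration, and the class is not thread-safe as written.

```cpp
#include <cstddef>
#include <cstdint>

// One recorded event. All fields are plain integers so the buffer is easy
// to read in a debugger's memory or watch window.
struct TraceEntry {
    std::uint32_t seq;    // sequence number of the event, starting at 0
    std::uint32_t event;  // application-defined code, e.g. 1 = "socket closing"
    std::uintptr_t arg;   // extra detail, e.g. a socket handle or node address
};

// Fixed-size ring buffer: the newest events overwrite the oldest ones.
class TraceRing {
public:
    static constexpr std::size_t kCapacity = 256;

    void record(std::uint32_t event, std::uintptr_t arg) {
        TraceEntry& e = entries_[next_ % kCapacity];
        e.seq = next_;
        e.event = event;
        e.arg = arg;
        ++next_;
    }

    // Total number of events recorded so far (not capped at kCapacity).
    std::uint32_t count() const { return next_; }

    // Most recently written entry; a zero-filled entry if nothing was recorded.
    const TraceEntry& latest() const { return entries_[(next_ - 1) % kCapacity]; }

    // Raw slot access, for inspection in tests or from a debugger.
    const TraceEntry& slot(std::size_t i) const { return entries_[i % kCapacity]; }

private:
    TraceEntry entries_[kCapacity] = {};
    std::uint32_t next_ = 0;
};
```

A single global `TraceRing` with `record` calls at the start and end of the closedown path, and around each list insert and remove, would show in the dump what ran just before the hang. Adding a thread ID field (for example from `GetCurrentThreadId()` on Windows) would also show whether an unexpected thread ever touches the list.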

Make sure your client's OS and other system software are up to date with all required patches. Maybe this is not even your fault (though, to be sure, it seems likely that you have a bug).

Steve Townsend answered Mar 25 '26 02:03


