 

Debug Win32 application hang

I'm having trouble finding the cause for a hang in a Win32 application. The software renders some data to an OpenGL visual in a tight loop:

std::vector<uint8_t> indices;
glPolygonMode(GL_FRONT_AND_BACK, GL_FILL);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_DOUBLE, 0, vertexDataBuffer);
while (...) {
    // get index type (1, 2, 4) and index count
    indices.resize(indexType * count);

    // get indices into "indices" buffer
    getIndices(indices.data(), indices.size()); //< seems to hang here!

    // draw (I'm using the correct parameters)
    glDrawElements(GL_TRIANGLES_*, count, GL_UNSIGNED_*, indices.data());
}
glDisableClientState(GL_VERTEX_ARRAY);

The code is compiled with VC11 Update 1 (CTP 3). When I run the optimized binary, it hangs inside the call to getIndices() (more about this below) after a few iterations of the loop. I have already:

  • triple-validated all buffers, and even appended CRCs to make sure there are no buffer overruns
  • added a call to HeapValidate() inside the loop to ensure the heap is not corrupt
  • used Application Verifier
  • enabled heap allocation monitoring using GFlags and PageHeap
  • broken into WinDbg when the application locks up
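The CRC check from the first bullet can be illustrated portably. This is a minimal sketch, not the asker's code: the CheckedBuffer type is hypothetical, and it stores a CRC-32 of the payload in a 4-byte trailer, so a stray write into the payload or a small overrun into the trailer is detected on the next verify().

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
std::uint32_t crc32(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Hypothetical helper: payload bytes followed by a 4-byte CRC trailer.
struct CheckedBuffer {
    std::vector<std::uint8_t> bytes;

    explicit CheckedBuffer(std::size_t payloadSize)
        : bytes(payloadSize + sizeof(std::uint32_t)) {}

    std::uint8_t* data() { return bytes.data(); }
    std::size_t size() const { return bytes.size() - sizeof(std::uint32_t); }

    // Call after the payload has been legitimately written.
    void seal() {
        const std::uint32_t crc = crc32(bytes.data(), size());
        std::memcpy(bytes.data() + size(), &crc, sizeof crc);
    }

    // Returns false if the payload or the trailer changed since seal().
    bool verify() const {
        std::uint32_t stored = 0;
        std::memcpy(&stored, bytes.data() + size(), sizeof stored);
        return stored == crc32(bytes.data(), size());
    }
};
```

Usage: fill buf.data()[0..size()), call seal(), and call verify() at any later point (for example once per loop iteration) to check that nothing has overwritten the buffer in the meantime.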

I found neither invalid accesses to the allocated buffer nor any heap corruption. However, if I disable the low-fragmentation heap, the issue vanishes. It also vanishes if I use a separate (low-fragmentation) heap for the indices buffer.

Anyway, here is the stack trace leading to the deadlock:

0:000> kb
ChildEBP RetAddr  Args to Child              
0034e328 77b039c3 00000000 0034e350 00000000 ntdll!ZwWaitForKeyedEvent+0x15
0034e394 77b062bc 77b94724 080d36a8 0034e464 ntdll!RtlAcquireSRWLockExclusive+0x12e
0034e3c0 77aeb652 0034e464 0034e4b4 00000000 ntdll!RtlpCallVectoredHandlers+0x58
0034e3d4 77aeb314 0034e464 0034e4b4 77b94724 ntdll!RtlCallVectoredExceptionHandlers+0x12
0034e44c 77aa0133 0034e464 0034e4b4 0034e464 ntdll!RtlDispatchException+0x19
0034e44c 77b062c5 0034e464 0034e4b4 0034e464 ntdll!KiUserExceptionDispatcher+0xf
0034e7bc 77aeb652 0034e860 0034e8b0 00000000 ntdll!RtlpCallVectoredHandlers+0x61
0034e7d0 77aeb314 0034e860 0034e8b0 0034ec28 ntdll!RtlCallVectoredExceptionHandlers+0x12
0034e848 77aa0133 0034e860 0034e8b0 0034e860 ntdll!RtlDispatchException+0x19
0034e848 1c43c666 0034e860 0034e8b0 0034e860 ntdll!KiUserExceptionDispatcher+0xf
0034ebe8 1c43c4e5 0034ec28 080d35d0 080d35d6 lcdb4!lc::db::PackedIndices::unpackIndices<unsigned char>+0x86
0034ec14 1c45922d 0034ec28 080d35d0 00000006 lcdb4!lc::db::PackedIndices::unpack+0xb5
...
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx getIndices

For completeness, I posted the code of lc::db::PackedIndices::unpackIndices(), including all code added for debugging, to http://ideone.com/sVVXX7.

The code triggering the call to KiUserExceptionDispatcher is (*p++) = static_cast<T>(index); (mov dword ptr [esp+10h],eax).
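Because the source of unpackIndices() is only available through the link above, here is a hypothetical sketch (not the original code) of a width-templated unpack step of the same shape: it narrows each index to T, as the line above does, and writes it into a byte vector sized indexType * count, as in the loop. The names unpackIndicesSketch and unpackInto and the plain uint32_t source format are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an unpackIndices<T>-style routine: narrows each
// 32-bit source index to T and stores it into a raw byte buffer.
template <typename T>
bool unpackIndicesSketch(const std::vector<std::uint32_t>& src,
                         std::uint8_t* dst, std::size_t dstBytes) {
    if (dstBytes != src.size() * sizeof(T)) return false;  // size must match exactly
    std::uint8_t* p = dst;
    for (std::uint32_t index : src) {
        const T value = static_cast<T>(index);  // same narrowing as static_cast<T>(index)
        std::memcpy(p, &value, sizeof(T));      // byte-wise store, no alignment assumptions
        p += sizeof(T);
    }
    return true;
}

// Mirrors the loop body: resize to indexType * count bytes, then unpack.
bool unpackInto(std::vector<std::uint8_t>& out, int indexType,
                const std::vector<std::uint32_t>& src) {
    if (indexType != 1 && indexType != 2 && indexType != 4) return false;  // unsupported width
    out.resize(static_cast<std::size_t>(indexType) * src.size());
    switch (indexType) {
        case 1: return unpackIndicesSketch<std::uint8_t>(src, out.data(), out.size());
        case 2: return unpackIndicesSketch<std::uint16_t>(src, out.data(), out.size());
        case 4: return unpackIndicesSketch<std::uint32_t>(src, out.data(), out.size());
        default: return false;
    }
}
```

Copying each value with std::memcpy instead of storing through a reinterpret_cast T* keeps the sketch free of alignment and strict-aliasing concerns; it makes no difference if the destination page itself is not writable.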

I just can't seem to figure out what's going on. An exception seems to have been thrown, but none of my exception handlers are called. The application just hangs. I checked for any deadlocked critical sections (!lock) but found none. Furthermore, I don't see why an exception should be raised, as the memory locations are all valid. Could anyone give me some hints?

Update

I tried to find the type of exception being thrown:

0:000> s -d esp L1000 1003f
0028ebdc  0001003f 00000000 00000000 00000000  ?...............
0028efd8  0001003f 00000000 00000000 00000000  ?...............
0:000> .cxr 0028ebdc
eax=77b94724 ebx=0804be30 ecx=00000002 edx=00000004 esi=77b94724 edi=0804be28
eip=77b062c5 esp=0028eec4 ebp=0028eee4 iopl=0         nv up ei ng nz na pe cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010287
ntdll!RtlpCallVectoredHandlers+0x61:
77b062c5 ff03            inc     dword ptr [ebx]      ds:002b:0804be30=00000001
0:000> .cxr 0028efd8
eax=0000003b ebx=00000001 ecx=0804bd98 edx=0028f340 esi=0028f340 edi=04b77580
eip=1c43c296 esp=0028f2c0 ebp=0028f2fc iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
lcdb4!lc::db::PackedIndices::unpackIndices<unsigned char>+0x36:
1c43c296 8801            mov     byte ptr [ecx],al          ds:002b:0804bd98=3e
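The search pattern 1003f is not arbitrary. A CONTEXT record starts with its ContextFlags field, and a fully populated x86 record (CONTEXT_ALL) carries 0x1003f, which is what both hits above contain. The sketch below recomputes that value from the individual x86 flag bits (restated from winnt.h under local names so it compiles anywhere) and mimics the stack search over a plain array. findDword is a hypothetical helper, not a debugger API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// x86 ContextFlags bits as defined in winnt.h, under local names.
namespace x86ctx {
constexpr std::uint32_t kI386              = 0x00010000u;
constexpr std::uint32_t kControl           = kI386 | 0x01u;  // EBP, EIP, CS, EFLAGS, ESP, SS
constexpr std::uint32_t kInteger           = kI386 | 0x02u;  // EDI, ESI, EBX, EDX, ECX, EAX
constexpr std::uint32_t kSegments          = kI386 | 0x04u;  // DS, ES, FS, GS
constexpr std::uint32_t kFloatingPoint     = kI386 | 0x08u;  // x87 state
constexpr std::uint32_t kDebugRegisters    = kI386 | 0x10u;  // DR0-DR3, DR6, DR7
constexpr std::uint32_t kExtendedRegisters = kI386 | 0x20u;  // FXSAVE area
constexpr std::uint32_t kAll = kControl | kInteger | kSegments |
                               kFloatingPoint | kDebugRegisters | kExtendedRegisters;
}  // namespace x86ctx

static_assert(x86ctx::kAll == 0x1003Fu, "x86 CONTEXT_ALL is 0x1003f");

// Rough analogue of "s -d esp L1000 1003f": returns the dword offsets whose
// value equals `pattern`. Each hit is only a candidate CONTEXT record to try
// with .cxr, since any unrelated dword could hold the same value.
std::vector<std::size_t> findDword(const std::uint32_t* mem, std::size_t count,
                                   std::uint32_t pattern) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < count; ++i)
        if (mem[i] == pattern) hits.push_back(i);
    return hits;
}
```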
Daniel Gehriger asked Nov 02 '12 10:11



1 Answer

The thread is hung waiting to acquire an SRW (slim reader/writer) lock exclusively. The lock belongs to the OS exception-handling code, and the exception being dispatched is caused by your code. You can find the exact exception and its details from the following stack frame: 0034e848 77aa0133 0034e860 0034e8b0 0034e860 ntdll!RtlDispatchException+0x19. The first argument to RtlDispatchException is a pointer to an EXCEPTION_RECORD, so typing .exr 0034e860 displays the exception record. If the exception is an access violation, the record also tells you which address the faulting access targeted.
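.exr prints the record's ExceptionCode and its ExceptionInformation parameters. For an access violation (0xC0000005), ExceptionInformation[0] says whether the access was a read (0), a write (1) or a DEP violation (8), and ExceptionInformation[1] is the inaccessible address. The hypothetical describeException() helper below decodes those fields portably; the constants are restated from the Windows documentation, and STATUS_HEAP_CORRUPTION (0xC0000374) is included because it is the status Windows uses to report detected heap corruption.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// NTSTATUS values as documented by Microsoft, restated locally.
constexpr std::uint32_t kStatusAccessViolation = 0xC0000005u;  // EXCEPTION_ACCESS_VIOLATION
constexpr std::uint32_t kStatusHeapCorruption  = 0xC0000374u;  // STATUS_HEAP_CORRUPTION

// Hypothetical helper: turns ExceptionCode plus ExceptionInformation[]
// (as shown by .exr) into a one-line description.
std::string describeException(std::uint32_t code, const std::uintptr_t* info,
                              std::size_t numParams) {
    std::ostringstream out;
    out << std::hex << std::showbase;
    if (code == kStatusAccessViolation && numParams >= 2) {
        // info[0]: 0 = read, 1 = write, 8 = DEP violation; info[1]: faulting address.
        const char* kind = info[0] == 0 ? "read"
                         : info[0] == 1 ? "write"
                         : info[0] == 8 ? "execute (DEP)"
                                        : "unknown";
        out << "access violation: " << kind << " at " << info[1];
    } else if (code == kStatusHeapCorruption) {
        out << "heap corruption";
    } else {
        out << "exception code " << code;
    }
    return out.str();
}
```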

After these steps, you found that the access violation was caused by a write to an address you had legitimately allocated on the heap. To see the protection attributes of the virtual page containing that address, use the command !address <virtual address>.
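!address reports a page's protection as a PAGE_* constant. To illustrate why a write to a perfectly valid heap allocation can still fault, the hypothetical writeAllowed() helper below checks whether a given protection value permits a plain write. The constant values are restated from the Windows memory-protection documentation under local names.

```cpp
#include <cstdint>

// PAGE_* protection constants as documented for Windows, restated locally.
constexpr std::uint32_t kPageNoAccess         = 0x01u;
constexpr std::uint32_t kPageReadOnly         = 0x02u;
constexpr std::uint32_t kPageReadWrite        = 0x04u;
constexpr std::uint32_t kPageWriteCopy        = 0x08u;
constexpr std::uint32_t kPageExecute          = 0x10u;
constexpr std::uint32_t kPageExecuteRead      = 0x20u;
constexpr std::uint32_t kPageExecuteReadWrite = 0x40u;
constexpr std::uint32_t kPageExecuteWriteCopy = 0x80u;
constexpr std::uint32_t kPageGuard            = 0x100u;

// Hypothetical helper: would a plain write to a page with this protection succeed?
bool writeAllowed(std::uint32_t protect) {
    if (protect & kPageGuard) return false;      // first touch raises a guard-page exception
    const std::uint32_t base = protect & 0xFFu;  // drop modifier bits such as PAGE_NOCACHE
    return base == kPageReadWrite || base == kPageWriteCopy ||
           base == kPageExecuteReadWrite || base == kPageExecuteWriteCopy;
}
```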

You then found that some code had changed the protection of those heap pages to PAGE_READONLY. Based on that, and on the call stacks of the other threads, I have the following conjecture, which may help you find the root cause.

My guess is that the Windows heap manager changes the page protection before raising an exception to signal heap corruption. Judging from the call stacks of the other threads you showed, there also seems to be some corruption in the OLE heap.

The root of the problem is probably code that corrupts a heap. The heap manager later detects the corruption and raises an exception; the OS exception-dispatch code then kicks in and hangs on the SRW lock before it can call any exception handler in your code or in other libraries. After that, another, unsuspecting thread in your code legitimately touches heap memory that the heap manager has already protected because of the corruption it found. This raises another exception, and the exception-dispatch code falls into the same deadlock. Since you said the problem does not reproduce when the program runs under the debugger, a timing issue or race condition is a likely suspect.

nanda answered Sep 27 '22 17:09
