I'm using Windbg to analyse a deadlock occurring in an data-snap application server written in delphi.
When I run
!analyze -hang -v
I get this
:000:x86> !analyze -hang -v ******************************************************************************* * * * Exception Analysis * * * ******************************************************************************* GetPageUrlData failed, server returned HTTP status 404 URL requested: http://watson.microsoft.com/00000000.htm?Retriage=1 FAULTING_IP: +6ced240 00000000 ?? ??? EXCEPTION_RECORD: ffffffffffffffff -- (.exr 0xffffffffffffffff) ExceptionAddress: 0000000000000000 ExceptionCode: 80000003 (Break instruction exception) ExceptionFlags: 00000000 NumberParameters: 0 FAULTING_THREAD: 0000000000000000 BUGCHECK_STR: HANG DEFAULT_BUCKET_ID: APPLICATION_HANG PROCESS_NAME: ********.exe ERROR_CODE: (NTSTATUS) 0xcfffffff - EXCEPTION_CODE: (NTSTATUS) 0xcfffffff - MOD_LIST: NTGLOBALFLAG: 0 APPLICATION_VERIFIER_FLAGS: 0 DERIVED_WAIT_CHAIN: Dl Eid Cid WaitType -- --- ------- -------------------------- 0 c7c.2634 Critical Section WAIT_CHAIN_COMMAND: ~0s;k;; BLOCKING_THREAD: 0000000000002634 PRIMARY_PROBLEM_CLASS: APPLICATION_HANG LAST_CONTROL_TRANSFER: from 0000000077138df4 to 000000007711f8b1 STACK_TEXT: 0018fc50 77138df4 00000c6c 00000000 00000000 ntdll_77100000!NtWaitForSingleObject+0x15 0018fcb4 77138cd8 00000000 00000000 03fe0940 ntdll_77100000!RtlpWaitOnCriticalSection+0x13e 0018fcdc 7369324f 736a3134 00000000 03fe0940 ntdll_77100000!RtlEnterCriticalSection+0x150 WARNING: Stack unwind information not available. Following frames may be wrong. 0018fcec 7369af5f 00000388 00000000 003d1e00 mswsock!GetLspGuid+0x19af 0018fd08 76366958 00000388 0018fd84 0018fd9c mswsock!GetLspGuid+0x96bf 0018fd38 0018fd58 763668cd 00000388 0018fd84 ws2_32!WSAAccept+0x84 00000000 00000000 00000000 00000000 00000000 0x18fd58 FOLLOWUP_IP: mswsock!GetLspGuid+19af 7369324f 33db xor ebx,ebx SYMBOL_STACK_INDEX: 3 SYMBOL_NAME: mswsock!GetLspGuid+19af FOLLOWUP_NAME: MachineOwner MODULE_NAME: C:\Windows\System32\mswsock IMAGE_NAME: lld DEBUG_FLR_IMAGE_TIMESTAMP: 4ce7c83d STACK_COMMAND: ~0s ; kb FAILURE_BUCKET_ID: APPLICATION_HANG_cfffffff_lld!Unloaded BUCKET_ID: X64_HANG_mswsock!GetLspGuid+19af WATSON_STAGEONE_URL: http://watson.microsoft.com/00000000.htm?Retriage=1 Followup: MachineOwner ---------
I then did
!locks -V
to see which lock it was waiting on and to my surprise it returned this,
0:000:x86> !locks -V CritSec ntdll!RtlCriticalSectionLock+0 at 0000000077057060 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 CritSec ntdll!LdrpLoaderLock+0 at 0000000077057490 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 CritSec ntdll!RtlpDynamicFunctionTableLock+0 at 0000000077057468 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 CritSec ntdll!FastPebLock+0 at 000000007705a900 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 CritSec ntdll!RtlpProcessHeapsListLock+0 at 000000007705a240 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 CritSec +270208 at 0000000000270208 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 1 CritSec ntdll!EtwProvCritSect+0 at 000000007705a120 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 CritSec ntdll!EtwPrivSessionCritSect+0 at 000000007705a1e0 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 CritSec +10208 at 0000000000010208 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 CritSec +276f40 at 0000000000276f40 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0 Scanned 10 critical sections
From looking at the call stack
STACK_TEXT: 0018fc50 77138df4 00000c6c 00000000 00000000 ntdll_77100000!NtWaitForSingleObject+0x15 0018fcb4 77138cd8 00000000 00000000 03fe0940 ntdll_77100000!RtlpWaitOnCriticalSection+0x13e 0018fcdc 7369324f 736a3134 00000000 03fe0940 ntdll_77100000!RtlEnterCriticalSection+0x150 WARNING: Stack unwind information not available. Following frames may be wrong. 0018fcec 7369af5f 00000388 00000000 003d1e00 mswsock!GetLspGuid+0x19af 0018fd08 76366958 00000388 0018fd84 0018fd9c mswsock!GetLspGuid+0x96bf 0018fd38 0018fd58 763668cd 00000388 0018fd84 ws2_32!WSAAccept+0x84 00000000 00000000 00000000 00000000 00000000 0x18fd58
I determined it was waiting on a critical section at address 0x736a3134 (First parameter passed to RtlEnterCriticalSection) so I ran this
!critsec 736a3134
That gave me this output
0:000:x86> !critsec 736a3134 DebugInfo for CritSec at 00000000736a3134 does not point back to the critical section NOT an initialized critical section. CritSec mswsock!WSPStartup+6f64 at 00000000736a3134 WaiterWoken Yes LockCount -1 RecursionCount 11028 OwningThread c6c EntryCount 1f49dad6 ContentionCount 88000000 *** Locked
Now the penny dropped, the pointer to the critical section has become corrupted, possibly due to concurrent thread access and lack of synchronisation elsewhere in the code
My question is how do I track down where this is or find out if it is another problem?
PS: this bug only appears when the application is under heavy load with maybe 700 clients connected
(it is using one thread per connection and I know 32bit applications will be limited to aprox 2000 threads at the default thread stack size and this is not the best approach)
PPS: I have multiple crash dumps where the application is hung waiting on different critical sections, in each case the pointer for the critical section appears not to point to a critical section.
Looking at the output of !analyze -hang -v, it seems you are not using Application Verifier. I would recommend you to collect a hang dump after enabling application verifier "Locks" option. It would certainly give you more information for troubleshooting.
You can download Application Verifier from here:
http://www.microsoft.com/en-us/download/details.aspx?id=20028
More information:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd371695(v=vs.85).aspx
Just to let you know we gave up trying to find out what was causing this. As it only happened when the program was close to its Max virtual memory space (2.1GB 32bit app) due to the one thread per connection approach we where using.
In the end we re-designed the clients so they are not using this server application any-more but now use a SOAP server instead.
The SOAP server seams to scale much better than the datasnap/Midas we where using although we are still to test it at the clients site where the initial problem appeared.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With