In the last few months I've received few reports from QA about one of our services hanging. Upon examining a hang dump using WinDbg, every time I discovered the same thing: Loader lock critical section is locked but owning thread is nowhere to be found. Since the thread is gone and the only trace that I can see is a global critical section it left behind, I don't see what code ran on thread thread, or even what DLL that thread came from, it may not even be one of ours (i.e. third party vendor).
This issue is very sporadic, only seen it maybe 3-4 times over the last 6 months occurring naturally in the wild. All other times, service runs perfectly. So this makes me believe it's some kind of timing/race condition thing.
Recently, I've decided to take it upon myself to figure this one out. I setup a machine with WinTask script that constantly starts/stops the said service. Good news is that within 5-6 hours I can reproduce the problem.
Now for next part: how do I isolate it?
This is what I've tried so far:
used "debugger" field in gflags image settings to automagically run the service under cdb whenever it starts. So far this has been running for two days and never hung, so I'm thinking debugger introduced just enough of a timing change to make the issue invisible.
Downloaded Application Verifier and configured the process to run with that. Found a completely unrelated bug where we create CComBSTR temporary variable, assign it to a VARIANT and pass the variant into a function call even though CComBSTR long deleted the allocated string by that point. Don't believe this bug is related because string is read-only and the thread it's running on isn't the one that's dying.
I'm making this post in case you guys could think of something that I'm not considering.
I though there was a windows utility that artificially put load on the CPU and did other things to make race conditions pop up and I thought application verifier did such a thing, but apparently it doesn't. Does anyone know what I'm taking about, or did I just dream that up?
Unless something happens over the weekend my next step would be to disable all debuggers, go back to stock and hack one of DllMains to record THREAD_ATTACH/THREAD_DETACH events. At least I'll be able to intercept the thread that's dying when it gets created. That might shed some light.
Something I might try is attaching a kernel debugger, then run the process under Appilcation Verifier. AV has checks for unloading a DLL while it holds a CS and terminating threads that still hold CS. So those breakpoints should trigger in the kernel debugger and then hopefully you can catch it in the act. Running it under the KD hopefully won't slow it down like the user-mode debugger does.
So turns out I was closer to the solution than I realized. With the service running under cdb, which altered the timing and then running it with application verifier, which altered the timing even more (page heap enabled makes allocation slower), the secret ingredient I was missing was prime95.exe. Running prime95.exe at above normal priority, really screwed up whatever timing I was trying not to change, but it made the problem show up in under 15 minutes.
The cause:
3rd party SDK for acquiring data from hardware boards. When our service starts up, we would query different capture components for their capabilities. After the query is done, we release the component instance. Apparently this one DLL started a separate thread, which acquired a loader lock and then proceeded to do a bunch of initialization in that thread. If during that time, our capability query got done and we released the component, their code would call TerminateThread() on this other thread leaving the loader lock permanently locked. Prime95 slowed everything down just enough for me to catch this race condition and get the following verifier stop message:
=======================================
VERIFIER STOP 00000200: pid 0x1A8C: Thread cannot own a critical section.
0000091C : Thread ID.
77E17340 : Critical section address.
00000000 : Critical section debug information address.
00000000 : Critical section initialization stack trace.
Funny part is that this thread was "disappearing" without any exception of any kind, so debugger wouldn't even catch first chance anything. Who uses TerminateThread????
Thank you, everyone for suggestions and support. I was actually starting to look forward to driving to Radioshack during lunch to buy a serial cable and then spending a few days playing with KD. Looks like that will have to wait till next time :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With