Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Windows SuspendThread doesn't? (GetThreadContext fails)

We have an Windows32 application in which one thread can stop another to inspect its state [PC, etc.], by doing SuspendThread/GetThreadContext/ResumeThread.

if (SuspendThread((HANDLE)hComputeThread[threadId])<0)  // freeze thread
   ThreadOperationFault("SuspendThread","InterruptGranule");
CONTEXT Context, *pContext;
Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL);
if (!GetThreadContext((HANDLE)hComputeThread[threadId],&Context))
   ThreadOperationFault("GetThreadContext","InterruptGranule");

Extremely rarely, on a multicore system, GetThreadContext returns error code 5 (Windows system error code "Access Denied").

The SuspendThread documentation seems to clearly indicate that the targeted thread is suspended, if no error is returned. We are checking the return status of SuspendThread and ResumeThread; they aren't complaining, ever.

How can it be the case that I can suspend a thread, but can't access its context?

This blog http://www.dcl.hpi.uni-potsdam.de/research/WRK/2009/01/what-does-suspendthread-really-do/

suggests that SuspendThread, when it returns, may have started the suspension of the other thread, but that thread hasn't yet suspended. In this case, I can kind of see how GetThreadContext would be problematic, but this seems like a stupid way to define SuspendThread. (How would the call of SuspendThread know when the target thread was actually suspended?)

EDIT: I lied. I said this was for Windows.

Well, the strange truth is that I don't see this behavior under Windows XP 64 (at least not in the last week and I don't really know what happened before that)... but we have been testing this Windows application under Wine on Ubuntu 10.x. The Wine source for the guts of GetThreadContext contains an Access Denied return response on line 819 when an attempt to grab the thread state fails for some reason. I'm guessing, but it appears that Wine GetThreadStatus believes that a thread just might not be accessible repeatedly. Why that would be true after a SuspendThead is beyond me, but there's the code. Thoughts?

EDIT2: I lied again. I said we only saw the behavior on Wine. Nope... we have now found a Vista Ultimate system that seems to produce the same error (again, rarely). So, it appears that Wine and Windows agree on an obscure case. It also appears that the mere enabling of the Sysinternals Process monitor program aggravates the situation and causes the problem to appear on Windows XP 64; I suspect a Heisenbug. (The Process Monitor doesn't even exist on the Wine-tasting (:-) machine or the XP 64 system I use for development).

What on earth is it?

EDIT3: Sept 15 2010. I've added careful checking to the error return status, without otherwise disturbing the code, for SuspendThread, ResumeThread, and GetContext. I haven't seen any hint of this behavior on Windows systems since I did that. Haven't gotten back to the Wine experiment.

Nov 2010: Strange. It seems that if I compile this under VisualStudio 2005, it fails on Windows Vista and 7, but not earlier OSes. If I compile under VisualStudio 2010, it doesn't fail anywhere. One might point a finger at VisualStudio2005, but I'm suspicious of a location-sensitivve problem, and different optimizers in VS 2005 and VS 2010 place the code a slightly different places.

Nov 2012: Saga continues. We see this failure on a number of XP and Windows 7 machines, at a pretty low rate (once every several thousand runs). Our Suspend activities are applied to threads that mostly execute pure computational code but that sometimes make calls into Windows. I don't recall seeing this issue when the PC of the thread was in our computational code. Of course, I can't see the PC of the thread when it hangs because GetContext won't give it to me, so I can't directly confirm that the problem only happens when executing system calls. But, all our system calls are channeled through one point, and so far the evidence is that point was executed when we get the hang. So the indirect evidence suggests GetContext on a thread only fails if a system call is being executed by that thread. I haven't had the energy to build a critical experiment to test this hypothesis yet.

like image 644
Ira Baxter Avatar asked Aug 09 '10 21:08

Ira Baxter


3 Answers

Let me quote from Richter/Nassare's "Windows via C++ 5Ed" which may shed some light:

DWORD SuspendThread(HANDLE hThread);

Any thread can call this function to suspend another thread (as long as you have the thread's handle). It goes without saying (but I'll say it anyway) that a thread can suspend itself but cannot resume itself. Like ResumeThread, SuspendThread returns the thread's previous suspend count. A thread can be suspended as many as MAXIMUM_SUSPEND_COUNT times (defined as 127 in WinNT.h). Note that SuspendThread is asynchronous with respect to kernel-mode execution, but user-mode execution does not occur until the thread is resumed.

In real life, an application must be careful when it calls SuspendThread because you have no idea what the thread might be doing when you attempt to suspend it. If the thread is attempting to allocate memory from a heap, for example, the thread will have a lock on the heap. As other threads attempt to access the heap, their execution will be halted until the first thread is resumed. SuspendThread is safe only if you know exactly what the target thread is (or might be doing) and you take extreme measures to avoid problems or deadlocks caused by suspending the thread.

...

Windows actually lets you look inside a thread's kernel object and grab its current set of CPU registers. To do this, you simply call GetThreadContext:

BOOL GetThreadContext( HANDLE hThread, PCONTEXT pContext);

To call this function, just allocate a CONTEXT structure, initialize some flags (the structure's ContextFlags member) indicating which registers you want to get back, and pass the address of the structure to GetThreadContext. The function then fills in the members you've requested.

You should call SuspendThread before calling GetThreadContext; otherwise, the thread might be scheduled and the thread's context might be different from what you get back. A thread actually has two contexts: user mode and kernel mode. GetThreadContext can return only the user-mode context of a thread. If you call SuspendThread to stop a thread but that thread is currently executing in kernel mode, its user-mode context is stable even though SuspendThread hasn't actually suspended the thread yet. But the thread cannot execute any more user-mode code until it is resumed, so you can safely consider the thread suspended and GetThreadContext will work.

My guess is that GetThreadContext may fail if you just called SuspendThread, while the thread is in kernel mode, and the kernel is locking the thread context block at this time.

Maybe on multicore systems, one core is handling the kernel-mode execution of the thread that it's user mode was just suspended, keep locking the CONTEXT structure of the thread, exactly when the other core is calling GetThreadContext.

Since this behaviour is not documented, I suggest contacting microsoft.

like image 154
Lior Kogan Avatar answered Nov 07 '22 04:11

Lior Kogan


There are some particular problems surrounding suspending a thread that owns a CriticalSection. I can't find a good reference to it now, but there is one mention of it on Raymond Chen's blog and another mention on Chris Brumme's blog. Basically, if you are unlucky enough to call SuspendThread while the thread is accessing an OS lock (e.g., heap lock, DllMain lock, etc.), then really strange things can happen. I would assume that this is the case that you are running into extremely rarely.

Does retrying the call to GetThreadContext work after a processor yield like Sleep(0)?

like image 33
D.Shawley Avatar answered Nov 07 '22 03:11

D.Shawley


Old issue but good to see you still kept it updated with status changes after experiencing the issue for another more than 2 years.

The cause of your problem is that there is a bug in the translation layer of the x64 version of WoW64, as per:

http://social.msdn.microsoft.com/Forums/en/windowscompatibility/thread/1558e9ca-8180-4633-a349-534e8d51cf3a

There is a rather critical bug in GetThreadContext under WoW64 which makes it return stale contents which makes it unusable in many situations. The contents is stored in user-mode This is why you think the value is not-null but in the stale contents it is still null.

This is why it fails on newer OS but not older ones, try running it on Windows 7 32bit OS.

As for why this bug seems to happen less often with solutions built on Visual Studio 2010 / 2012 it is likely that there is something the compiler is doing which is mitigating most of the problem, for this you should inspect the IL generated from both 2005 and 2010 and see what the differences are. For example does the problem happen if the project is built without optimizations perhaps?

Finally, some further reading:

http://www.nynaeve.net/?p=129

like image 3
Seph Avatar answered Nov 07 '22 03:11

Seph