
Long running process suspended

I have a .NET 2.0 console application running on a Windows Server GoDaddy VPS in the Visual Studio 2010 IDE in debug mode (F5).

The application periodically freezes (as if the garbage collector had temporarily suspended execution). On rare occasions, however, it never resumes execution!

I've been diagnosing this for months, and am running out of ideas.

  • The application runs as fast as it can (it uses 100% CPU), but at normal priority. It is also multi-threaded.
  • When the application freezes, I can unfreeze it using the VS2010 IDE by pausing/unpausing the process (since it's running in the debugger).
  • The location of last execution, when I pause the frozen process, seems irrelevant.
  • While frozen, the CPU usage is still 100%.
  • Upon unfreezing it, it runs perfectly fine until the next freeze.
  • The server might run 70 days between freezes, or it might only make it 24 hours.
  • Memory usage remains relatively constant; no evidence of any sort of memory leak.

Anyone have any tips for diagnosing what exactly is happening?

Mr. Smith asked Jan 29 '13 06:01




2 Answers

"It is also multi-threaded."

That's the key part of the problem. You are describing a very typical way in which a multi-threaded program can misbehave. It is suffering from deadlock, one of the typical problems with threading.

The info narrows it down a bit further: your process clearly isn't completely frozen, since it still consumes 100% CPU. You probably have a hot wait-loop in your code, a loop that spins while waiting for another thread to signal an event. That is likely to induce an especially nasty variety of deadlock, a live-lock. Live-locks are very sensitive to timing; minor changes in the order in which code runs can bump the program into a live-lock, and back out again.
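To make the "hot wait-loop" idea concrete, here is a minimal sketch in Python rather than .NET. All names (`hot_wait`, `wait_for_signal`, `demo`) are hypothetical, and nothing here is taken from the asker's code. It contrasts a loop that spins on a shared flag with a blocking wait on an event object; in .NET the rough counterpart of the latter would be `WaitOne` on a `ManualResetEvent` or `AutoResetEvent`.

```python
import threading
import time


def hot_wait(flag: dict, deadline_s: float) -> bool:
    """Anti-pattern: poll a shared flag in a tight loop.

    Keeps a core busy the whole time it waits. Returns False if the
    deadline passes before the flag is set.
    """
    end = time.monotonic() + deadline_s
    while not flag["ready"]:
        if time.monotonic() > end:
            return False
    return True


def wait_for_signal(ready: threading.Event, timeout_s: float = 5.0) -> bool:
    """Block (without spinning) until `ready` is set.

    Returns False if the timeout elapses first, so a missed signal shows
    up as a visible failure instead of an endless spin.
    """
    return ready.wait(timeout_s)


def demo() -> bool:
    ready = threading.Event()
    # Hypothetical worker: signals the event after 50 ms.
    worker = threading.Timer(0.05, ready.set)
    worker.start()
    signalled = wait_for_signal(ready)
    worker.join()
    return signalled
```

The blocking version hands the waiting to the runtime, so a thread that is waiting does not compete for CPU with the thread that is supposed to signal it. The timeout is an assumption of this sketch; its only purpose is to turn a lost signal into something you can log.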

Live-locks are extraordinarily difficult to debug, since attempting to do so can make the condition disappear. Attaching a debugger or breaking into the code is enough to alter the thread timing and bump the program out of the condition, which is consistent with your process recovering when you pause and resume it. The same goes for adding logging statements, a common strategy for debugging threading problems: the logging overhead alters the timing, which in turn can make the live-lock disappear entirely.

Nasty stuff, and it is impossible to get help with such a problem from a site like SO, since it depends so heavily on the code. A thorough review of the code is often required to find the cause, and not infrequently a drastic rewrite. Good luck with it.

Hans Passant answered Oct 23 '22 07:10


Does the application have "deadlock recovery/prevention" code? That is, locking with a timeout, then trying again, perhaps after a sleep?

Does the application check error codes (return values or exceptions) and repeatedly retry in case of error anywhere?

Note that such looping can also happen through an event loop, where your code lives only in some event handler; it does not have to be an actual loop in your own code. That is probably not the case here, though, since a frozen application indicates a blocked event loop.

If you have anything like the above, you could try to mitigate the problem by making the timeouts and sleeps random in length, and by adding short random-duration sleeps to the cases where an error might produce a dead-/livelock. If such a loop is performance-sensitive, add a counter and only start sleeping (with a random, perhaps increasing interval) after some number of failed retries. And make sure any sleep you add does not happen while something is locked.
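A rough sketch of that mitigation, in Python rather than .NET, might look like the following. The function name `acquire_with_backoff` and all parameter names and default values are made up for illustration; tune them to your own code. In .NET the timed lock attempt would be something like `Monitor.TryEnter` with a timeout.

```python
import random
import threading
import time


def acquire_with_backoff(lock, try_timeout_s=0.01, fast_retries=3,
                         base_sleep_s=0.001, max_sleep_s=0.1,
                         max_attempts=50, rng=random):
    """Try to take `lock`, retrying with randomized, growing sleeps.

    Returns True once the lock is held, False after `max_attempts` failures.
    """
    for attempt in range(max_attempts):
        # Randomize the lock timeout so competing threads drift out of step.
        timeout = try_timeout_s * rng.uniform(0.5, 1.5)
        if lock.acquire(timeout=timeout):
            return True
        if attempt < fast_retries:
            # Performance-sensitive path: the first few retries do not sleep.
            continue
        # Random sleep with an exponentially growing, capped upper bound.
        # The acquire above failed, so this thread holds no lock while asleep.
        exponent = min(attempt - fast_retries, 10)
        ceiling = min(max_sleep_s, base_sleep_s * (2 ** exponent))
        time.sleep(rng.uniform(0.0, ceiling))
    return False
```

A caller would use it as `if acquire_with_backoff(lock): try: ... finally: lock.release()`. The random jitter is the part that matters: two threads that retry on exactly the same schedule can keep colliding forever, while randomized intervals make it very unlikely that they stay in lockstep.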

If the situation happened more often, you could also use this to bisect your code and pinpoint which loops are responsible (100% CPU usage means some very busy loops are spinning). But from the rarity of the issue, I gather you're going to be happy if the problem just goes away in practice ;)

hyde answered Oct 23 '22 08:10