
JVM OutOfMemory error "death spiral" (not memory leak)

We have recently been migrating a number of applications from Red Hat Linux (JDK 1.6.0_03) to Solaris 10u8 (JDK 1.6.0_16) on much higher-spec machines, and we have noticed what seems to be a rather pressing problem: under certain loads our JVMs get themselves into a "death spiral" and eventually run out of memory. Things to note:

  • This is not a case of a memory leak. These are applications which have been running just fine (in one case for over 3 years), and the out-of-memory errors are not deterministic. Sometimes the applications work, sometimes they don't.
  • This is not us moving to a 64-bit VM. We are still running 32-bit.
  • In one case, using the latest G1 garbage collector on 1.6.0_18 seems to have solved the problem. In another, moving back to 1.6.0_03 has worked.
  • Sometimes our apps are falling over with HotSpot SIGSEGV errors.
  • This is affecting applications written in Java as well as Scala.

The most important point is this: the behaviour manifests itself in those applications which suddenly get a deluge of data (usually via TCP). It's as if the VM decides to keep adding more data (possibly promoting it to the tenured generation, TG) rather than running a GC on "newspace", until it realises that it has to do a full GC. Then, despite practically everything in the VM being garbage, it somehow decides not to collect it!

It sounds crazy, but I just don't see what else it could be. How else can you explain an app which one minute falls over with a max heap of 1 GB and the next works just fine (never going above 256 MB) when it is doing exactly the same thing?
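
One way the promotion theory above could be checked from inside the app is to log how full each heap pool is while the data is arriving. This is only a minimal sketch: the class name PoolSnapshot is made up for illustration, and the pool names it prints depend on the JVM version and the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; PoolSnapshot is an illustrative name, not part of our apps.
class PoolSnapshot {

    // Returns one line per heap memory pool (young, survivor, old/tenured, ...).
    static List<String> heapPools() {
        List<String> lines = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // getUsage() returns null if the pool is no longer valid
            }
            // getMax() is -1 when the pool has no defined maximum.
            lines.add(pool.getName()
                    + ": used=" + usage.getUsed()
                    + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : heapPools()) {
            System.out.println(line);
        }
    }
}
```

Calling heapPools() periodically during a burst would show whether the old generation grows while the young generation is never emptied.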

So my questions are:

  1. Has anyone else observed this kind of behaviour?
  2. Has anyone any suggestions as to how I might debug the JVM itself (as opposed to my app)? How do I prove this is a VM issue?
  3. Are there any VM-specialist forums out there where I can ask the VM's authors (assuming they aren't on SO)? (We have no support contract)
  4. If this is a bug in the latest versions of the VM, how come no-one else has noticed it?
oxbow_lakes asked Feb 19 '10 16:02



1 Answer

Interesting problem. Sounds like one of the garbage collectors works poorly on your particular situation.

Have you tried changing the garbage collector being used? There are a LOT of GC options, and figuring out which ones are optimal seems to be a bit of a black art, but I wonder if a basic change would work for you.

I know there is a "Server" GC that tends to work a lot better than the default ones. Are you using that?
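
If you're not sure which collectors your VM is actually running, something like the following should tell you from inside the app. It's only a sketch: the class name GcReport is made up, and the collector names it prints vary with the JVM version and the flags you start it with.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; GcReport is an illustrative name.
class GcReport {

    // Returns one line per collector registered in the running JVM.
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both counters return -1 if the collector does not report them.
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", totalMillis=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeCollectors()) {
            System.out.println(line);
        }
    }
}
```

Comparing the young-generation and old-generation counts before and after a burst of data should show whether minor collections are happening at all.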

Threaded GC (which I believe is the default) is probably the worst for your particular situation. I've noticed that it tends to be much less aggressive when the machine is busy.

One other thing I've noticed: it often takes two GCs to convince Java to actually take out the trash. I think the first one tends to unlink a bunch of objects and the second actually deletes them. What you might want to do is occasionally force two garbage collections in a row. This WILL cause a significant GC pause, but I've never seen a case where it took more than two to clean out the entire heap.
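
Here's roughly what I mean by forcing two collections, as a sketch only. DoubleGc is a made-up name, and keep in mind that System.gc() is just a request: the VM is free to ignore it, and it is ignored entirely if the VM was started with -XX:+DisableExplicitGC.

```java
// Hypothetical helper; DoubleGc is an illustrative name.
class DoubleGc {

    // Heap currently in use, as reported by the runtime.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Requests two collections back to back.
    // Returns used heap before, after the first request, and after the second.
    static long[] collectTwice() {
        long[] used = new long[3];
        used[0] = usedHeapBytes();
        System.gc(); // first request
        used[1] = usedHeapBytes();
        System.gc(); // second request
        used[2] = usedHeapBytes();
        return used;
    }

    public static void main(String[] args) {
        long[] used = collectTwice();
        System.out.println("before:       " + used[0]);
        System.out.println("after 1st GC: " + used[1]);
        System.out.println("after 2nd GC: " + used[2]);
    }
}
```

Logging the three numbers lets you see whether the second collection actually frees anything the first one missed in your case.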

Bill K answered Oct 17 '22 13:10
