We recently noticed many of our servers sporadically and abruptly locking up (with no discernible gradual degradation) with the following stack, obtained using jstack -F and truncated where our code starts; all other threads are BLOCKED, IN_NATIVE, or IN_VM:
Thread 18334: (state = IN_JAVA)
- java.util.Calendar.updateTime() @bci=1, line=2469 (Compiled frame; information may be imprecise)
- java.util.Calendar.getTimeInMillis() @bci=8, line=1088 (Compiled frame)
(truncated)
The failure seems to occur shortly after a full GC, and top -H -p shows two busy threads: one appears to be the thread above, and the other is a GC or JIT compiler thread; per the following pstack output, though, it is the VM thread waiting in SafepointSynchronize::begin() (note VMThread::run()):
Thread 331 (Thread 0x7f59641bc700 (LWP 16461)):
#0 0x00007f63f9ed0ef8 in SafepointSynchronize::begin() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#1 0x00007f63f9fbab7c in VMThread::loop() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#2 0x00007f63f9fba68e in VMThread::run() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#3 0x00007f63f9e5e7af in java_start(Thread*) () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#4 0x00000035bb807851 in start_thread () from /lib64/libpthread.so.0
#5 0x00000035bb4e811d in clone () from /lib64/libc.so.6
Does anyone have any ideas why this may have started happening?
We are using jdk1.6.0_33 on CentOS 5.7 and 6.3, on servers with 24 cores (12 physical).
Here are a few more stacks, with our code truncated:
Thread 22561: (state = IN_VM)
- java.lang.String.toLowerCase(java.util.Locale) @bci=428, line=2782 (Compiled frame; information may be imprecise)
- java.lang.String.toLowerCase() @bci=4, line=2847 (Compiled frame)
(truncated)
Thread 22562: (state = IN_VM)
- java.util.HashMap.put(java.lang.Object, java.lang.Object) @bci=20, line=403 (Compiled frame; information may be imprecise)
- java.util.HashSet.add(java.lang.Object) @bci=8, line=200 (Compiled frame)
(truncated)
Thread 22558: (state = BLOCKED)
- sun.nio.ch.EPollSelectorImpl.wakeup() @bci=6, line=173 (Compiled frame)
- org.mortbay.io.nio.SelectorManager$SelectSet.wakeup() @bci=10, line=706 (Compiled frame)
- org.mortbay.io.nio.SelectChannelEndPoint.updateKey() @bci=135, line=344 (Compiled frame)
- org.mortbay.io.nio.SelectChannelEndPoint.undispatch() @bci=10, line=204 (Compiled frame)
- org.mortbay.jetty.nio.SelectChannelConnector$ConnectorEndPoint.undispatch() @bci=54, line=382 (Compiled frame)
- org.mortbay.io.nio.SelectChannelEndPoint.run() @bci=44, line=449 (Compiled frame)
- org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, line=534 (Compiled frame)
Thread 22557: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- java.lang.Object.wait(long, int) @bci=58, line=443 (Compiled frame)
- com.stumbleupon.async.Deferred.doJoin(boolean, long) @bci=244, line=1148 (Compiled frame)
- com.stumbleupon.async.Deferred.join(long) @bci=3, line=1028 (Compiled frame)
(truncated)
Thread 20907: (state = IN_NATIVE)
- java.net.PlainSocketImpl.socketAccept(java.net.SocketImpl) @bci=0 (Interpreted frame)
- java.net.PlainSocketImpl.accept(java.net.SocketImpl) @bci=7, line=408 (Interpreted frame)
- java.net.ServerSocket.implAccept(java.net.Socket) @bci=60, line=462 (Interpreted frame)
- java.net.ServerSocket.accept() @bci=48, line=430 (Interpreted frame)
- sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop() @bci=55, line=369 (Interpreted frame)
- sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run() @bci=1, line=341 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)
Thread 22901: (state = IN_NATIVE)
- sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
- sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
- sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
- sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
- sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Compiled frame)
- net.spy.memcached.MemcachedConnection.handleIO() @bci=126, line=188 (Compiled frame)
- net.spy.memcached.MemcachedClient.run() @bci=11, line=1591 (Compiled frame)
A garbage collection pause, also known as a stop-the-world event, happens when a region of memory is full and the JVM needs to reclaim space to continue. During a pause all application threads are suspended. Because a pause also stalls networking, the node can appear to be down to other nodes in a cluster.
If your application's object creation rate is very high, the garbage collection rate must be correspondingly high to keep up, and a high garbage collection rate also increases GC pause times. Optimizing the application to create fewer objects is therefore the most effective strategy for reducing long GC pauses.
When the eden space becomes full, a minor GC takes place. During a minor GC, objects in eden that are still reachable are copied to a survivor space.
When the JVM runs out of space in the heap and is unable to allocate any more objects (an allocation failure), a garbage collection is triggered. The garbage collector reclaims objects in the heap that are no longer referenced by the application, freeing space for new allocations.
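As a rough illustration of the allocation-rate point above, here is a minimal, self-contained sketch (not from the application in question; all names are made up) that churns through short-lived objects. Running it with -verbose:gc -XX:+PrintGCDetails shows a minor collection each time eden fills up and, depending on heap size and collector, promotions into the old generation as well:

import java.util.ArrayList;
import java.util.List;

public class AllocationChurn {
    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<byte[]>();
        for (int i = 0; i < 5000000; i++) {
            // Most of these arrays die young and are reclaimed by a minor GC
            // each time eden fills up.
            byte[] shortLived = new byte[16 * 1024];
            // Retain a small fraction so some objects survive long enough to
            // be promoted to the old generation.
            if (i % 1000 == 0) {
                retained.add(shortLived);
            }
        }
        System.out.println("retained " + retained.size() + " arrays");
    }
}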
Answering my own question, since we have partially tracked down the source of the problem. We had a piece of code like the following in our system:
LinkedList<Foo> foo = getSomePotentiallyLargeList();
long someValue = someCalendar.getTimeInMillis();
for (int i = 0; i < foo.size(); i++) {
    if (foo.get(i).someField < someValue) break;
}
This is essentially a bug in our code, because the above for loop can take O(n^2) time to execute: foo is a LinkedList, so each foo.get(i) call is O(i). However, hitting one long list in one thread should not have locked up all of our threads; that thread should have been stuck for a long time while the others continued making progress and the JVM occasionally paused for GC, etc.
The reason our app froze was that when a GC was triggered, the GC threads block until all Java threads reach a safepoint, and every Java thread that has reached its safepoint blocks until the GC completes. It appears that the JVM failed to place a safepoint poll inside the for loop, so that thread had to keep executing, potentially for days or longer, before the loop finished and a safepoint could be reached.
The last safepoint that thread had reached was inside the call to getTimeInMillis(), which is why jstack -F reported an approximate execution location around there. This seems like it must be a JVM bug, since my understanding is that safepoints are supposed to be placed on every execution path precisely to prevent this sort of problem, where GC is stuck waiting on one looping thread.
Unfortunately, I've been unable to reproduce the problem with a small example on my own desktop. For example, if I run two threads, one executing a loop like the above and the other just allocating a modest amount of memory, GC does not block the second thread while the first is stuck in its long loop.
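For what it's worth, the reproduction attempt was roughly the following shape (all names here are my own for illustration, not from our real code); on my desktop the allocating thread kept making progress while the looping thread was busy:

import java.util.LinkedList;

public class SafepointReproAttempt {
    public static void main(String[] args) throws InterruptedException {
        final LinkedList<Long> list = new LinkedList<Long>();
        for (long i = 0; i < 200000; i++) {
            list.add(i);
        }

        // Thread 1: indexed scan over the LinkedList, mimicking the O(n^2) loop above.
        Thread looper = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < list.size(); i++) {
                    if (list.get(i) < Long.MIN_VALUE + 1) {
                        break; // never taken; just keeps the get(i) result live
                    }
                }
            }
        });

        // Thread 2: allocates a modest amount of memory so GCs keep occurring.
        Thread allocator = new Thread(new Runnable() {
            public void run() {
                long checksum = 0;
                for (int i = 0; i < 5000000; i++) {
                    byte[] garbage = new byte[1024];
                    checksum += garbage.length;
                }
                System.out.println("allocator done, checksum=" + checksum);
            }
        });

        looper.start();
        allocator.start();
        looper.join();
        allocator.join();
    }
}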
It would be nice to verify that this is indeed what happened and to isolate the problem, or at least to get a better understanding of how the JVM ensures safepoints are quickly reachable once a GC is triggered. Needless to say, our fix was to stop spending O(n^2) time in the loop, but finding this specific problem was very difficult given the output we had: not only was GC stuck, but because jstack could not report the JVM's exact execution location inside the loop, it was hard to zero in on this bug in our code.
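For reference, a straightforward linear rewrite of the loop (using the same hypothetical names as the snippet above) iterates the LinkedList with its iterator instead of calling get(i):

long someValue = someCalendar.getTimeInMillis();
// The enhanced for loop walks the LinkedList via its iterator, so the whole
// scan is O(n) instead of O(n^2).
for (Foo f : foo) {
    if (f.someField < someValue) break;
}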
It can help in debugging this kind of thing if you collect a bunch of VM info every 5 minutes using something like jmxtrans and graph the data in something like Graphite.
You might think there is nothing discernible, but that is probably because you are looking at only one data point, namely response time. Collect all the different GC-related data points that the JVM exposes through JMX and see whether one of them gives some warning. The problem might be related to getting within x% of available heap space, if your app regularly allocates and frees that same amount (x%) of heap. You need to study the charts at various scales (zoomed in and zoomed out) to get a feel for what normal behavior looks like for your app.
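If you don't want to set up jmxtrans right away, a minimal in-process sketch of the same idea (class name is my own) is to poll the standard GC and memory MXBeans yourself and log the counters; graphing the deltas between samples is then up to you:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class GcStatsLogger implements Runnable {
    public void run() {
        while (true) {
            // Cumulative collection counts and times per collector; the deltas
            // between samples show when GC behavior starts to change.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName()
                        + " count=" + gc.getCollectionCount()
                        + " timeMs=" + gc.getCollectionTime());
            }
            // Heap occupancy, for spotting "within x% of available heap" patterns.
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.println("heap used=" + heap.getUsed() + " max=" + heap.getMax());
            try {
                Thread.sleep(5 * 60 * 1000L); // sample every 5 minutes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}

Start it once at application startup, e.g. new Thread(new GcStatsLogger(), "gc-stats").start(), or feed the same MXBean values into whatever metrics pipeline you already have.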