JVM locks up after full gc sporadically

Tags: java, jvm

Recently we noticed that many of our servers sporadically and abruptly (with no discernible gradual degradation) lock up with the following stack, obtained using jstack -F (all other threads are BLOCKED, IN_NATIVE, or IN_VM; the stack is truncated where our code starts):

Thread 18334: (state = IN_JAVA)
 - java.util.Calendar.updateTime() @bci=1, line=2469 (Compiled frame; information may be imprecise)
 - java.util.Calendar.getTimeInMillis() @bci=8, line=1088 (Compiled frame)
(truncated)

The failure seems to occur shortly after a full gc, and top -H -p shows two busy threads: one appears to be the thread above, and the other is a gc thread or a JIT compiler thread, per the following pstack output (note VMThread::run()):

Thread 331 (Thread 0x7f59641bc700 (LWP 16461)):
#0  0x00007f63f9ed0ef8 in SafepointSynchronize::begin() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#1  0x00007f63f9fbab7c in VMThread::loop() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#2  0x00007f63f9fba68e in VMThread::run() () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#3  0x00007f63f9e5e7af in java_start(Thread*) () from /usr/java/jdk1.6.0_33/jre/lib/amd64/server/libjvm.so
#4  0x00000035bb807851 in start_thread () from /lib64/libpthread.so.0
#5  0x00000035bb4e811d in clone () from /lib64/libc.so.6

Does anyone have any ideas why this may have started happening?

We are using jdk1.6.0_33 on CentOS version 5.7 and 6.3 on servers with 24 cores (12 physical).

Here are a few more stacks, with our code truncated:

Thread 22561: (state = IN_VM)
 - java.lang.String.toLowerCase(java.util.Locale) @bci=428, line=2782 (Compiled frame; information may be imprecise)
 - java.lang.String.toLowerCase() @bci=4, line=2847 (Compiled frame)
(truncated)

Thread 22562: (state = IN_VM)
 - java.util.HashMap.put(java.lang.Object, java.lang.Object) @bci=20, line=403 (Compiled frame; information may be imprecise)
 - java.util.HashSet.add(java.lang.Object) @bci=8, line=200 (Compiled frame)
(truncated)

Thread 22558: (state = BLOCKED)
 - sun.nio.ch.EPollSelectorImpl.wakeup() @bci=6, line=173 (Compiled frame)
 - org.mortbay.io.nio.SelectorManager$SelectSet.wakeup() @bci=10, line=706 (Compiled frame)
 - org.mortbay.io.nio.SelectChannelEndPoint.updateKey() @bci=135, line=344 (Compiled frame)
 - org.mortbay.io.nio.SelectChannelEndPoint.undispatch() @bci=10, line=204 (Compiled frame)
 - org.mortbay.jetty.nio.SelectChannelConnector$ConnectorEndPoint.undispatch() @bci=54, line=382 (Compiled frame)
 - org.mortbay.io.nio.SelectChannelEndPoint.run() @bci=44, line=449 (Compiled frame)
 - org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, line=534 (Compiled frame)

Thread 22557: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.Object.wait(long, int) @bci=58, line=443 (Compiled frame)
 - com.stumbleupon.async.Deferred.doJoin(boolean, long) @bci=244, line=1148 (Compiled frame)
 - com.stumbleupon.async.Deferred.join(long) @bci=3, line=1028 (Compiled frame)
(truncated)

Thread 20907: (state = IN_NATIVE)
 - java.net.PlainSocketImpl.socketAccept(java.net.SocketImpl) @bci=0 (Interpreted frame)
 - java.net.PlainSocketImpl.accept(java.net.SocketImpl) @bci=7, line=408 (Interpreted frame)
 - java.net.ServerSocket.implAccept(java.net.Socket) @bci=60, line=462 (Interpreted frame)
 - java.net.ServerSocket.accept() @bci=48, line=430 (Interpreted frame)
 - sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop() @bci=55, line=369 (Interpreted frame)
 - sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run() @bci=1, line=341 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)

Thread 22901: (state = IN_NATIVE)
 - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
 - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
 - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
 - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
 - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Compiled frame)
 - net.spy.memcached.MemcachedConnection.handleIO() @bci=126, line=188 (Compiled frame)
 - net.spy.memcached.MemcachedClient.run() @bci=11, line=1591 (Compiled frame)
asked Nov 14 '12 by jonderry


2 Answers

Answering my own question since we partially found the source of the problem. We had a piece of code like the following in our system:

LinkedList<Foo> foo = getSomePotentiallyLargeList();
long someValue = someCalendar.getTimeInMillis();
// foo.get(i) is O(i) on a LinkedList, so this loop is O(n^2) overall
for (int i = 0; i < foo.size(); i++) {
    if (foo.get(i).someField < someValue) break;
}

This is essentially a bug in our code, because the for loop above can take O(n^2) time: foo is a LinkedList, so each get(i) is a linear traversal. However, hitting one long list in one thread should not have locked up all of our threads; that one thread should have been stuck for a long time while the others continued making progress and the JVM occasionally paused for gc, etc.

The reason our app froze is that when a gc was triggered, the gc threads had to block until all java threads reached a safepoint, and each java thread had to block at its safepoint until the gc completed. It appears that the JVM somehow failed to place a safepoint poll inside the for loop, so the loop would keep executing, potentially for days or more, before a safepoint could be reached.

The last safepoint reached was inside the call to getTimeInMillis(), which is why jstack -F reported an approximate execution location around there. This seems like it must be a JVM bug, since my understanding is that safepoint polls are supposed to be placed on every branch and loop back-edge precisely to prevent gc from getting stuck waiting on one looping thread.
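In case it helps anyone hitting something similar: HotSpot has flags that make long time-to-safepoint stalls visible in the gc log. I'm listing them from memory and we did not have them enabled at the time, so availability and defaults may vary by JDK version:

-XX:+PrintGCApplicationStoppedTime   (logs how long the application was stopped, including time spent waiting for threads to reach the safepoint)
-XX:+PrintSafepointStatistics        (prints per-safepoint timing: spin, block, sync, vmop)
-XX:PrintSafepointStatisticsCount=1  (print statistics for every safepoint rather than a periodic summary)
-XX:+SafepointTimeout                (report threads that fail to reach a safepoint within the timeout)
-XX:SafepointTimeoutDelay=10000      (timeout threshold in milliseconds)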

Unfortunately, I've been unable to reproduce the problem on my own desktop with a small example. For instance, if I run two threads, one executing a loop like the one above and the other just allocating a modest amount of memory, gc does not block the second thread while the first is stuck in its long loop.
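For reference, here is a minimal sketch of the kind of test I mean (the class and variable names and the sizes are illustrative, not the exact code I ran; on my desktop this did not stall the allocating thread):

import java.util.LinkedList;

public class SafepointStallRepro {
    public static void main(String[] args) {
        // Build a large LinkedList so that indexed get() calls are slow.
        final LinkedList<Long> list = new LinkedList<Long>();
        for (int i = 0; i < 200000; i++) {
            list.add(Long.valueOf(i));
        }

        // Thread 1: spends a very long time in a loop over the LinkedList,
        // mimicking the O(n^2) indexed iteration from the example above.
        Thread looper = new Thread(new Runnable() {
            public void run() {
                long sink = 0;
                for (int i = 0; i < list.size(); i++) {   // each get(i) is O(i)
                    sink += list.get(i).longValue();
                }
                System.out.println("looper done: " + sink);
            }
        });

        // Thread 2: allocates steadily so that gcs are triggered while the
        // first thread is stuck in its loop.
        Thread allocator = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 10000; i++) {
                    byte[] garbage = new byte[1024 * 1024]; // ~1 MB per iteration
                    garbage[0] = 1;
                    if (i % 100 == 0) {
                        System.out.println("allocated batch " + i);
                    }
                }
            }
        });

        looper.start();
        allocator.start();
    }
}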

It would be nice to verify that this is indeed what happened and isolate the problem, or at least get a better understanding of how the JVM ensures that safepoints are reachable quickly once gc is triggered. Needless to say, our fix was to stop spending O(n^2) time in the loop, but finding this specific problem was very difficult given the output we had: not only was gc stuck, but because jstack could not report the JVM's actual execution location inside the loop, it was hard to zero in on this bug in our code.
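For completeness, a sketch of the kind of change that avoids the quadratic scan (using the placeholder names from the example above; not necessarily our exact fix): iterate the LinkedList with its iterator so the whole scan is O(n).

LinkedList<Foo> foo = getSomePotentiallyLargeList();
long someValue = someCalendar.getTimeInMillis();
// The enhanced for loop uses the list's iterator, which is O(n) overall
// for a LinkedList, unlike repeated get(i) calls, which are O(i) each.
for (Foo f : foo) {
    if (f.someField < someValue) {
        break;
    }
}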

answered Oct 18 '22 by jonderry

It can help in debugging this kind of thing to collect VM metrics every 5 minutes using something like jmxtrans and graph the data in something like Graphite.

You might think that there is nothing discernible, but that is probably because you are looking at only one data point, namely response time. Collect all the different GC-related data points that the JVM exposes through JMX and see if one of them gives some warning. It might be related to getting within x% of available heap space, if your app regularly allocates and frees that same amount (x%) of heap. You need to study the charts at various scales (zoomed in and zoomed out) to get a feel for what is normal behavior for your app.
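As an illustration of what to collect, here is a minimal sketch using the standard java.lang.management API rather than jmxtrans itself (the class name is made up); it samples the same GC counters and heap usage that jmxtrans would ship to Graphite:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.List;

public class GcStatsSampler {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // Collection counts and accumulated times per collector (e.g. ParNew, CMS).
            List<GarbageCollectorMXBean> gcBeans =
                    ManagementFactory.getGarbageCollectorMXBeans();
            for (GarbageCollectorMXBean gc : gcBeans) {
                System.out.println(gc.getName()
                        + " count=" + gc.getCollectionCount()
                        + " timeMs=" + gc.getCollectionTime());
            }

            // Heap usage, to watch how close the app gets to the available heap.
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = mem.getHeapMemoryUsage();
            System.out.println("heap used=" + heap.getUsed()
                    + " committed=" + heap.getCommitted()
                    + " max=" + heap.getMax());

            Thread.sleep(5 * 60 * 1000L); // sample every 5 minutes, as suggested above
        }
    }
}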

answered Oct 18 '22 by Michael Dillon