Hung JVM consuming 100% CPU

Tags:

java

We have a JAVA server running on Sun JRE 6u20 on Linux 32-bit (CentOS). We use the Server Hotspot with CMS collector with the following options (I've only provided the relevant ones):

-Xmx896m -Xss128k -XX:NewSize=384M -XX:MaxPermSize=96m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC

Sometimes, after running for a while, the JVM seems to slip into a hung state, whereby even though we don't make any requests to the application, the CPU continues to spin at 100% (we have 8 logical CPUs, so it looks like only one CPU does the spinning). In this state the JVM doesn't respond to SIGHUP signals (kill -3) and we can't connect to it normally with jstack. We CAN connect with "jstack -F", but the output is dodgy (we can see lots of NullPointerExceptions from JStack apparently because it wasn't able to 'walk' some stacks). So the "jstack -F" output seems to be useless.

We have run a stack dump from "gdb" though, and we were able to match the thread id that spins the CPU (we found that using "top" with a per-thread view - "H" option) with a thread stack that appears in the gdb result and this is how it looks like:

Thread 443 (Thread 0x7e5b90 (LWP 26310)):
#0  0x0115ebd3 in CompactibleFreeListSpace::block_size(HeapWord const*) const () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#1  0x01160ff9 in CompactibleFreeListSpace::prepare_for_compaction(CompactPoint*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#2  0x0123456c in Generation::prepare_for_compaction(CompactPoint*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#3  0x01229b2c in GenCollectedHeap::prepare_for_compaction() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#4  0x0122a7fc in GenMarkSweep::invoke_at_safepoint(int, ReferenceProcessor*, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#5  0x01186024 in CMSCollector::do_compaction_work(bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#6  0x011859ee in CMSCollector::acquire_control_and_collect(bool, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#7  0x01185705 in ConcurrentMarkSweepGeneration::collect(bool, bool, unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#8  0x01227f53 in GenCollectedHeap::do_collection(bool, bool, unsigned int, bool, int) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#9  0x0115c7b5 in GenCollectorPolicy::satisfy_failed_allocation(unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#10 0x0122859c in GenCollectedHeap::satisfy_failed_allocation(unsigned int, bool) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#11 0x0158a8ce in VM_GenCollectForAllocation::doit() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#12 0x015987e6 in VM_Operation::evaluate() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#13 0x01597c93 in VMThread::evaluate_operation(VM_Operation*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#14 0x01597f0f in VMThread::loop() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#15 0x015979f0 in VMThread::run() () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#16 0x0145c24e in java_start(Thread*) () from /usr/java/jdk1.6.0_20/jre/lib/i386/server/libjvm.so
#17 0x00ccd46b in start_thread () from /lib/libpthread.so.0
#18 0x00bc2dbe in clone () from /lib/libc.so.6

It seems that a JVM thread is spinning while doing some CMS related work. We have checked the memory usage on the box, there seems to be enough memory available and the system is not swapping. Has anyone come across such a situation? Does it look like a JVM bug?

UPDATE

I've obtained some more information about this problem (it happened again on a server that has been running for more than 7 days). When the JVM entered the "hung" state it stayed like that for 2 hours until the server was manually restarted. We have obtained a core dump of the process and the gc log. We tried to get a heap dump as well, but "jmap" failed. We tried to use jmap -F but then only a 4Mb file was written before the program aborted with an exception (something about the a memory location not being accessible). So far I think the most interesting information comes from the gc log. It seems that the GC logging stopped as well (possibly at the time when the VM thread went into the long loop):

657501.199: [Full GC (System) 657501.199: [CMS: 400352K->313412K(524288K), 2.4024120 secs] 660634K->313412K(878208K), [CMS Perm : 29455K->29320K(68568K)], 2.4026470 secs] [Times: user=2.39 sys=0.01, real=2.40 secs] 
657513.941: [GC 657513.941: [ParNew: 314624K->13999K(353920K), 0.0228180 secs] 628036K->327412K(878208K), 0.0230510 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 
657523.772: [GC 657523.772: [ParNew: 328623K->17110K(353920K), 0.0244910 secs] 642036K->330523K(878208K), 0.0247140 secs] [Times: user=0.08 sys=0.00, real=0.02 secs] 
657535.473: [GC 657535.473: [ParNew: 331734K->20282K(353920K), 0.0259480 secs] 645147K->333695K(878208K), 0.0261670 secs] [Times: user=0.11 sys=0.00, real=0.02 secs] 
....
....
688346.765: [GC [1 CMS-initial-mark: 485248K(524288K)] 515694K(878208K), 0.0343730 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 
688346.800: [CMS-concurrent-mark-start]
688347.964: [CMS-concurrent-mark: 1.083/1.164 secs] [Times: user=2.52 sys=0.09, real=1.16 secs] 
688347.964: [CMS-concurrent-preclean-start]
688347.969: [CMS-concurrent-preclean: 0.004/0.005 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 
688347.969: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 688352.986: [CMS-concurrent-abortable-preclean: 2.351/5.017 secs] [Times: user=3.83 sys=0.38, real=5.01 secs] 
688352.987: [GC[YG occupancy: 297806 K (353920 K)]688352.987: [Rescan (parallel) , 0.1815250 secs]688353.169: [weak refs processing, 0.0312660 secs] [1 CMS-remark: 485248K(524288K)] 783055K(878208K), 0.2131580 secs] [Times: user=1.13 sys
=0.00, real=0.22 secs] 
688353.201: [CMS-concurrent-sweep-start]
688353.903: [CMS-concurrent-sweep: 0.660/0.702 secs] [Times: user=0.91 sys=0.07, real=0.70 secs] 
688353.903: [CMS-concurrent-reset-start]
688353.912: [CMS-concurrent-reset: 0.008/0.008 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
688354.243: [GC 688354.243: [ParNew: 344928K->30151K(353920K), 0.0305020 secs] 681955K->368044K(878208K), 0.0308880 secs] [Times: user=0.15 sys=0.00, real=0.03 secs]
....
....
688943.029: [GC 688943.029: [ParNew: 336531K->17143K(353920K), 0.0237360 secs] 813250K->494327K(878208K), 0.0241260 secs] [Times: user=0.10 sys=0.00, real=0.03 secs] 
688950.620: [GC 688950.620: [ParNew: 331767K->22442K(353920K), 0.0344110 secs] 808951K->499996K(878208K), 0.0347690 secs] [Times: user=0.11 sys=0.00, real=0.04 secs] 
688956.596: [GC 688956.596: [ParNew: 337064K->37809K(353920K), 0.0488170 secs] 814618K->515896K(878208K), 0.0491550 secs] [Times: user=0.18 sys=0.04, real=0.05 secs] 
688961.470: [GC 688961.471: [ParNew (promotion failed): 352433K->332183K(353920K), 0.1862520 secs]688961.657: [CMS

I suspect this problem has something to do with the last line in the log (I've added some "...." in order to skip some lines that were not interesting). The fact that the server stayed in the hung state for 2 hours (probably trying to GC and compact the old generation) seems quite strange to me. Also, the gc log stops suddenly with that message and nothing else gets printed any more, probably because the VM Thread gets into some sort of infinite loop (or something that takes 2+ hours).

614

asked Feb 03 '11 09:02

Bogdan

1 Answers

It looks like a GC tuning problem, possibly triggered by a memory leak. I suggest that you add the JVM parameter that turns on GC logging, and see if that gives you any insights.

Does it look like a JVM bug?

Not to me.

When you run out of heap space (for example, as a result of a storage leak) it is common to see the GC running more and more frequently and using a larger and larger percentage of the CPU. There are some JVM parameters that you can set to avoid this, but they have the effect of causing your JVM to keel over sooner with an OOME. (This may sound strange, but it in many cases this is a good thing because it reduces the time until your system is restarted and is able to run normally again.)

128

answered Oct 21 '22 08:10

Stephen C

Related questions
                            
                                Spring 4.3.7 is throwing java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/util/DefaultIndenter
                            
                                How does dependency injection work in Cucumber?
                            
                                ORA-12704: character set mismatch when performing multi-row INSERT of nullable NVARCHAR's
                            
                                How to implement a custom SamplingService for a custom message processor?log after that an element is retrieved and before the sequence is performed
                            
                                How to get the user's country for a HTTP request in the Backend?
                            
                                printing from a ListView Javafx
                            
                                IntelliJ idea can't update any maven repositories
                            
                                Android 8.1 W/Notification: Use of stream types is deprecated
                            
                                Spring RestController returning error as html instead of json
                            
                                How to retrieve Weblogic's Deployment test points through Weblogic Java API
                            
                                FATAL EXCEPTION: java.lang.OutOfMemoryError: EnsureLocalCapacity (FileInputStream.open0, android.graphics.Typeface.getFullFlipFont)
                            
                                Replace access to sun.misc.VM for JDK 11
                            
                                How to correct Greek accented characters in a TextField
                            
                                How to enable logging in org.slf4j for the version: '2.0.0-alpha1' in Spring boot
                            
                                A methology that allows for a single Java code base covering many different versions? [closed]
                            
                                Date and Time helper for PHP (like Joda-Time in Java) [closed]
                            
                                Audio fingerprinting system in Java [closed]
                            
                                How to begin creating 3D (Game-) GUIs for Android Applications (e.g. with OpenGL)? [closed]
                            
                                Debugging project generated by Gradle on Eclipse
                            
                                Get private key from MacOS X keystore using Java

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With