Trying to debug a misbehaving Java VM. The process in question is a large VM (100GB heap) running Sun JVM 1.6u24 on CentOS 5 that does routine back-end work, i.e. database access, file I/O and so forth.
After the process was restarted for a software version upgrade, we noticed that its throughput had dropped significantly. Most of the time, top reports that the Java process is fully utilizing 2 cores. During that time the VM is totally non-responsive: no logs are written and it doesn't respond to outside tools such as jstack or kill -3. Once the VM recovers, the process continues as normal until the next hang.
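For context, the external probes mentioned above amount to roughly the following; the pid is a placeholder, and jstack -F (the forced-dump variant) is listed as an extra option to try rather than something already covered:
jstack <java-pid>        # normal attach; hangs while the VM is frozen
jstack -F <java-pid>     # forced dump via the serviceability agent; sometimes works when the normal attach does not
kill -3 <java-pid>       # asks the JVM to write a thread dump to its stdout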
strace shows that during these hangs, only 2 threads make system calls: the internal VM threads "VM Thread" (21776) and "VM Periodic Task Thread" (21786). Presumably, these 2 threads are using up the CPU time. The application threads occasionally wake up and do their work; the rest of the time they seem to be waiting on various futexes. Incidentally, the first line of the normal phase is always a SIGSEGV.
[pid 21776] sched_yield() = 0
[pid 21776] sched_yield() = 0
[pid 21776] sched_yield( <unfinished ...>
[pid 21786] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 21776] <... sched_yield resumed> ) = 0
[pid 21786] futex(0x2aabac71ef28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 21776] sched_yield( <unfinished ...>
[pid 21786] <... futex resumed> ) = 0
[pid 21786] clock_gettime(CLOCK_MONOTONIC, {517080, 280918033}) = 0
[pid 21786] clock_gettime(CLOCK_REALTIME, {1369750039, 794028000}) = 0
[pid 21786] futex(0x2aabb81b94c4, FUTEX_WAIT_PRIVATE, 1, {0, 49923000} <unfinished ...>
[pid 21776] <... sched_yield resumed> ) = 0
[pid 21776] sched_yield() = 0
[pid 21776] sched_yield() = 0
[pid 21955] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
[pid 21955] rt_sigreturn(0x2b1cde2f54ad <unfinished ...>
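For reference, a per-thread trace like the one above can be captured with something along these lines (the output file name and timestamp flag are illustrative):
strace -f -tt -p <java-pid> -o hang.trace    # -f attaches to all of the process's threads, -tt adds timestamps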
The problem manifests itself on 2 different servers. Rolling back our code version only worked for one of the 2 servers. No error messages were reported in the system logs, and another Java process on the affected machine is behaving correctly.
The following output was obtained with gstack and shows 2 typical waiting application threads:
Thread 552 (Thread 0x4935f940 (LWP 21906)):
#0 0x00000030b040ae00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002b1cdd8548d6 in os::PlatformEvent::park(long) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#2 0x00002b1cdd92b230 in ObjectMonitor::wait(long, bool, Thread*) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#3 0x00002b1cdd928853 in ObjectSynchronizer::wait(Handle, long, Thread*) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#4 0x00002b1cdd69b716 in JVM_MonitorWait () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#5 0x00002b1cde193cc8 in ?? ()
#6 0x00002b1ce2552d90 in ?? ()
#7 0x00002b1cdd84fc23 in os::javaTimeMillis() () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#8 0x00002b1cde188a82 in ?? ()
#9 0x0000000000000000 in ?? ()
Thread 551 (Thread 0x49460940 (LWP 21907)):
#0 0x00000030b040ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002b1cdd854d6f in Parker::park(bool, long) () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#2 0x00002b1cdd98a1c8 in Unsafe_Park () from /usr/lib/jvm/java/jre/lib/amd64/server/libjvm.so
#3 0x00002b1cde193cc8 in ?? ()
#4 0x000000004945f798 in ?? ()
#5 0x00002b1cde188a82 in ?? ()
#6 0x0000000000000000 in ?? ()
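gstack is a thin wrapper around gdb, so if it is not available the same native stacks can be obtained with gdb directly (pid is a placeholder):
gdb -batch -ex "thread apply all bt" -p <java-pid>   # dump native stack traces of every thread, then detach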
We looked at issues with NTPD, including leap second bugs, but the suggested workarounds didn't help, and neither did using external NTPD servers. Restarting the machine itself didn't help either. We have GC logging enabled, and it doesn't look like a GC issue, as there are no messages indicating one. Any suggestions that could help with this issue would be much appreciated.
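For reference, the commonly suggested leap-second workaround (which did not help here) is roughly the following on CentOS; the idea is to stop NTP and write the current time back to the kernel clock:
service ntpd stop
date -s "$(date)"    # re-setting the clock to its current value is the usual fix for the kernel leap-second/futex issue
service ntpd start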
If the process runs under wrapper software (a service wrapper), check its ping timeout property. The wrapper communicates with your JVM every so often to make sure that it is alive. If that communication fails for whatever reason, the wrapper deems the process hung and attempts to restart it.
Reference: the wrapper's "Freezes" troubleshooting page and the ping interval setting, which configures the number of seconds between two ping requests to the JVM.
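If this is the Tanuki Java Service Wrapper, the relevant wrapper.conf properties would look roughly like this (property names come from its documentation; the values are only illustrative):
wrapper.ping.interval=5     # seconds between ping requests sent to the JVM
wrapper.ping.timeout=300    # seconds without a ping response before the JVM is considered hung and restarted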
Here are a couple of things I'd look at (a command sketch follows the list):
- When the JVM is unresponsive, use iostat and vmstat to see if the system is thrashing. This can happen when you over-allocate memory, i.e. your overall system is using significantly more virtual memory than physical memory.
- Turn on the JVM's GC logging and see if there is a correlation between the JVM going unresponsive and GC runs.
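A minimal sketch of the corresponding commands and flags, assuming a HotSpot 1.6 VM (the log path and sampling intervals are illustrative):
vmstat 5       # watch the si/so columns for swap-in/swap-out activity
iostat -x 5    # watch per-device utilization and await times
java ... -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/app/gc.log ...
If si/so stay non-zero while the VM is frozen, the machine is swapping, which with a 100GB heap would explain multi-second stalls; the GC log with application-stopped times would show whether the pauses line up with collections.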