Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get Java stacks when JVM can't reach a safepoint

We recently had a situation where one of our production JVMs would randomly freeze. The Java process was burning CPU, but all visible activity would cease: no log output, nothing written to the GC log, no response to any network request, etc. The process would persist in this state until restarted.

It turned out that the org.mozilla.javascript.DToA class, when invoked on certain inputs, will get confused and call BigInteger.pow with enormous values (e.g. 5^2147483647), which triggers the JVM freeze. My guess is that some large loop, perhaps in java.math.BigInteger.multiplyToLen, was JIT'ed without a safepoint check inside the loop. The next time the JVM needed to pause for garbage collection, it would freeze, because the thread running the BigInteger code wouldn't be reaching a safepoint for a very long time.

My question: in the future, how can I diagnose a safepoint problem like this? kill -3 didn't produce any output; I presume it relies on safepoints to generate accurate stacks. Is there any production-safe tool which can extract stacks from a running JVM without waiting for a safepoint? (In this case, I got lucky and managed to grab a set of stack traces just after BigInteger.pow was invoked, but before it worked its way up to a sufficiently large input to completely wedge the JVM. Without that stroke of luck, I'm not sure how we would ever have diagnosed the problem.)

Edit: the following code illustrates the problem.

// Spawn a background thread to compute an enormous number.
new Thread(){ @Override public void run() {
  try {
    Thread.sleep(5000);
  } catch (InterruptedException ex) {
  }
  BigInteger.valueOf(5).pow(100000000);
}}.start();

// Loop, allocating memory and periodically logging progress, so illustrate GC pause times.
byte[] b;
for (int outer = 0; ; outer++) {
  long startMs = System.currentTimeMillis();
  for (int inner = 0; inner < 100000; inner++) {
    b = new byte[1000];
  }

  System.out.println("Iteration " + outer + " took " + (System.currentTimeMillis() - startMs) + " ms");
}

This launches a background thread which waits 5 seconds and then starts an enormous BigInteger computation. In the foreground, it then repeatedly allocates a series of 100,000 1K blocks, logging the elapsed time for each 100MB series. During the 5 second period, each 100MB series runs in about 20 milliseconds on my MacBook Pro. Once the BigInteger computation begins, we begin to see long pauses interleaved. In one test, the pauses were successively 175ms, 997ms, 2927ms, 4222ms, and 22617ms (at which point I aborted the test). This is consistent with BigInteger.pow() invoking a series of ever-larger multiply operations, each taking successively longer to reach a safepoint.

like image 318
Steve Avatar asked Nov 22 '13 00:11

Steve


3 Answers

Your problem interested me very much. You were right about JIT. First I tried to play with GC types, but this did not have any effect. Then I tried to disable JIT and everything worked fine:

java -Djava.compiler=NONE Tests

Then printed out JIT compilations:

java -XX:+PrintCompilation Tests

And noticed that problem starts after some compilations in BigInteger class, I tried to exclude methods one by one from compilation and finally found the cause:

java -XX:CompileCommand=exclude,java/math/BigInteger,multiplyToLen -XX:+PrintCompilation Tests

For large arrays this method could work long, and problem might really be in safepoints. For some reason they are not inserted, but should be even in compiled code. Looks like a bug. The next step should be to analyze assembly code, I did not do it yet.

like image 141
Mikhail Avatar answered Nov 15 '22 23:11

Mikhail


It's not a bug, it's a performance feature. JVM eliminates safepoint check from counted loops, making them run faster. It expects that either

  • you care about STW pauses and don't have extra long loops
  • or you have extra long loops, but fine with safepoints being eventual

If it doesn't fit you, it can be switched off with this flag: -XX:+UseCountedLoopSafepoints

And answering the title question, you can still stop and explore a program with gdb, but the stack traces wouldn't be so nice.

like image 41
Imaskar Avatar answered Nov 15 '22 23:11

Imaskar


Perhaps that's what the "-F" option of jstack is good for:

OPTIONS
  -F
     Force a stack dump when 'jstack [-l] pid' does not respond.

I always wondered when an why that could help.

like image 1
jfrantzius Avatar answered Nov 15 '22 22:11

jfrantzius