I am trying to demonstrate an "anytime algorithm" - an algorithm that can be stopped at any time and returns its current result. The demo algorithm just returns some mathematical function of i, where i is increasing. It chcecks whether it is interrupted, and if so, returns the current value:
static int algorithm(int n) {
int bestSoFar = 0;
for (int i=0; i<n; ++i) {
if (Thread.interrupted())
break;
bestSoFar = (int)Math.pow(i, 0.3);
}
return bestSoFar;
}
In the main program, I use it like this:
Runnable task = () -> {
Instant start = Instant.now();
int bestSoFar = algorithm(1000000000);
double durationInMillis = Duration.between(start, Instant.now()).toMillis();
System.out.println("after "+durationInMillis+" ms, the result is "+bestSoFar);
};
Thread t = new Thread(task);
t.start();
Thread.sleep(1);
t.interrupt();
t = new Thread(task);
t.start();
Thread.sleep(10);
t.interrupt();
t = new Thread(task);
t.start();
Thread.sleep(100);
t.interrupt();
t = new Thread(task);
t.start();
Thread.sleep(1000);
t.interrupt();
}
}
When I run this program, I get the following input:
after 0.0 ms, the result is 7
after 10.0 ms, the result is 36
after 100.0 ms, the result is 85
after 21952.0 ms, the result is 501
I.e, the first three threads are really interrupted when I tell them to, but the last thread is not interrupted after 1 second - it continues working for almost 22 seconds. Why does this happen?
EDIT: I get similar results using Future.get with a timeout. In this code:
Instant start = Instant.now();
ExecutorService executor = Executors.newCachedThreadPool();
Future<?> future = executor.submit(task);
try {
future.get(800, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
future.cancel(true);
double durationInMillis = Duration.between(start, Instant.now()).toMillis();
System.out.println("Timeout after "+durationInMillis+" [ms]");
}
if the timeout is at most 800 then everything works find and it prints something like "Timeout after 806.0 [ms]". But if the timeout is 900 it prints "Timeout after 5084.0 [ms]".
EDIT 2: my computer has 4 cores. The program rusn on Open JDK 8.
I can confirm this is a HotSpot JVM bug. Here is my initial analysis of the problem.
@AdamSkywalker was absolutely right in the assumption that the issue is related to safepoint elimination optimization in HotSpot HIT compiler. Though the bug JDK-8154302 looks similar, in fact it is a different problem.
Safepoint is the JVM mechanism to stop application threads in order to perform an operation that requires stop-the-world pause. Safepoints in HotSpot are collaborative, i.e. application threads periodically check if they need to stop. This check typically happens at method exits and inside loops.
Of course, this check is not free. So, for performance reasons JVM tries to eliminate redundant safepoint polls. One of such optimizations is to remove safepoint polls from the counted loops - the loops of the form
for (int i = 0; i < N; i++)
or equivalent. Here N is a loop invariant of int
type.
Usually these loops are short-running, but in some cases they can take really long, e.g. when N = 2_000_000_000. Safepoint operation requires that all Java threads are stopped (excluding threads runnings native methods). That is, a single long-running counted loop may delay the whole safepoint operation, and all other threads will wait for this one to stop.
That's exactly what happens in JDK-8154302. Note that
int l = 0;
while (true) {
if (++l == 0) ...
}
is just another way to express a counted loop of 232 iterations. When Thread.sleep
returns from a native function and finds that safepoint operation is requested, it stops and waits until the long-running counted loop also completes. That's where the weird delays come from.
There is a task to fix this problem - JDK-8186027. The idea is to split one long loop into two parts:
for (int i = 0; i < N; i += step) {
for (int j = 0; j < step; j++) {
// loop body
}
safepoint_poll();
}
It is not yet implemented, but the fix is targeted to JDK 10. Meanwhile there is a workaround: the JVM flag -XX:+UseCountedLoopSafepoints
will force safepoint checks inside counted loops, too.
I'm pretty sure Thread.sleep bug will be closed as a duplicate of Loop strip mining issue. You may verify that this bug goes away with -XX:+UseCountedLoopSafepoints
option.
Unfortunately, this option does not help with the original problem. I caught the moment when algorithm
in the original question hangs and looked at the code being executed under gdb:
loop_begin:
0x00002aaaabe903d0: mov %ecx,%r11d
0x00002aaaabe903d3: inc %r11d ; i++
0x00002aaaabe903d6: cmp %ebp,%r11d ; if (i >= n)
0x00002aaaabe903d9: jge 0x2aaaabe90413 ; break;
0x00002aaaabe903db: mov %ecx,%r8d
0x00002aaaabe903de: mov %r11d,%ecx
0x00002aaaabe903e1: mov 0x1d0(%r15),%rsi ; rsi = Thread.current();
0x00002aaaabe903e8: mov 0x1d0(%r15),%r10 ; r10 = Thread.current();
0x00002aaaabe903ef: cmp %rsi,%r10 ; if (rsi != r10)
0x00002aaaabe903f2: jne 0x2aaaabe903b9 ; goto slow_path;
0x00002aaaabe903f4: mov 0x128(%r15),%r10 ; r10 = current_os_thread();
0x00002aaaabe903fb: mov 0x14(%r10),%r11d ; isInterrupted = r10.interrupt_flag;
0x00002aaaabe903ff: test %r11d,%r11d ; if (!isInterrupted)
0x00002aaaabe90402: je 0x2aaaabe903d0 ; goto loop_begin
This is how the loop in algorithm
method is compiled. There is no safepoint poll here, even when -XX:+UseCountedLoopSafepoints
is set.
Looks like safepoint checks were incorrectly eliminated because of Thread.isInterrupted
call which was supposed to check the safepoint itself. However, Thread.isInterrupted
is HotSpot intrinsic method. It means there is no real native method call, but JIT replaces a call to Thread.isInterrupted
with a sequence of machine instructions with no safepoint checks inside.
I'll report the bug to Oracle shortly. Meanwhile, the workaround is to change the type of loop counter from int
to long
. If you rewrite the loop as
for (long i=0; i<n; ++i) { ...
there will be no strange delays anymore.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With