Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why isn't the last thread interrupted?

I am trying to demonstrate an "anytime algorithm" - an algorithm that can be stopped at any time and returns its current result. The demo algorithm just returns some mathematical function of i, where i is increasing. It chcecks whether it is interrupted, and if so, returns the current value:

    static int algorithm(int n) {
        int bestSoFar = 0;
        for (int i=0; i<n; ++i) {
            if (Thread.interrupted())
                break;
            bestSoFar = (int)Math.pow(i, 0.3);
        }
        return bestSoFar;
    }

In the main program, I use it like this:

        Runnable task = () -> {
            Instant start = Instant.now();
            int bestSoFar = algorithm(1000000000);
            double durationInMillis = Duration.between(start, Instant.now()).toMillis();
            System.out.println("after "+durationInMillis+" ms, the result is "+bestSoFar);
        };

        Thread t = new Thread(task);
        t.start();
        Thread.sleep(1);
        t.interrupt();

        t = new Thread(task);
        t.start();
        Thread.sleep(10);
        t.interrupt();

        t = new Thread(task);
        t.start();
        Thread.sleep(100);
        t.interrupt();

        t = new Thread(task);
        t.start();
        Thread.sleep(1000);
        t.interrupt();
    }
}

When I run this program, I get the following input:

after 0.0 ms, the result is 7
after 10.0 ms, the result is 36
after 100.0 ms, the result is 85
after 21952.0 ms, the result is 501

I.e, the first three threads are really interrupted when I tell them to, but the last thread is not interrupted after 1 second - it continues working for almost 22 seconds. Why does this happen?

EDIT: I get similar results using Future.get with a timeout. In this code:

    Instant start = Instant.now();
    ExecutorService executor = Executors.newCachedThreadPool();
    Future<?> future = executor.submit(task);
    try {
        future.get(800, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        future.cancel(true);
        double durationInMillis = Duration.between(start, Instant.now()).toMillis();
        System.out.println("Timeout after "+durationInMillis+" [ms]");
    }

if the timeout is at most 800 then everything works find and it prints something like "Timeout after 806.0 [ms]". But if the timeout is 900 it prints "Timeout after 5084.0 [ms]".

EDIT 2: my computer has 4 cores. The program rusn on Open JDK 8.

like image 234
Erel Segal-Halevi Avatar asked Nov 26 '17 20:11

Erel Segal-Halevi


Video Answer


1 Answers

I can confirm this is a HotSpot JVM bug. Here is my initial analysis of the problem.

@AdamSkywalker was absolutely right in the assumption that the issue is related to safepoint elimination optimization in HotSpot HIT compiler. Though the bug JDK-8154302 looks similar, in fact it is a different problem.

What is the safepoint problem

Safepoint is the JVM mechanism to stop application threads in order to perform an operation that requires stop-the-world pause. Safepoints in HotSpot are collaborative, i.e. application threads periodically check if they need to stop. This check typically happens at method exits and inside loops.

Of course, this check is not free. So, for performance reasons JVM tries to eliminate redundant safepoint polls. One of such optimizations is to remove safepoint polls from the counted loops - the loops of the form

    for (int i = 0; i < N; i++)

or equivalent. Here N is a loop invariant of int type.

Usually these loops are short-running, but in some cases they can take really long, e.g. when N = 2_000_000_000. Safepoint operation requires that all Java threads are stopped (excluding threads runnings native methods). That is, a single long-running counted loop may delay the whole safepoint operation, and all other threads will wait for this one to stop.

That's exactly what happens in JDK-8154302. Note that

    int l = 0;
    while (true) {
        if (++l == 0) ...
    }

is just another way to express a counted loop of 232 iterations. When Thread.sleep returns from a native function and finds that safepoint operation is requested, it stops and waits until the long-running counted loop also completes. That's where the weird delays come from.

There is a task to fix this problem - JDK-8186027. The idea is to split one long loop into two parts:

    for (int i = 0; i < N; i += step) {
        for (int j = 0; j < step; j++) {
            // loop body
        }
        safepoint_poll();
    }

It is not yet implemented, but the fix is targeted to JDK 10. Meanwhile there is a workaround: the JVM flag -XX:+UseCountedLoopSafepoints will force safepoint checks inside counted loops, too.

What is wrong with Thread.interrupted()

I'm pretty sure Thread.sleep bug will be closed as a duplicate of Loop strip mining issue. You may verify that this bug goes away with -XX:+UseCountedLoopSafepoints option.

Unfortunately, this option does not help with the original problem. I caught the moment when algorithm in the original question hangs and looked at the code being executed under gdb:

loop_begin:
  0x00002aaaabe903d0:  mov    %ecx,%r11d
  0x00002aaaabe903d3:  inc    %r11d             ; i++
  0x00002aaaabe903d6:  cmp    %ebp,%r11d        ; if (i >= n)
  0x00002aaaabe903d9:  jge    0x2aaaabe90413    ;     break;
  0x00002aaaabe903db:  mov    %ecx,%r8d
  0x00002aaaabe903de:  mov    %r11d,%ecx
  0x00002aaaabe903e1:  mov    0x1d0(%r15),%rsi  ; rsi = Thread.current();
  0x00002aaaabe903e8:  mov    0x1d0(%r15),%r10  ; r10 = Thread.current();
  0x00002aaaabe903ef:  cmp    %rsi,%r10         ; if (rsi != r10)
  0x00002aaaabe903f2:  jne    0x2aaaabe903b9    ;     goto slow_path;
  0x00002aaaabe903f4:  mov    0x128(%r15),%r10  ; r10 = current_os_thread();
  0x00002aaaabe903fb:  mov    0x14(%r10),%r11d  ; isInterrupted = r10.interrupt_flag;
  0x00002aaaabe903ff:  test   %r11d,%r11d       ; if (!isInterrupted)
  0x00002aaaabe90402:  je     0x2aaaabe903d0    ;     goto loop_begin

This is how the loop in algorithm method is compiled. There is no safepoint poll here, even when -XX:+UseCountedLoopSafepoints is set.

Looks like safepoint checks were incorrectly eliminated because of Thread.isInterrupted call which was supposed to check the safepoint itself. However, Thread.isInterrupted is HotSpot intrinsic method. It means there is no real native method call, but JIT replaces a call to Thread.isInterrupted with a sequence of machine instructions with no safepoint checks inside.

I'll report the bug to Oracle shortly. Meanwhile, the workaround is to change the type of loop counter from int to long. If you rewrite the loop as

    for (long i=0; i<n; ++i) { ...

there will be no strange delays anymore.

like image 110
apangin Avatar answered Oct 17 '22 00:10

apangin