Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance Explanation: code runs slower after warm up

The code below runs the exact same calculation 3 times (it does not do much: basically adding all numbers from 1 to 100m). The first 2 blocks run approximately 10 times faster than the third one. I have run this test program more than 10 times and the results show very little variance.

If anything, I would expect the third block to run faster (JIT compilation), but the typical output is:

35974537
36368455
296471550

Can somebody explain what is happening? (Just to be clear, I'm not trying to fix anything here, just trying to better understand what is going on)

Note:

  • no GC is run during the program (monitored with -XX:+PrintGC)
  • tested with Oracle JDK versions 1.6.0_30, 1.7.0_02 and 1.7.0_05
  • also tested with the following parameters: -XX:+PrintGC -Xms1000m -Xmx1000m -XX:NewSize=900m => same result
  • it the block is put in a loop instead, all runs are fast
  • if the block is extracted to a method, all runs are fast (whether the method is called 3 times or in a loop makes no difference)
public static void main(String... args) {
    //three identical blocks
    {
        long start = System.nanoTime();
        CountByOne c = new CountByOne();
        int sum = 0;
        for (int i = 0; i < 100000000; i++) {
            sum += c.getNext();
        }
        if (sum != c.getSum()) throw new IllegalStateException(); //use sum
        long end = System.nanoTime();
        System.out.println((end - start));
    }
    {
        long start = System.nanoTime();
        CountByOne c = new CountByOne();
        int sum = 0;
        for (int i = 0; i < 100000000; i++) {
            sum += c.getNext();
        }
        if (sum != c.getSum()) throw new IllegalStateException(); //use sum
        long end = System.nanoTime();
        System.out.println((end - start));
    }
    {
        long start = System.nanoTime();
        CountByOne c = new CountByOne();
        int sum = 0;
        for (int i = 0; i < 100000000; i++) {
            sum += c.getNext();
        }
        if (sum != c.getSum()) throw new IllegalStateException(); //use sum
        long end = System.nanoTime();
        System.out.println((end - start));
    }
}

public static class CountByOne {

    private int i = 0;
    private int sum = 0;

    public int getSum() {
        return sum;
    }

    public int getNext() {
        i += 1;
        sum += i;
        return i;
    }
}
like image 507
assylias Avatar asked Aug 20 '12 10:08

assylias


2 Answers

Short: The Just In Time Compiler is dumb.

First of all you can use the option -XX:+PrintCompilation to see WHEN the JIT is doing something. Then you will see something like this:

$ java -XX:+PrintCompilation weird
    168    1             weird$CountByOne::getNext (28 bytes)
    174    1 %           weird::main @ 18 (220 bytes)
    279    1 %           weird::main @ -2 (220 bytes)   made not entrant
113727636
    280    2 %           weird::main @ 91 (220 bytes)
106265475
427228826

So you see that the method main is compiled sometimes during the first and the second block.

Adding the options -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOption will give you more information about what the JIT is doing. Note, it requires hsdis-amd64.so which seems to be not very well available on common Linux distributions. You might have tom compile it on your own from the OpenJDK.

What you get is a huge chunk of assembler code for getNext and main.

For me, in the first compilation it seems that only the first block in main is actually compiled, you can tell by the line numbers. It contains funny things like this:

  0x00007fa35505fc5b: add    $0x1,%r8           ;*ladd
                                                ; - weird$CountByOne::getNext@6 (line 12)
                                                ; - weird::main@28 (line 31)
  0x00007fa35505fc5f: mov    %r8,0x10(%rbx)     ;*putfield i
                                                ; - weird$CountByOne::getNext@7 (line 12)
                                                ; - weird::main@28 (line 31)
  0x00007fa35505fc63: add    $0x1,%r14          ;*ladd
                                                ; - weird::main@31 (line 31)

(Indeed it is very long, due to unrolling and inlining of the loop)

Appearently during the recompile of main, the second AND third block is compiled. The second block there looks very similar to the first version. (Again just an excerpt)

 0x00007fa35505f05d: add    $0x1,%r8           ;*ladd
                                                ; - weird$CountByOne::getNext@6 (line 12)
                                                ; - weird::main@101 (line 42)
  0x00007fa35505f061: mov    %r8,0x10(%rbx)     ;*putfield i
                                                ; - weird$CountByOne::getNext@7 (line 12)
                                                ; - weird::main@101 (line 42)
  0x00007fa35505f065: add    $0x1,%r13          ;*ladd

HOWEVER the third block is compiled differently. Without inlining and unrolling

This time the entire loop looks like this:

  0x00007fa35505f20c: xor    %r10d,%r10d
  0x00007fa35505f20f: xor    %r8d,%r8d          ;*lload
                                                ; - weird::main@171 (line 53)
  0x00007fa35505f212: mov    %r8d,0x10(%rsp)
  0x00007fa35505f217: mov    %r10,0x8(%rsp)
  0x00007fa35505f21c: mov    %rbp,%rsi
  0x00007fa35505f21f: callq  0x00007fa355037c60  ; OopMap{rbp=Oop off=580}
                                                ;*invokevirtual getNext
                                                ; - weird::main@174 (line 53)
                                                ;   {optimized virtual_call}
  0x00007fa35505f224: mov    0x8(%rsp),%r10
  0x00007fa35505f229: add    %rax,%r10          ;*ladd
                                                ; - weird::main@177 (line 53)
  0x00007fa35505f22c: mov    0x10(%rsp),%r8d
  0x00007fa35505f231: inc    %r8d               ;*iinc
                                                ; - weird::main@180 (line 52)
  0x00007fa35505f234: cmp    $0x5f5e100,%r8d
  0x00007fa35505f23b: jl     0x00007fa35505f212  ;*if_icmpge
                                                ; - weird::main@168 (line 52)

My guess is that the JIT identified that this part of the code is not used a lot, since it was using profiling information from the second block execution, and therefore did not optimize it heavily. Also the JIT appears to be lazy in a sense not to recompile one method after all relevant parts have been compiled. Remember the first compilation result did not contain source code for the second/third block AT all, so the JIT had to recompile that.

like image 163
Zulan Avatar answered Nov 15 '22 20:11

Zulan


You need to place each loop in a different method. The reason you need to do this is that the JIT collections statistics on how the code was run and this is used to optimise the code. However, a method is optimised after it is called 10000 time or a loop is run 10000 times.

In your case, the first loop trigger the whole method to be optimised, even though the later loops have not been run so no statistics have been collected. Place each loop in its own method and this won't happen.

like image 41
Peter Lawrey Avatar answered Nov 15 '22 19:11

Peter Lawrey