Java Loops Optimization

Given the following (straightforward) code:

public class pr1 {

    public static void f1(){
        long sx = 0, s;
        s = System.currentTimeMillis();
        for(long i = 0; i < Integer.MAX_VALUE; ++i){
            sx += i;
        }
        System.out.println("f1(): " + (System.currentTimeMillis() - s));
    }

    public static void f2(){
        long sx = 0, s, i;
        s = System.currentTimeMillis();
        i = Integer.MAX_VALUE;
        while(i-->0){
            sx+=i;
        }
        sx += Integer.MAX_VALUE;
        System.out.println("f2(): " + (System.currentTimeMillis() - s));
    }

    public static void f3(){
        long sx = 0, s, i;
        s = System.currentTimeMillis();
        i = Integer.MAX_VALUE;
        while(--i>0){
            sx+=i;
        }
        sx += Integer.MAX_VALUE;
        System.out.println("f3(): " + (System.currentTimeMillis() - s));
    }

    public static void f4(){
        long sx = 0, s, i;
        s = System.currentTimeMillis();
        i = Integer.MAX_VALUE;
        do{
            sx+=i;
        }while(--i>0);
        System.out.println("f4(): " + (System.currentTimeMillis() - s));
    }

    public static void main(String args[]){
        f1();
        f2();
        f3();
        f4();
    }
}

And the actual results after running the code:

f1(): 5828
f2(): 8125
f3(): 3406
f4(): 3781

Can you please explain the big time differences? In theory the loops achieve the same thing, but in practice there seems to be a significant time difference between the four versions.

After repeated executions the results stay very much the same.

LATER EDIT: As another test, I've rewritten the main method:

public static void main(String args[]){
    for(int i = 0; i < 4; ++i){
        f1(); f2(); f3(); f4();
    }
}

And the new results are:

f1(): 5906
f2(): 8266
f3(): 3406
f4(): 3844
f1(): 5843
f2(): 8125
f3(): 3438
f4(): 3859
f1(): 5891
f2(): 8156
f3(): 3406
f4(): 3813
f1(): 5859
f2(): 8172
f3(): 3438
f4(): 3828

And for 10 repetitions:

f1(): 5844
f2(): 8156
f3(): 3453
f4(): 3813
f1(): 5844
f2(): 8218
f3(): 3485
f4(): 3937
f1(): 5985
f2(): 8156
f3(): 3422
f4(): 3781
f1(): 5828
f2(): 8234
f3(): 3469
f4(): 3828
f1(): 5844
f2(): 8328
f3(): 3422
f4(): 3859
f1(): 5844
f2(): 8188
f3(): 3406
f4(): 3797
f1(): 5906
f2(): 8219
f3(): 3422
f4(): 3797
f1(): 5843
f2(): 8203
f3(): 3454
f4(): 3906
f1(): 5844
f2(): 8140
f3(): 3469
f4(): 3812
f1(): 5860
f2(): 8109
f3(): 3422
f4(): 3813

After removing the calculation from inside the loops, the results are still a little different:

public class pr2 {

    public static void f1(){
        long sx = 0, s;
        s = System.currentTimeMillis();
        for(long i = 0; i < Integer.MAX_VALUE; ++i);
        System.out.println("f1(): " + (System.currentTimeMillis() - s));
    }

    public static void f2(){
        long sx = 0, s, i;
        s = System.currentTimeMillis();
        i = Integer.MAX_VALUE;
        while(i-->0);
        System.out.println("f2(): " + (System.currentTimeMillis() - s));
    }

    public static void f3(){
        long sx = 0, s, i;
        s = System.currentTimeMillis();
        i = Integer.MAX_VALUE;
        while(--i>0);
        System.out.println("f3(): " + (System.currentTimeMillis() - s));
    }

    public static void f4(){
        long sx = 0, s, i;
        s = System.currentTimeMillis();
        i = Integer.MAX_VALUE;
        do{
        }while(--i>0);
        System.out.println("f4(): " + (System.currentTimeMillis() - s));
    }

    public static void main(String args[]){
        for(int i = 0; i < 2; ++i){
            f1(); f2(); f3(); f4();
        }
    }
}

But the time difference still exists:

f1(): 3219
f2(): 4859
f3(): 2610
f4(): 3031
f1(): 3219
f2(): 4812
f3(): 2610
f4(): 3062

JVM:

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

LATER EDIT: For the first version, I've compiled with the -O parameter for javac. The new results are:

f1(): 3219
f2(): 4859
f3(): 2610
f4(): 3031

LATER EDIT

Ok, I've tried the same code at home, using a Linux machine with:

java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (6b18-1.8-0ubuntu1)
OpenJDK Server VM (build 14.0-b16, mixed mode)

And the results were "normal". No problems now:

f1(): 7495
f2(): 7418
f3(): 7457
f4(): 7384
Asked Jul 20 '10 by Andrei Ciobanu


4 Answers

You're actually benchmarking the JVM, not the code.

See also:

  • What is microbenchmarking?
  • I write a simple loop to time a simple operation and it's slow. What am I doing wrong?
  • How to write a good microbenchmark?

Update: OK, that was a bit of a curt answer. The loop using the postfix operator (i--) seems to be slower than the loops using the prefix operator (--i). That makes sense: with the postfix operator the value is changed during evaluation of the expression, yet the compiler still needs to hold a copy of the original value to use in the expression. The prefix operator avoids the need to hold that copy, because only the changed value is used in the expression.
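To illustrate the semantic difference (a minimal sketch, not code from the question): the postfix form evaluates to the value before the decrement, so that old value has to be kept around, while the prefix form evaluates to the value after the decrement.

public class PrefixVsPostfix {
    public static void main(String[] args) {
        long i = 3;
        long post = i--; // post == 3: the old value is copied first, then i becomes 2
        System.out.println("post = " + post + ", i = " + i);

        long j = 3;
        long pre = --j;  // pre == 2: j is decremented first, no copy of the old value is needed
        System.out.println("pre = " + pre + ", j = " + j);
    }
}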

See also:

  • Prefer prefix operators over postfix

After all, this micro-optimization would save you at most one or two seconds over 2³¹ iterations. Are you really executing it that often in practice? I'd prefer readability over premature optimization.

Answered by BalusC


When I run this code on my JVM (Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)), all four functions give similar results:

f1(): 3234
f2(): 3132
f3(): 3114
f4(): 3089

I would guess that your JVM isn't applying the same optimizations somewhere.

You can examine the bytecode generated for the different functions using javap: javap -l -c pr1. When I do this, I get the following for f2():

public static void f2();
  Code:
   0:   lconst_0
   1:   lstore_0
   2:   invokestatic    #2; //Method java/lang/System.currentTimeMillis:()J
   5:   lstore_2
   6:   ldc2_w  #3; //long 2147483647l
   9:   lstore  4
   11:  lload   4
   13:  dup2
   14:  lconst_1
   15:  lsub
   16:  lstore  4
   18:  lconst_0
   19:  lcmp
   20:  ifle    31
   23:  lload_0
   24:  lload   4
   26:  ladd
   27:  lstore_0
   28:  goto    11
   31:  lload_0
   32:  ldc2_w  #3; //long 2147483647l
   35:  ladd
   36:  lstore_0
   37:  getstatic       #5; //Field java/lang/System.out:Ljava/io/PrintStream;
   40:  new     #6; //class java/lang/StringBuilder
   43:  dup
   44:  invokespecial   #7; //Method java/lang/StringBuilder."<init>":()V
   47:  ldc     #13; //String f2():
   49:  invokevirtual   #9; //Method java/lang/StringBuilder.append:(Ljava/lang/String;)Ljava/lang/StringBuilder;
   52:  invokestatic    #2; //Method java/lang/System.currentTimeMillis:()J
   55:  lload_2
   56:  lsub
   57:  invokevirtual   #10; //Method java/lang/StringBuilder.append:(J)Ljava/lang/StringBuilder;
   60:  invokevirtual   #11; //Method java/lang/StringBuilder.toString:()Ljava/lang/String;
   63:  invokevirtual   #12; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
   66:  return

One possible reason that f2() is slower may be that the compiler/JVM isn't being smart about the while(i-->0) post-decrement operator. Basically, you need the value of i both before and after the decrement, so this operation involves more work if implemented naively.
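Roughly, a naive translation of while(i-->0) has to keep the pre-decrement value around just for the comparison, which is what the dup2 in the bytecode above is doing. A hand-written expansion (only a sketch of what the bytecode does, not actual compiler output) looks like this:

// Hand expansion of:  while (i-- > 0) { sx += i; }
long sx = 0;
long i = Integer.MAX_VALUE;
while (true) {
    long old = i;    // keep a copy of the current value (the dup2 above)
    i = i - 1;       // store the decremented value back (lsub / lstore)
    if (old <= 0) {  // but compare against the *old* value (lcmp / ifle)
        break;
    }
    sx += i;         // the body sees the already-decremented i
}

The prefix form --i>0 needs no such copy: the decremented value is both stored and compared.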

Answered by Avi


After multiple runs the HotSpot compiler is likely optimizing each method. This takes time and introduces variation. But since the loops all work in about the same way, it seems reasonable that the times would eventually become similar.
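A rough way to account for that (just a sketch, not a proper harness such as JMH) is to run all four methods a few times as a warm-up before paying attention to the printed timings, so HotSpot has already compiled the hot loops by the time you start comparing numbers:

public static void main(String[] args) {
    // Warm-up: give HotSpot a chance to JIT-compile the loops first.
    for (int i = 0; i < 3; ++i) {
        f1(); f2(); f3(); f4();
    }
    System.out.println("--- measured runs ---");
    // Only the timings printed after this point are (somewhat) comparable.
    for (int i = 0; i < 3; ++i) {
        f1(); f2(); f3(); f4();
    }
}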

Answered by seand


I suspect that this has something to do with doing 64-bit arithmetic on a machine with a 32-bit ALU. I suspect that certain combinations of test before/after increment/decrement take longer at the native instruction level due to subtle pipelining effects. The fact that someone reported the numbers were flat on a 64-bit machine supports this theory. The way to confirm it would be to get a dump of the native code generated by the JIT compiler, get hold of the documentation for your particular CPU, and figure out where the clock cycles are going.

But to be honest, I don't know if this is worthwhile. We have clear evidence that your micro-benchmark numbers are CPU dependent, and the "work" done is clearly non-representative. (Why would you use a long loop counter on a 32-bit machine?)
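One way to probe that theory (a hypothetical variant, not part of the original question) is to keep the 64-bit accumulator but switch the loop counter to an int, which fits the 32-bit ALU:

public static void f1Int() {
    long sx = 0, s = System.currentTimeMillis();
    for (int i = 0; i < Integer.MAX_VALUE; ++i) {  // 32-bit counter
        sx += i;                                   // only the accumulator is 64-bit
    }
    System.out.println("f1Int(): " + (System.currentTimeMillis() - s) + " (sx=" + sx + ")");
}

If the long counter is indeed the problem, this variant should run noticeably faster and the gap between the loop styles should shrink.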

And I'm also a bit surprised that the JIT compiler didn't figure out that the loops can (in each case) be completely optimized away.
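If it did, the empty loops in pr2 (and even the summing loops in pr1, since sx is never read) could be removed entirely and all timings would collapse to roughly zero. The usual trick in hand-rolled micro-benchmarks (sketched below; a framework such as JMH does this with its Blackhole) is to make the result observable so the loop cannot be treated as dead code:

public static void f1() {
    long sx = 0, s = System.currentTimeMillis();
    for (long i = 0; i < Integer.MAX_VALUE; ++i) {
        sx += i;
    }
    // Printing sx makes the result observable, so a smarter JIT cannot
    // simply eliminate the loop as dead code.
    System.out.println("f1(): " + (System.currentTimeMillis() - s) + " (sx=" + sx + ")");
}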

Answered by Stephen C