Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java for-loop optimization

I made some runtime tests with java for loops and recognized a strange behaviour. For my code I need wrapper objects for primitive types like int, double and so on, to simulate io and output parameters, but thats not the point. Just watch my code. How can objects with field access be faster then primitive types?

for loop with prtimitive type:

public static void main(String[] args) {
    double max = 1000;
    for (int j = 1; j < 8; j++) {
        double i;
        max = max * 10;
        long start = System.nanoTime();
        for (i = 0; i < max; i++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime primitive(max: ="+max + "): " + microseconds);
    }
}

Result:

MicroTime primitive(max: =10000.0): 110
MicroTime primitive(max: =100000.0): 1081
MicroTime primitive(max: =1000000.0): 2450
MicroTime primitive(max: =1.0E7): 28248
MicroTime primitive(max: =1.0E8): 276205
MicroTime primitive(max: =1.0E9): 2729824
MicroTime primitive(max: =1.0E10): 27547009

for loop with simple type (wrapper object):

public static void main(String[] args) {
    HDouble max = new HDouble();
    max.value = 1000;
    for (int j = 1; j < 8; j++) {
        HDouble i = new HDouble();
        max.value = max.value*10;
        long start = System.nanoTime();
        for (i.value = 0; i.value <max.value; i.value++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime wrapper(max: ="+max.value + "): " + microseconds);
    }
}

Result:

MicroTime wrapper(max: =10000.0): 157
MicroTime wrapper(max: =100000.0): 1561
MicroTime wrapper(max: =1000000.0): 3174
MicroTime wrapper(max: =1.0E7): 15630
MicroTime wrapper(max: =1.0E8): 155471
MicroTime wrapper(max: =1.0E9): 1520967
MicroTime wrapper(max: =1.0E10): 15373311

The more iterations, the faster is the second code. But why? I know that the java-compiler and jvm are optimizing my code, but I never thought that primitive types can be slower, than objects with field access.
Does anyone have a plausible explanation for it?

Edited: HDouble class:

public class HDouble {
    public double value;

    public HDouble() {
    }

    public HDouble(double value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}

I also tested my loops with code in it. For example I calculate the sum -> same behaviour (the difference is not that big, but I thought the primitive algorithm have to be much faster?). First I thought, that the calculation takes that long, that the field access nearly no difference.

Wrapper for-loop:

for (i.value = 0; i.value <max.value; i.value++) {
    sum.value = sum.value + i.value;
}

Result:

MicroTime wrapper(max: =10000.0): 243
MicroTime wrapper(max: =100000.0): 2805
MicroTime wrapper(max: =1000000.0): 3409
MicroTime wrapper(max: =1.0E7): 28104
MicroTime wrapper(max: =1.0E8): 278432
MicroTime wrapper(max: =1.0E9): 2678322
MicroTime wrapper(max: =1.0E10): 26665540

Primitive for-loop:

for (i = 0; i < max; i++) {
    sum = sum + i;
}

Result:

MicroTime primitive(max: =10000.0): 149
MicroTime primitive(max: =100000.0): 1996
MicroTime primitive(max: =1000000.0): 2289
MicroTime primitive(max: =1.0E7): 27085
MicroTime primitive(max: =1.0E8): 279939
MicroTime primitive(max: =1.0E9): 2759133
MicroTime primitive(max: =1.0E10): 27369724

like image 373
user3490546 Avatar asked Dec 18 '22 22:12

user3490546


1 Answers

It's so easy to get fooled by hand-made microbenchmarks - you never know what they actually measure. That's why there are special tools like JMH. But let's analyze what happens to the primitive hand-made benchmark:

static class HDouble {
    double value;
}

public static void main(String[] args) {
    primitive();
    wrapper();
}

public static void primitive() {
    long start = System.nanoTime();
    for (double d = 0; d < 1000000000; d++) {
    }
    long end = System.nanoTime();
    System.out.printf("Primitive: %.3f s\n", (end - start) / 1e9);
}

public static void wrapper() {
    HDouble d = new HDouble();
    long start = System.nanoTime();
    for (d.value = 0; d.value < 1000000000; d.value++) {
    }
    long end = System.nanoTime();
    System.out.printf("Wrapper:   %.3f s\n", (end - start) / 1e9);
}

The results are somewhat similar to yours:

Primitive: 3.618 s
Wrapper:   1.380 s

Now repeat the test several times:

public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
        primitive();
        wrapper();
    }
}

It gets more interesting:

Primitive: 3.661 s
Wrapper:   1.382 s
Primitive: 3.461 s
Wrapper:   1.380 s
Primitive: 1.376 s <-- starting from 3rd iteration
Wrapper:   1.381 s <-- the timings become equal
Primitive: 1.371 s
Wrapper:   1.372 s
Primitive: 1.379 s
Wrapper:   1.378 s

Looks like both methods got finally optimized. Run it once again, now with logging JIT compiler activity: -XX:-TieredCompilation -XX:CompileOnly=Test -XX:+PrintCompilation

    136    1 %           Test::primitive @ 6 (53 bytes)
   3725    1 %           Test::primitive @ -2 (53 bytes)   made not entrant
Primitive: 3.589 s
   3748    2 %           Test::wrapper @ 17 (73 bytes)
   5122    2 %           Test::wrapper @ -2 (73 bytes)   made not entrant
Wrapper:   1.374 s
   5122    3             Test::primitive (53 bytes)
   5124    4 %           Test::primitive @ 6 (53 bytes)
Primitive: 3.421 s
   8544    5             Test::wrapper (73 bytes)
   8547    6 %           Test::wrapper @ 17 (73 bytes)
Wrapper:   1.378 s
Primitive: 1.372 s
Wrapper:   1.375 s
Primitive: 1.378 s
Wrapper:   1.373 s
Primitive: 1.375 s
Wrapper:   1.378 s

Note % sign in the compilation log on the first iteration. It means that the methods were compiled in OSR (on-stack replacement) mode. During the second iteration the methods were recompiled in normal mode. Since then, starting from the third iteration, there was no difference between primitive and wrapper in execution speed.

What you've actually measured is the performance of OSR stub. It is usually not related to the real performance of an application and you shouldn't care much about it.

But the question still remains, why OSR stub for a wrapper is compiled better than for a primitive variable? To find this out we need to get down to generated assembly code:
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

I'll omit all unrelevant code leaving only the compiled loop.

Primitive:

0x00000000023e90d0: vmovsd 0x28(%rsp),%xmm1      <-- load double from the stack
0x00000000023e90d6: vaddsd -0x7e(%rip),%xmm1,%xmm1
0x00000000023e90de: test   %eax,-0x21f90e4(%rip)
0x00000000023e90e4: vmovsd %xmm1,0x28(%rsp)      <-- store to the stack
0x00000000023e90ea: vucomisd 0x28(%rsp),%xmm0    <-- compare with the stack value
0x00000000023e90f0: ja     0x00000000023e90d0

Wrapper:

0x00000000023ebe90: vaddsd -0x78(%rip),%xmm0,%xmm0
0x00000000023ebe98: vmovsd %xmm0,0x10(%rbx)      <-- store to the object field
0x00000000023ebe9d: test   %eax,-0x21fbea3(%rip)
0x00000000023ebea3: vucomisd %xmm0,%xmm1         <-- compare registers
0x00000000023ebea7: ja     0x00000000023ebe90

As you can see, the 'primitive' case makes a number of loads and stores to a stack location while 'wrapper' does mostly in-register operations. It is quite understandable why OSR stub refers to stack: in the interpreted mode local variables are stored on the stack, and OSR stub is made compatible with this interpreted frame. In a 'wrapper' case the value is stored on the heap, and the reference to the object is already cached in a register.

like image 108
apangin Avatar answered Jan 09 '23 19:01

apangin