
Integer performance: 30-50x difference between x32 and x64 JVM?

Lately I ran into a very strange thing: one method was extremely slow under the profiler for no obvious reason. It contains a few operations on long and is invoked rather frequently; it accounted for around 30-40% of total program time, whereas other parts seem much 'heavier'.

I typically run non-memory-hungry programs on an x32 JVM, but suspecting a problem with the 64-bit type, I tried running the same program on an x64 JVM: overall performance in the 'live scenario' got 2-3 times better. After that I created JMH benchmarks for the operations from that method and was shocked by the difference between the x32 and x64 JVMs: up to 50 times.

I would 'accept' a roughly 2 times slower x32 JVM (smaller word size), but I have no clue where 30-50 times may come from. Can you explain this drastic difference?


Replies to comments:

  • I rewrote the test code to 'return something' and so avoid 'dead code elimination'. It did not change anything on 'x32', but some methods on 'x64' got significantly slower.
  • Both tests were run under '-client'. Running under '-server' had no noticeable effect.
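When comparing runs like these, it can help to confirm which VM actually executed the benchmark. The snippet below is a minimal sketch, not part of the original benchmark: the class name JvmInfoSketch is made up for illustration, and sun.arch.data.model is a HotSpot-specific property that may be missing on other JVMs. As far as the JDK 8 launcher documentation goes, the 64-bit JDK ignores -client and uses the Server VM, so java.vm.name is worth checking on both sides.

```java
// Hypothetical helper (not from the original post): prints which JVM is running.
final class JvmInfoSketch {

    // Builds a one-line description of the running VM from standard system properties.
    static String describe() {
        String vmName = System.getProperty("java.vm.name");   // e.g. a "Client VM" or "Server VM" name
        String arch = System.getProperty("os.arch");          // architecture as seen by the JVM
        // HotSpot-specific property; falls back to "unknown" when absent.
        String dataModel = System.getProperty("sun.arch.data.model", "unknown");
        return "vm=" + vmName + ", os.arch=" + arch + ", dataModel=" + dataModel;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with each JVM before the benchmark shows whether the two runs really differ only in bitness or also in the compiler (client vs. server) that was used.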

So it seems the answer to my question is:

  • The 'test code' was wrong: with 'no return value' it allowed the JVM to do 'dead code elimination' or some other optimization, and it appears that the 'x32 JVM' does fewer such optimizations than the 'x64 JVM'. That caused the significant 'false' difference between x32 and x64.
  • The perf difference on the 'correct test code' is up to 2x-5x, which seems reasonable.
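The difference between the 'wrong' and the 'fixed' test code can be sketched in plain Java as follows. This is only an illustration under assumptions: the class and method names are hypothetical, it is not JMH code, and whether a given JIT compiler really removes the first loop depends on the JVM and its settings.

```java
// Hypothetical illustration (not from the original benchmark) of a loop whose
// result is discarded versus one whose result is returned.
final class DeadCodeSketch {

    // The result r is never used, so a JIT compiler is free to drop the
    // computation, and possibly the whole loop, as dead code.
    static void deadLoop(int n) {
        int r = 0;
        for (int i = 0; i < n; i++) {
            r += i / 3;
        }
    }

    // Returning r makes the result observable, so the work has to be done.
    static int liveLoop(int n) {
        int r = 0;
        for (int i = 0; i < n; i++) {
            r += i / 3;
        }
        return r;
    }

    public static void main(String[] args) {
        deadLoop(10);
        System.out.println(liveLoop(10)); // 0+0+0+1+1+1+2+2+2+3 = 12
    }
}
```

In a JMH benchmark the same effect is usually achieved by returning the value from the @Benchmark method, which is what the fixed code does for most methods.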

Here are the results. (Note: '? 10??' stands for special characters that were not printed on Windows; it is a value below 0.001 s/op written in scientific notation as 10e-??.)

x32 1.8.0_152

Benchmark                Mode  Score Units    Score (with 'return')
IntVsLong.cycleInt       avgt  0.035  s/op    0.034   (?x slower vs. x64)
IntVsLong.cycleLong      avgt  0.106  s/op    0.099   (3x slower vs. x64) 
IntVsLong.divDoubleInt   avgt  0.462  s/op    0.459
IntVsLong.divDoubleLong  avgt  1.658  s/op    1.724   (2x slower vs. x64)
IntVsLong.divInt         avgt  0.335  s/op    0.373
IntVsLong.divLong        avgt  1.380  s/op    1.399
IntVsLong.l2i            avgt  0.101  s/op    0.197   (3x slower vs. x64)  
IntVsLong.mulInt         avgt  0.067  s/op    0.068
IntVsLong.mulLong        avgt  0.278  s/op    0.337   (5x slower vs. x64)
IntVsLong.subInt         avgt  0.067  s/op    0.067   (?x slower vs. x64)
IntVsLong.subLong        avgt  0.243  s/op    0.300   (4x slower vs. x64)

x64 1.8.0_152

Benchmark                Mode  Score Units    Score (with 'return')
IntVsLong.cycleInt       avgt ? 10??  s/op   ? 10??
IntVsLong.cycleLong      avgt  0.035  s/op    0.034
IntVsLong.divDoubleInt   avgt  0.045  s/op    0.788 (was dead)
IntVsLong.divDoubleLong  avgt  0.033  s/op    0.787 (was dead)
IntVsLong.divInt         avgt ? 10??  s/op    0.302 (was dead)
IntVsLong.divLong        avgt  0.046  s/op    1.098 (was dead)
IntVsLong.l2i            avgt  0.037  s/op    0.067
IntVsLong.mulInt         avgt ? 10??  s/op    0.052 (was dead)
IntVsLong.mulLong        avgt  0.040  s/op    0.067
IntVsLong.subInt         avgt ? 10??  s/op   ? 10??
IntVsLong.subLong        avgt  0.075  s/op    0.082
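The '(Nx slower vs. x64)' annotations in the x32 table can be reproduced from the 'with return' columns of the two tables. The sketch below only redoes that arithmetic; the class and method names are made up for illustration, and the numbers are copied from the tables above.

```java
// Hypothetical helper: recomputes the rounded x32/x64 slowdown factors
// from the "Score (with 'return')" columns above.
final class SlowdownSketch {

    // Ratio of the x32 score to the x64 score, rounded to the nearest integer.
    static long slowdown(double x32SecondsPerOp, double x64SecondsPerOp) {
        return Math.round(x32SecondsPerOp / x64SecondsPerOp);
    }

    public static void main(String[] args) {
        System.out.println("cycleLong     " + slowdown(0.099, 0.034));
        System.out.println("divDoubleLong " + slowdown(1.724, 0.787));
        System.out.println("l2i           " + slowdown(0.197, 0.067));
        System.out.println("mulLong       " + slowdown(0.337, 0.067));
        System.out.println("subLong       " + slowdown(0.300, 0.082));
    }
}
```

For cycleInt and subInt the x64 score is below the printed resolution, which is why no factor is given for them.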

And here is the (fixed) benchmark code

import org.openjdk.jmh.annotations.Benchmark;

public class IntVsLong {

    public static int N_REPEAT_I  = 100_000_000;
    public static long N_REPEAT_L = 100_000_000;

    public static int CONST_I = 3;
    public static long CONST_L = 3;
    public static double CONST_D = 3;

    @Benchmark
    public void cycleInt() throws InterruptedException {
        for( int i = 0; i < N_REPEAT_I; i++ ) {
        }
    }

    @Benchmark
    public void cycleLong() throws InterruptedException {
        for( long i = 0; i < N_REPEAT_L; i++ ) {
        }
    }

    @Benchmark
    public int divInt() throws InterruptedException {
        int r = 0;
        for( int i = 0; i < N_REPEAT_I; i++ ) {
            r += i / CONST_I;
        }
        return r;
    }

    @Benchmark
    public long divLong() throws InterruptedException {
        long r = 0;
        for( long i = 0; i < N_REPEAT_L; i++ ) {
            r += i / CONST_L;
        }
        return r;
    }

    @Benchmark
    public double divDoubleInt() throws InterruptedException {
        double r = 0;
        // Note: i is an int but the bound is the long N_REPEAT_L, so each comparison widens i to long.
        for( int i = 1; i < N_REPEAT_L; i++ ) {
            r += CONST_D / i;
        }
        return r;
    }

    @Benchmark
    public double divDoubleLong() throws InterruptedException {
        double r = 0;
        for( long i = 1; i < N_REPEAT_L; i++ ) {
            r += CONST_D / i;
        }
        return r;
    }

    @Benchmark
    public int mulInt() throws InterruptedException {
        int r = 0;
        for( int i = 0; i < N_REPEAT_I; i++ ) {
            r += i * CONST_I;
        }
        return r;
    }

    @Benchmark
    public long mulLong() throws InterruptedException {
        long r = 0;
        for( long i = 0; i < N_REPEAT_L; i++ ) {
            r += i * CONST_L;
        }
        return r;
    }

    @Benchmark
    public int subInt() throws InterruptedException {
        int r = 0;
        for( int i = 0; i < N_REPEAT_I; i++ ) {
            r += i - r;
        }
        return r;
    }

    @Benchmark
    public long subLong() throws InterruptedException {
        long r = 0;
        for( long i = 0; i < N_REPEAT_L; i++ ) {
            r += i - r;
        }
        return r;
    }

    @Benchmark
    public long l2i() throws InterruptedException {
        int r = 0;
        for( long i = 0; i < N_REPEAT_L; i++ ) {
            r += (int)i;
        }
        return r;
    }

}
Xtra Coder, asked Jul 02 '18 10:07



1 Answer

There are a lot of variables to inspect.

Looking only at the processor: in 64-bit mode it can do more work on the CPU registers in a single step, because each register holds eight octets instead of four. A 64-bit value such as a Java long then fits in one register, whereas 32-bit code has to split it across two. This improves the performance of operations and of memory access. Also, some CPUs enable advanced features only when operating in 64-bit mode.
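To make the register-width point concrete, here is a rough sketch of what adding two long values costs when only 32-bit operations are available: the low halves are added first, then the high halves plus a carry. The class and method names are hypothetical, and the code only mimics the idea in Java (splitting and recombining the halves uses long operations itself); it is not what any particular JIT compiler emits.

```java
// Hypothetical illustration: a 64-bit addition expressed with 32-bit int arithmetic.
final class LongOnInt32Sketch {

    static long addViaHalves(long a, long b) {
        // Split both operands into low and high 32-bit words.
        int aLo = (int) a, aHi = (int) (a >>> 32);
        int bLo = (int) b, bHi = (int) (b >>> 32);

        // Step 1: add the low words (may wrap around).
        int lo = aLo + bLo;
        // Step 2: a wrap-around of the low word means a carry into the high word.
        int carry = Integer.compareUnsigned(lo, aLo) < 0 ? 1 : 0;
        // Step 3: add the high words plus the carry.
        int hi = aHi + bHi + carry;

        // Recombine the two words into one 64-bit result.
        return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long a = 123456789012345L, b = 987654321098765L;
        System.out.println(addViaHalves(a, b) == a + b);
    }
}
```

A single long addition therefore turns into several dependent 32-bit steps, which fits a slowdown of a few times for long arithmetic on a 32-bit JVM rather than 30-50 times.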

Going one level up: if you are using the same CPU for both tests, take into consideration that to execute 32-bit instructions the CPU has to operate in a compatibility (protected) mode, which may run slower than a real 32-bit CPU. Also, some instruction set extensions that could speed up certain operations, like SSE (SIMD) or AVX, might not be fully usable in 32-bit mode.

Going up one more level: if you are using a modern OS like Windows 10, take into consideration that the OS runs 32-bit applications through WOW64 (the x86 emulator).

Helpful docs:

  • Running 32 bit Applications on Windows 64 Bit
  • Wikipedia X86-64 (See about operation modes)
Dubas, answered Sep 27 '22 20:09
