Lately I've got very strange thing - one method was extremely slow under profiler without obvious reason for that. It contains few operations with long
, but is invoked rather frequently - its overall usage was around 30-40% of total program time whereas other parts seem much 'heavier'.
I typically run non-memory-hungry programs on x32 JVM, but assuming I've got problem with 64-bit type I tried running the same on x64 JVM - overall performance in 'live scenario' got 2-3 time better. After that I've created JMH benchmarks for operations from particular method and was shocked by the difference on x32 and x64 JVMs - up to 50 times.
I would 'accept' roughly 2 times slower x32 JVM (smaller word size), but I have no clues where 30-50 times may come from. Can you explain that drastic difference?
Replies to comments:
So it seems answer for my question is
Here are the results (Note: ? 10??
are special characters not printed on Windows - it is something below 0.001 s/op written in scientific notation as 10e-??)
x32 1.8.0_152
Benchmark Mode Score Units Score (with 'return')
IntVsLong.cycleInt avgt 0.035 s/op 0.034 (?x slower vs. x64)
IntVsLong.cycleLong avgt 0.106 s/op 0.099 (3x slower vs. x64)
IntVsLong.divDoubleInt avgt 0.462 s/op 0.459
IntVsLong.divDoubleLong avgt 1.658 s/op 1.724 (2x slower vs. x64)
IntVsLong.divInt avgt 0.335 s/op 0.373
IntVsLong.divLong avgt 1.380 s/op 1.399
IntVsLong.l2i avgt 0.101 s/op 0.197 (3x slower vs. x64)
IntVsLong.mulInt avgt 0.067 s/op 0.068
IntVsLong.mulLong avgt 0.278 s/op 0.337 (5x slower vs. x64)
IntVsLong.subInt avgt 0.067 s/op 0.067 (?x slower vs. x64)
IntVsLong.subLong avgt 0.243 s/op 0.300 (4x slower vs. x64)
x64 1.8.0_152
Benchmark Mode Score Units Score (with 'return')
IntVsLong.cycleInt avgt ? 10?? s/op ? 10??
IntVsLong.cycleLong avgt 0.035 s/op 0.034
IntVsLong.divDoubleInt avgt 0.045 s/op 0.788 (was dead)
IntVsLong.divDoubleLong avgt 0.033 s/op 0.787 (was dead)
IntVsLong.divInt avgt ? 10?? s/op 0.302 (was dead)
IntVsLong.divLong avgt 0.046 s/op 1.098 (was dead)
IntVsLong.l2i avgt 0.037 s/op 0.067
IntVsLong.mulInt avgt ? 10?? s/op 0.052 (was dead)
IntVsLong.mulLong avgt 0.040 s/op 0.067
IntVsLong.subInt avgt ? 10?? s/op ? 10??
IntVsLong.subLong avgt 0.075 s/op 0.082
And here is the (fixed) benchmark code
import org.openjdk.jmh.annotations.Benchmark;
public class IntVsLong {
public static int N_REPEAT_I = 100_000_000;
public static long N_REPEAT_L = 100_000_000;
public static int CONST_I = 3;
public static long CONST_L = 3;
public static double CONST_D = 3;
@Benchmark
public void cycleInt() throws InterruptedException {
for( int i = 0; i < N_REPEAT_I; i++ ) {
}
}
@Benchmark
public void cycleLong() throws InterruptedException {
for( long i = 0; i < N_REPEAT_L; i++ ) {
}
}
@Benchmark
public int divInt() throws InterruptedException {
int r = 0;
for( int i = 0; i < N_REPEAT_I; i++ ) {
r += i / CONST_I;
}
return r;
}
@Benchmark
public long divLong() throws InterruptedException {
long r = 0;
for( long i = 0; i < N_REPEAT_L; i++ ) {
r += i / CONST_L;
}
return r;
}
@Benchmark
public double divDoubleInt() throws InterruptedException {
double r = 0;
for( int i = 1; i < N_REPEAT_L; i++ ) {
r += CONST_D / i;
}
return r;
}
@Benchmark
public double divDoubleLong() throws InterruptedException {
double r = 0;
for( long i = 1; i < N_REPEAT_L; i++ ) {
r += CONST_D / i;
}
return r;
}
@Benchmark
public int mulInt() throws InterruptedException {
int r = 0;
for( int i = 0; i < N_REPEAT_I; i++ ) {
r += i * CONST_I;
}
return r;
}
@Benchmark
public long mulLong() throws InterruptedException {
long r = 0;
for( long i = 0; i < N_REPEAT_L; i++ ) {
r += i * CONST_L;
}
return r;
}
@Benchmark
public int subInt() throws InterruptedException {
int r = 0;
for( int i = 0; i < N_REPEAT_I; i++ ) {
r += i - r;
}
return r;
}
@Benchmark
public long subLong() throws InterruptedException {
long r = 0;
for( long i = 0; i < N_REPEAT_L; i++ ) {
r += i - r;
}
return r;
}
@Benchmark
public long l2i() throws InterruptedException {
int r = 0;
for( long i = 0; i < N_REPEAT_L; i++ ) {
r += (int)i;
}
return r;
}
}
There's a lot of variables to inspect.
If we look only to the processor using 64 bit you can address more operations to the CPU registers in the same step as it uses eitht octets instead of four ocets per registry. This increases the performance of operations and the memory allocation. Also some CPU only enable advanced functions only operating in 64 Bit mode
Going upper if your are using the same CPU to perform the tests, you need to take in consideration that to execute the 32 bit instructions the CPU needs to operate in virtual mode or protected mode that runs slowly that a real 32 bit CPU. Also some of the instruction set extensions probably could not be enabled using 32 bit mode like SSE-SIMD or AVX taht could increase some operations speed.
Also going upper if you are using a modern OS like Windows 10 you need to take in consideration that the OS runs 32 bit applications using WOW64 (x86 Emulator)
Helping doc:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With