The question had been asked by another SO member, but was, disappointingly, deleted. The comments said the measurements were flawed and made no sense.
However, I was able to reproduce the original problem with a small benchmark under JMH:
package bench;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.*;
import org.openjdk.jmh.runner.options.*;

import java.util.concurrent.*;

@State(Scope.Benchmark)
public class LoopInc {

    private int getValue() {
        return ThreadLocalRandom.current().nextInt(2);
    }

    @Benchmark
    public int directInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            switch (getValue()) {
                case 0:
                    break;
                case 1:
                    result++;
                    break;
            }
        }
        return result;
    }

    @Benchmark
    public int indirectInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            boolean incr = false;
            switch (getValue()) {
                case 0:
                    break;
                case 1:
                    incr = true;
                    break;
            }
            if (incr) {
                result++;
            }
        }
        return result;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include("bench.LoopInc.*")
                .warmupIterations(5)
                .measurementIterations(10)
                .forks(3)
                .timeUnit(TimeUnit.MILLISECONDS)
                .build();
        new Runner(options).run();
    }
}
The benchmark shows indirectInc running about 3 times faster, though the "optimization" is not obvious at all. If anything, one would assume indirectInc should be a bit slower because it involves an extra intermediate operation.
Benchmark Mode Cnt Score Error Units
LoopInc.directInc thrpt 30 127,301 ± 0,202 ops/ms
LoopInc.indirectInc thrpt 30 378,147 ± 1,144 ops/ms
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)
What causes the JIT to compile indirectInc so much better than the very similar directInc?
Ok, this is how you approach these things.
Try to reproduce it. Okay, it reproduces:
Benchmark Mode Cnt Score Error Units
LoopInc.directInc thrpt 15 175.678 ± 1.118 ops/ms
LoopInc.indirectInc thrpt 15 641.413 ± 9.722 ops/ms
Try to see the generated assembly with -prof perfasm. Long story short, it produces a lot of generated code, so we probably want to limit loop unrolling. But unrolling itself can affect the performance, and could well be the cause here. So, let's re-run with -XX:LoopUnrollLimit=1. Okay, the score is lower, but the difference is still there, excellent:
Benchmark Mode Cnt Score Error Units
LoopInc.directInc thrpt 15 161.147 ± 6.101 ops/ms
LoopInc.indirectInc thrpt 15 489.430 ± 1.698 ops/ms
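(By the way, both the unroll limit and the profilers can be wired straight into the Runner, instead of being passed on the command line. A minimal sketch, assuming a JMH version whose options builder exposes the string form of addProfiler; otherwise just pass -prof perfasm when launching the benchmark jar:)

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ProfiledRun {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include("bench.LoopInc.*")
                .jvmArgsAppend("-XX:LoopUnrollLimit=1") // tame unrolling so the assembly stays readable
                .addProfiler("perfasm")                 // swap in "perfnorm" for normalized HW counters
                .build();
        new Runner(options).run();
    }
}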
Take another look at the generated code; still nothing pops out to the eye. Well, this seems interesting. Let's get at this properly. Can we characterize the workload? Of course we can, with the help of -prof perfnorm, which normalizes the hardware counters per benchmark op. Let's see:
Benchmark Mode Cnt Score Error Units
LoopInc.directInc thrpt 15 161.875 ± 3.038 ops/ms
LoopInc.directInc:·CPI thrpt 3 0.967 ± 0.196 #/op
LoopInc.directInc:·L1-dcache-load-misses thrpt 3 0.394 ± 3.663 #/op
LoopInc.directInc:·L1-dcache-loads thrpt 3 2149.594 ± 228.166 #/op
LoopInc.directInc:·L1-dcache-store-misses thrpt 3 0.114 ± 1.001 #/op
LoopInc.directInc:·L1-dcache-stores thrpt 3 1073.666 ± 96.066 #/op
LoopInc.directInc:·L1-icache-load-misses thrpt 3 0.965 ± 22.984 #/op
LoopInc.directInc:·LLC-loads thrpt 3 0.204 ± 2.763 #/op
LoopInc.directInc:·LLC-stores thrpt 3 0.060 ± 0.633 #/op
LoopInc.directInc:·branch-misses thrpt 3 536.068 ± 43.293 #/op
LoopInc.directInc:·branches thrpt 3 3728.890 ± 220.539 #/op
LoopInc.directInc:·cycles thrpt 3 26219.146 ± 6287.590 #/op
LoopInc.directInc:·dTLB-load-misses thrpt 3 0.063 ± 0.124 #/op
LoopInc.directInc:·dTLB-loads thrpt 3 2136.942 ± 165.990 #/op
LoopInc.directInc:·dTLB-store-misses thrpt 3 0.022 ± 0.029 #/op
LoopInc.directInc:·dTLB-stores thrpt 3 1084.787 ± 417.281 #/op
LoopInc.directInc:·iTLB-load-misses thrpt 3 0.081 ± 0.333 #/op
LoopInc.directInc:·iTLB-loads thrpt 3 3.623 ± 19.955 #/op
LoopInc.directInc:·instructions thrpt 3 27114.052 ± 1843.720 #/op
LoopInc.indirectInc thrpt 15 489.164 ± 2.692 ops/ms
LoopInc.indirectInc:·CPI thrpt 3 0.281 ± 0.015 #/op
LoopInc.indirectInc:·L1-dcache-load-misses thrpt 3 0.503 ± 9.071 #/op
LoopInc.indirectInc:·L1-dcache-loads thrpt 3 2149.806 ± 369.040 #/op
LoopInc.indirectInc:·L1-dcache-store-misses thrpt 3 0.167 ± 1.370 #/op
LoopInc.indirectInc:·L1-dcache-stores thrpt 3 1073.895 ± 186.741 #/op
LoopInc.indirectInc:·L1-icache-load-misses thrpt 3 0.313 ± 1.275 #/op
LoopInc.indirectInc:·branch-misses thrpt 3 1.102 ± 0.375 #/op
LoopInc.indirectInc:·branches thrpt 3 2143.670 ± 228.475 #/op
LoopInc.indirectInc:·cycles thrpt 3 8701.665 ± 706.183 #/op
LoopInc.indirectInc:·dTLB-load-misses thrpt 3 0.020 ± 0.301 #/op
LoopInc.indirectInc:·dTLB-loads thrpt 3 2141.965 ± 135.852 #/op
LoopInc.indirectInc:·dTLB-store-misses thrpt 3 0.002 ± 0.029 #/op
LoopInc.indirectInc:·dTLB-stores thrpt 3 1070.376 ± 81.445 #/op
LoopInc.indirectInc:·iTLB-load-misses thrpt 3 0.007 ± 0.135 #/op
LoopInc.indirectInc:·iTLB-loads thrpt 3 0.310 ± 5.768 #/op
LoopInc.indirectInc:·instructions thrpt 3 30968.207 ± 3627.540 #/op
Oh, both benchmarks execute a comparable number of instructions. The slower one just takes more cycles (that's why CPI is also not ideal in directInc; indirectInc, however, shows a close-to-ideal CPI). If you look closely at the possible causes: there are not many cache misses, not many TLB misses, but the slow benchmark has lots of branch misses. AHA! Now we know what to look for in the generated code.
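(A quick sanity check on those counters: ~536 branch misses per op over 1000 loop iterations means roughly every second iteration mispredicts, exactly what an unpredictable 50/50 branch would give. And the cycle gap, about 26200 - 8700 ≈ 17500 cycles per op over ~535 extra misses, works out to ~30 cycles per miss, in the right ballpark for a misprediction penalty plus its knock-on effects.)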
Let's look at the generated code again. -prof perfasm conveniently highlights the jumps. And then you will see this...
directInc:
╭│ 0x00007fa0a82a50ff: jmp 0x00007fa0a82a5116
11.39% 16.90% ││ ↗ 0x00007fa0a82a5101: inc %edx ;*iinc
││ │ ; - org.openjdk.LoopInc::directInc@46 (line 18)
12.52% 23.11% ││ │↗↗ 0x00007fa0a82a5103: mov %r10,0xe8(%r11) ;*invokevirtual putLong
││ │││ ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
12.00% 8.14% ││ │││ 0x00007fa0a82a510a: inc %r8d ;*iinc
││ │││ ; - org.openjdk.LoopInc::directInc@46 (line 18)
0.03% 0.03% ││ │││ 0x00007fa0a82a510d: cmp $0x3e8,%r8d
│╰ │││ 0x00007fa0a82a5114: jge 0x00007fa0a82a50c7 ;*aload_0
│ │││ ; - org.openjdk.LoopInc::directInc@11 (line 19)
0.80% 0.91% ↘ │││ 0x00007fa0a82a5116: mov 0xf0(%r11),%r10d ;*invokevirtual getInt
│││ ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
4.28% 1.23% │││ 0x00007fa0a82a511d: test %r10d,%r10d
╭│││ 0x00007fa0a82a5120: je 0x00007fa0a82a517b ;*ifne
││││ ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
2.11% 0.01% ││││ 0x00007fa0a82a5122: movabs $0x9e3779b97f4a7c15,%r10
0.01% 0.07% ││││ 0x00007fa0a82a512c: add 0xe8(%r11),%r10 ;*ladd
││││ ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
7.73% 1.89% ││││ 0x00007fa0a82a5133: mov %r10,%r9
1.21% 1.84% ││││ 0x00007fa0a82a5136: shr $0x21,%r9
1.90% 0.03% ││││ 0x00007fa0a82a513a: xor %r10,%r9
2.02% 0.03% ││││ 0x00007fa0a82a513d: movabs $0xff51afd7ed558ccd,%rcx
0.94% 1.82% ││││ 0x00007fa0a82a5147: imul %rcx,%r9 ;*lmul
││││ ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
7.01% 2.40% ││││ 0x00007fa0a82a514b: mov %r9,%rcx
││││ 0x00007fa0a82a514e: shr $0x21,%rcx
1.89% 0.70% ││││ 0x00007fa0a82a5152: xor %r9,%rcx
3.11% 2.55% ││││ 0x00007fa0a82a5155: movabs $0xc4ceb9fe1a85ec53,%r9
0.99% 1.50% ││││ 0x00007fa0a82a515f: imul %r9,%rcx
7.66% 2.89% ││││ 0x00007fa0a82a5163: shr $0x20,%rcx
3.70% 1.97% ││││ 0x00007fa0a82a5167: mov %ecx,%r9d
0.11% ││││ 0x00007fa0a82a516a: and $0x1,%r9d ;*iand
││││ ; - java.util.concurrent.ThreadLocalRandom::nextInt@34 (line 356)
3.76% 11.13% ││││ 0x00007fa0a82a516e: cmp $0x1,%r9d
│╰││ 0x00007fa0a82a5172: je 0x00007fa0a82a5101
10.48% 16.62% │ ││ 0x00007fa0a82a5174: test %r9d,%r9d
│ ╰│ 0x00007fa0a82a5177: je 0x00007fa0a82a5103 ;*lookupswitch
│ │ ; - org.openjdk.LoopInc::directInc@15 (line 19)
│ ╰ 0x00007fa0a82a5179: jmp 0x00007fa0a82a5103 ;*aload_0
│ ; - org.openjdk.LoopInc::directInc@11 (line 19)
↘ 0x00007fa0a82a517b: mov $0xffffff5d,%esi
indirectInc:
0.01% 0.01% ↗ 0x00007f65588d8260: mov %edx,%r9d
0.01% │ 0x00007f65588d8263: nopw 0x0(%rax,%rax,1)
11.99% 11.38% │ 0x00007f65588d826c: data16 data16 xchg %ax,%ax ;*iconst_0
│ ; - org.openjdk.LoopInc::indirectInc@11 (line 34)
│ 0x00007f65588d8270: mov 0xf0(%r8),%r10d ;*invokevirtual getInt
│ ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
│ 0x00007f65588d8277: test %r10d,%r10d
│ 0x00007f65588d827a: je 0x00007f65588d8331 ;*ifne
│ ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
0.01% │ 0x00007f65588d8280: movabs $0x9e3779b97f4a7c15,%r10
11.80% 11.49% │ 0x00007f65588d828a: add 0xe8(%r8),%r10 ;*ladd
│ ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
0.01% 0.01% │ 0x00007f65588d8291: mov %r10,0xe8(%r8) ;*invokevirtual putLong
│ ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
│ 0x00007f65588d8298: mov %r9d,%edx
0.01% 0.01% │ 0x00007f65588d829b: inc %edx
11.12% 12.40% │ 0x00007f65588d829d: mov %r10,%rcx
0.01% │ 0x00007f65588d82a0: shr $0x21,%rcx
0.03% │ 0x00007f65588d82a4: xor %r10,%rcx
0.06% 0.03% │ 0x00007f65588d82a7: movabs $0xff51afd7ed558ccd,%r10
12.38% 13.94% │ 0x00007f65588d82b1: imul %r10,%rcx ;*lmul
│ ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
0.03% 0.01% │ 0x00007f65588d82b5: mov %rcx,%r10
│ 0x00007f65588d82b8: shr $0x21,%r10
0.03% │ 0x00007f65588d82bc: xor %rcx,%r10
11.43% 12.62% │ 0x00007f65588d82bf: movabs $0xc4ceb9fe1a85ec53,%rcx
0.01% │ 0x00007f65588d82c9: imul %rcx,%r10
0.34% 0.30% │ 0x00007f65588d82cd: shr $0x20,%r10
0.85% 0.76% │ 0x00007f65588d82d1: mov %r10d,%r10d
11.81% 11.51% │ 0x00007f65588d82d4: and $0x1,%r10d
2.16% 1.78% │ 0x00007f65588d82d8: cmp $0x1,%r10d
3.45% 3.00% │ 0x00007f65588d82dc: cmovne %r9d,%edx <----- HERE IT IS
17.55% 15.86% │ 0x00007f65588d82e0: inc %r11d ;*iinc
│ ; - org.openjdk.LoopInc::indirectInc@56 (line 33)
│ 0x00007f65588d82e3: cmp $0x3e8,%r11d
╰ 0x00007f65588d82ea: jl 0x00007f65588d8260 ;*if_icmpge
; - org.openjdk.LoopInc::indirectInc@8 (line 33)
Notice the cmovne instead of a jump: this is why we have "better predictable" branches. HotSpot profiles the branches, and emits a conditional move when the branch profile is very flat. In other words, it dodges a very likely branch misprediction by paying a bit for the added latency of a conditional move. However, in this case, the switch is special: it has more than two alternatives (0, 1, and "nothing"). This is why, I speculate, the result increment is not being folded into a cmov. (Generally speaking, HotSpot could have stored zero to result in "default", but it blew it, oh well)
To confirm that hypothesis, let's make a directCompleteInc case, where we still use a switch, but now cover all the cases:
@Benchmark
public int directCompleteInc() {
    int result = 0;
    for (int i = 0; i < 1000; i++) {
        switch (getValue()) {
            case 1:
                result++;
                break;
            default:
                break;
        }
    }
    return result;
}
...and measure it, and this time without any options, like the OP did:
Benchmark Mode Cnt Score Error Units
LoopInc.directCompleteInc thrpt 5 644.414 ± 0.371 ops/ms
LoopInc.directInc thrpt 5 174.974 ± 0.103 ops/ms
LoopInc.indirectInc thrpt 5 644.015 ± 0.533 ops/ms
THERE.
Confirm directCompleteInc is using cmov with -prof perfasm. It does.
Drink up.