Scala 2.11 compiles a match
expression over a relatively dense Int
range into a lookupswitch
:
lookupswitch { // 21
-12: 200
-11: 200
-10: 184
-9: 190
-8: 190
-7: 190
-6: 190
-5: 190
-4: 200
-1: 200
2: 195
3: 195
4: 195
5: 195
6: 184
7: 184
12: 184
13: 184
18: 184
21: 184
25: 184
default: 180
}
Whereas Java 7 compiles the equivalent switch
statement into a tableswitch
:
tableswitch { // -12 to 25
-12: 168
-11: 168
-10: 177
-9: 174
-8: 174
-7: 174
-6: 174
-5: 174
-4: 168
-3: 185
-2: 185
-1: 168
0: 185
1: 185
2: 171
3: 171
4: 171
5: 171
6: 177
7: 177
8: 185
9: 185
10: 185
11: 185
12: 181
13: 181
14: 185
15: 185
16: 185
17: 185
18: 181
19: 185
20: 185
21: 181
22: 185
23: 185
24: 185
25: 181
default: 185
}
Is there some way to force Scala into generating a tableswitch
as well?
You should not care about the bytecode, since modern JVMs are smart enough to compile both lookupswitch
and tableswitch
in a similarly efficient way.
Intuitively tableswitch
should be faster, and this is also suggested by
JVM specification:
Thus, a tableswitch instruction is probably more efficient than a lookupswitch where space considerations permit a choice.
However, the spec was written 20+ years ago, when JVM had no JIT compiler. Is there a performance difference in a modern HotSpot JVM?
package bench;
import org.openjdk.jmh.annotations.*;
@State(Scope.Benchmark)
public class SwitchBench {
@Param({"1", "2", "3", "4", "5", "6", "7", "8"})
int n;
@Benchmark
public long lookupSwitch() {
return Switch.lookupSwitch(n);
}
@Benchmark
public long tableSwitch() {
return Switch.tableSwitch(n);
}
}
To have precise control over the bytecode, I build Switch
class with Jasmin.
.class public bench/Switch
.super java/lang/Object
.method public static lookupSwitch(I)I
.limit stack 1
iload_0
lookupswitch
1 : One
2 : Two
3 : Three
4 : Four
5 : Five
6 : Six
7 : Seven
default : Other
One:
bipush 11
ireturn
Two:
bipush 22
ireturn
Three:
bipush 33
ireturn
Four:
bipush 44
ireturn
Five:
bipush 55
ireturn
Six:
bipush 66
ireturn
Seven:
bipush 77
ireturn
Other:
bipush -1
ireturn
.end method
.method public static tableSwitch(I)I
.limit stack 1
iload_0
tableswitch 1
One
Two
Three
Four
Five
Six
Seven
default : Other
One:
bipush 11
ireturn
Two:
bipush 22
ireturn
Three:
bipush 33
ireturn
Four:
bipush 44
ireturn
Five:
bipush 55
ireturn
Six:
bipush 66
ireturn
Seven:
bipush 77
ireturn
Other:
bipush -1
ireturn
.end method
The results show no performance difference between lookupswitch/tableswitch benchmarks, but there is a small variation depending on switch argument:
Let's verify the theory by looking at generated assembly code.
The following JVM option will help: -XX:CompileCommand=print,bench.Switch::*
# {method} {0x0000000017498a48} 'lookupSwitch' '(I)I' in 'bench/Switch'
# parm0: rdx = int
# [sp+0x20] (sp of caller)
0x000000000329b240: sub $0x18,%rsp
0x000000000329b247: mov %rbp,0x10(%rsp) ;*synchronization entry
; - bench.Switch::lookupSwitch@-1
0x000000000329b24c: cmp $0x4,%edx
0x000000000329b24f: je 0x000000000329b2a5
0x000000000329b251: cmp $0x4,%edx
0x000000000329b254: jg 0x000000000329b281
0x000000000329b256: cmp $0x2,%edx
0x000000000329b259: je 0x000000000329b27a
0x000000000329b25b: cmp $0x2,%edx
0x000000000329b25e: jg 0x000000000329b273
0x000000000329b260: cmp $0x1,%edx
0x000000000329b263: jne 0x000000000329b26c ;*lookupswitch
; - bench.Switch::lookupSwitch@1
...
What we see here is a binary search starting with a mid value 4 (this explains why case 4 has the best performance in the graph above).
But the interesting thing is that tableSwitch
is compiled exactly the same way!
# {method} {0x0000000017528b18} 'tableSwitch' '(I)I' in 'bench/Switch'
# parm0: rdx = int
# [sp+0x20] (sp of caller)
0x000000000332c280: sub $0x18,%rsp
0x000000000332c287: mov %rbp,0x10(%rsp) ;*synchronization entry
; - bench.Switch::tableSwitch@-1
0x000000000332c28c: cmp $0x4,%edx
0x000000000332c28f: je 0x000000000332c2e5
0x000000000332c291: cmp $0x4,%edx
0x000000000332c294: jg 0x000000000332c2c1
0x000000000332c296: cmp $0x2,%edx
0x000000000332c299: je 0x000000000332c2ba
0x000000000332c29b: cmp $0x2,%edx
0x000000000332c29e: jg 0x000000000332c2b3
0x000000000332c2a0: cmp $0x1,%edx
0x000000000332c2a3: jne 0x000000000332c2ac ;*tableswitch
; - bench.Switch::tableSwitch@1
...
But wait... Why binary search, not a jump table?
HotSpot JVM has an heuristics to generate a jump table only for switches with 10+ cases. This can be altered by the option -XX:MinJumpTableSize=
.
OK, let's extend our test case with some more labels and check the generated code once again.
# {method} {0x0000000017288a68} 'lookupSwitch' '(I)I' in 'bench/Switch'
# parm0: rdx = int
# [sp+0x20] (sp of caller)
0x000000000307ecc0: sub $0x18,%rsp ; {no_reloc}
0x000000000307ecc7: mov %rbp,0x10(%rsp) ;*synchronization entry
; - bench.Switch::lookupSwitch@-1
0x000000000307eccc: mov %edx,%r10d
0x000000000307eccf: dec %r10d
0x000000000307ecd2: cmp $0xa,%r10d
0x000000000307ecd6: jb 0x000000000307ece9
0x000000000307ecd8: mov $0xffffffff,%eax
0x000000000307ecdd: add $0x10,%rsp
0x000000000307ece1: pop %rbp
0x000000000307ece2: test %eax,-0x1faece8(%rip) # 0x00000000010d0000
; {poll_return}
0x000000000307ece8: retq
0x000000000307ece9: movslq %edx,%r10
0x000000000307ecec: movabs $0x307ec60,%r11 ; {section_word}
0x000000000307ecf6: jmpq *-0x8(%r11,%r10,8) ;*lookupswitch
; - bench.Switch::lookupSwitch@1
^^^^^^^^^^^^^^^^^^^^^^^^^
Yes! Here is our computed jump instruction. Note, this is generated for lookupswitch
. But there will be exactly the same one for tableswitch
.
Amazingly, HotSpot JVM can generate jump tables even for switches with gaps and with outliers. -XX:MaxJumpTableSparseness
controls how large the gaps can be. E.g. if there are labels from 1 to 10, then from 13 to 20 and the last label with the value 99 - JIT will generate a guard test for the value 99, and for the rest labels it will create a table.
HotSpot source code will finally convince there should be no performance difference between lookupswitch
and tableswitch
after a method is JIT-compiled with C2. That's basically because the parsing of both instructions ends up with a call to the same jump_switch_ranges
function that works for an arbitrary set of labels.
As we saw, HotSpot JVM may compile tableswitch
using a binary search and lookupswitch
using a jump table, or vice versa. This depends on the number and the density of labels, but not on the bytecode itself.
So, answering your original question - you don't need to!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With