I've tried two ways to iterate char-by-char over java.lang.String and found them confusing. The benchmark summarizes it:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class CharByCharIterationBenchmark {
@Benchmark
public void toCharArray(Data data, Blackhole b) {
char[] chars = data.string.toCharArray();
for (char ch : chars) {
b.consume(ch);
}
}
@Benchmark
public void charAt(Data data, Blackhole b) {
String string = data.string;
int length = string.length();
for (int i = 0; i < length; i++) {
b.consume(string.charAt(i));
}
}
@State(Scope.Thread)
public static class Data {
String string;
@Param({"true", "false"})
private boolean latin;
@Param({"5", "10", "50", "100"})
private int length;
@Setup
public void setup() {
String alphabet = latin
? "abcdefghijklmnopqrstuvwxyz" // English
: "абвгдеёжзиклмнопрстуфхцчшщьыъэюя"; // Russian
RandomStringGenerator generator = new RandomStringGenerator();
string = generator.randomString(alphabet, length);
}
}
Intuitively the approach described in toCharArray()
seems to be less effective as it allocates a copy of underlying char[]
as of Java 8 and encodes byte[]
into char[]
as of Java 9 and newer. In practise however it's vice versa: toCharArray()
performs much faster:
Java 8
(latin) (length) Mode Score Error Units
charAt true 5 avgt 21.051 ± 0.796 ns/op
charAt true 10 avgt 44.002 ± 2.324 ns/op
charAt true 50 avgt 221.068 ± 7.422 ns/op
charAt true 100 avgt 410.162 ± 13.441 ns/op
toCharArray true 5 avgt 16.819 ± 0.662 ns/op
toCharArray true 10 avgt 28.364 ± 0.663 ns/op
toCharArray true 50 avgt 110.910 ± 1.144 ns/op
toCharArray true 100 avgt 205.694 ± 1.669 ns/op
charAt:·gc.alloc.rate.norm true 5 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm true 10 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm true 50 avgt ≈ 10⁻⁴ B/op
charAt:·gc.alloc.rate.norm true 100 avgt ≈ 10⁻⁴ B/op
toCharArray:·gc.alloc.rate.norm true 5 avgt 32.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 10 avgt 40.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 50 avgt 120.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 100 avgt 216.000 ± 0.001 B/op
charAt false 5 avgt 20.372 ± 0.406 ns/op
charAt false 10 avgt 39.962 ± 0.911 ns/op
charAt false 50 avgt 201.337 ± 3.752 ns/op
charAt false 100 avgt 410.530 ± 17.931 ns/op
toCharArray false 5 avgt 15.767 ± 0.606 ns/op
toCharArray false 10 avgt 26.258 ± 0.345 ns/op
toCharArray false 50 avgt 109.631 ± 1.064 ns/op
toCharArray false 100 avgt 205.815 ± 4.716 ns/op
charAt:·gc.alloc.rate.norm false 5 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm false 10 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm false 50 avgt ≈ 10⁻⁴ B/op
charAt:·gc.alloc.rate.norm false 100 avgt ≈ 10⁻⁴ B/op
toCharArray:·gc.alloc.rate.norm false 5 avgt 32.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 10 avgt 40.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 50 avgt 120.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 100 avgt 216.000 ± 0.001 B/op
Java 11
(latin) (length) Mode Score Error Units
charAt true 5 avgt 22.035 ± 1.557 ns/op
charAt true 10 avgt 41.800 ± 1.572 ns/op
charAt true 50 avgt 227.180 ± 18.655 ns/op
charAt true 100 avgt 474.719 ± 29.782 ns/op
toCharArray true 5 avgt 17.091 ± 0.662 ns/op
toCharArray true 10 avgt 26.167 ± 0.220 ns/op
toCharArray true 50 avgt 127.876 ± 2.106 ns/op
toCharArray true 100 avgt 244.449 ± 9.330 ns/op
charAt:·gc.alloc.rate.norm true 5 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm true 10 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm true 50 avgt ≈ 10⁻⁴ B/op
charAt:·gc.alloc.rate.norm true 100 avgt ≈ 10⁻⁴ B/op
toCharArray:·gc.alloc.rate.norm true 5 avgt 32.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 10 avgt 40.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 50 avgt 120.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm true 100 avgt 216.000 ± 0.001 B/op
charAt false 5 avgt 22.215 ± 2.064 ns/op
charAt false 10 avgt 45.606 ± 2.567 ns/op
charAt false 50 avgt 204.577 ± 18.302 ns/op
charAt false 100 avgt 404.056 ± 10.203 ns/op
toCharArray false 5 avgt 17.055 ± 0.556 ns/op
toCharArray false 10 avgt 29.254 ± 2.616 ns/op
toCharArray false 50 avgt 123.610 ± 5.033 ns/op
toCharArray false 100 avgt 226.174 ± 6.396 ns/op
charAt:·gc.alloc.rate.norm false 5 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm false 10 avgt ≈ 10⁻⁵ B/op
charAt:·gc.alloc.rate.norm false 50 avgt ≈ 10⁻⁴ B/op
charAt:·gc.alloc.rate.norm false 100 avgt ≈ 10⁻⁴ B/op
toCharArray:·gc.alloc.rate.norm false 5 avgt 32.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 10 avgt 40.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 50 avgt 120.000 ± 0.001 B/op
toCharArray:·gc.alloc.rate.norm false 100 avgt 216.000 ± 0.001 B/op
First I was thinking that reason here is the same as descrbed in Nitsan Wakart's article "Volatile read surprise". However doing profiling with perfasm I see that the hottest spots in the code are not related to char[]
/byte[]
field access:
╭│ 0x00007fa638407dd9: jmp 0x00007fa638407e4c
││ 0x00007fa638407ddb: nopl 0x0(%rax,%rax,1)
4.96% ││ ↗ 0x00007fa638407de0: shl $0x3,%r11
0.01% ││ │ 0x00007fa638407de4: movzwl 0x10(%r11,%r13,2),%edx ;*invokevirtual charAt {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
││ │ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.58% ││ ↗│ 0x00007fa638407dea: mov %rsi,0x18(%rsp)
1.87% ││ ││ 0x00007fa638407def: mov %r8d,0x14(%rsp)
4.18% ││ ││ 0x00007fa638407df4: mov %edi,0x10(%rsp)
0.04% ││ ││ 0x00007fa638407df8: mov %rbx,0x8(%rsp)
1.29% ││ ││ 0x00007fa638407dfd: mov %r10,(%rsp)
1.83% ││ ││ 0x00007fa638407e01: mov %r9,0x70(%rsp)
4.32% ││ ││ 0x00007fa638407e06: mov %rax,0x60(%rsp)
0.05% ││ ││ 0x00007fa638407e0b: mov %r9,%rsi
1.27% ││ ││ 0x00007fa638407e0e: nop
1.88% ││ ││ 0x00007fa638407e0f: callq 0x00007fa630926e00 ; ImmutableOopMap{[96]=Oop [104]=Oop [112]=Oop [120]=Oop [0]=Oop [16]=NarrowOop [24]=Oop }
││ ││ ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
││ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@28 (line 35)
││ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
││ ││ ; {optimized virtual_call}
5.71% ││ ││ 0x00007fa638407e14: inc %ebp ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@31 (line 34)
││ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.05% ││ ││ 0x00007fa638407e16: cmp 0x14(%rsp),%ebp
0.00% │╰ ││ 0x00007fa638407e1a: jge 0x00007fa638407d87 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@18 (line 34)
│ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.05% │ ││ 0x00007fa638407e20: mov 0x10(%rsp),%edi
4.24% │ ││ 0x00007fa638407e24: movsbl 0x14(%r12,%rdi,8),%ecx ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::isLatin1@7 (line 3266)
│ ││ ; - java.lang.String::charAt@1 (line 692)
│ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.86% │ ││ 0x00007fa638407e2a: mov 0xc(%r12,%rdi,8),%r11d ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::charAt@8 (line 693)
│ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
1.67% │ ││ 0x00007fa638407e2f: mov 0x60(%rsp),%rax
1.71% │ ││ 0x00007fa638407e34: mov 0x70(%rsp),%r9
3.92% │ ││ 0x00007fa638407e39: mov (%rsp),%r10
0.20% │ ││ 0x00007fa638407e3d: mov 0x8(%rsp),%rbx
1.44% │ ││ 0x00007fa638407e42: mov 0x14(%rsp),%r8d
1.70% │ ││ 0x00007fa638407e47: mov 0x18(%rsp),%rsi ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@21 (line 35)
│ ││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
3.93% ↘ ││ 0x00007fa638407e4c: movslq %ebp,%r13 ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.StringUTF16::charAt@7 (line 1268)
││ ; - java.lang.String::charAt@21 (line 695)
││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
0.23% ││ 0x00007fa638407e4f: test %ecx,%ecx
0.00% ╭││ 0x00007fa638407e51: jne 0x00007fa638407e6b ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
│││ ; - java.lang.String::charAt@4 (line 692)
│││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
│││ 0x00007fa638407e53: mov 0xc(%r12,%r11,8),%edx ; implicit exception: dispatches to 0x00007fa638407fbc
│││ 0x00007fa638407e58: cmp %edx,%ebp
│││ 0x00007fa638407e5a: jae 0x00007fa638407eb0
│││ 0x00007fa638407e5c: shl $0x3,%r11
│││ 0x00007fa638407e60: movzbl 0x10(%r11,%r13,1),%edx ;*iand {reexecute=0 rethrow=0 return_oop=0}
│││ ; - java.lang.StringLatin1::charAt@25 (line 49)
│││ ; - java.lang.String::charAt@12 (line 693)
│││ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│││ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
│╰│ 0x00007fa638407e66: jmpq 0x00007fa638407dea
1.52% ↘ │ 0x00007fa638407e6b: mov 0xc(%r12,%r11,8),%ecx ; implicit exception: dispatches to 0x00007fa638407fb0
5.99% │ 0x00007fa638407e70: sar %ecx ;*ishr {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.StringUTF16::length@3 (line 74)
│ ; - java.lang.StringUTF16::checkIndex@2 (line 1470)
│ ; - java.lang.StringUTF16::charAt@2 (line 1267)
│ ; - java.lang.String::charAt@21 (line 695)
│ ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
│ ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
5.51% │ 0x00007fa638407e72: cmp %ecx,%ebp
0.01% ╰ 0x00007fa638407e74: jb 0x00007fa638407de0 ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
It looks like the hottest place is iinc
(increment of loop index) and ishr
arythmetic shift in StringUTF16.length()
which is rather conter-intuitive to me.
Also using perfnorm profiler I see that toCharArray()
has less cycles, instructions and load-misses comparing to chatAt()
:
Benchmark Mode Cnt Score Error Units
charAt:L1-dcache-loads avgt 2104.816 #/op
charAt:L1-dcache-stores avgt 1200.878 #/op
charAt:branches avgt 603.754 #/op
charAt:cycles avgt 1461.282 #/op
charAt:dTLB-loads avgt 2105.253 #/op
charAt:dTLB-stores avgt 1200.909 #/op
charAt:instructions avgt 4716.775 #/op
toCharArray:L1-dcache-loads avgt 1026.341 #/op
toCharArray:L1-dcache-stores avgt 416.997 #/op
toCharArray:branches avgt 419.265 #/op
toCharArray:cycles avgt 820.521 #/op
toCharArray:dTLB-loads avgt 1026.506 #/op
toCharArray:dTLB-stores avgt 417.591 #/op
toCharArray:instructions avgt 2409.806 #/op
Could someone help to interpret this and explain such significant difference?
As @apangin mentioned in his comment
The problem is that BlackHole.consume is called inside the loop. Being a non-inlined black box method, it prevents from optimizing the code around the call, in particular, caching String fields.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With