Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Could one explain significant performance difference in char-by-char iteration over j.l.String?

I've tried two ways to iterate char-by-char over java.lang.String and found them confusing. The benchmark summarizes it:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class CharByCharIterationBenchmark {

  @Benchmark
  public void toCharArray(Data data, Blackhole b) {
    char[] chars = data.string.toCharArray();
    for (char ch : chars) {
      b.consume(ch);
    }
  }

  @Benchmark
  public void charAt(Data data, Blackhole b) {
    String string = data.string;
    int length = string.length();
    for (int i = 0; i < length; i++) {
      b.consume(string.charAt(i));
    }
  }

  @State(Scope.Thread)
  public static class Data {
    String string;

    @Param({"true", "false"})
    private boolean latin;

    @Param({"5", "10", "50", "100"})
    private int length;

    @Setup
    public void setup() {
      String alphabet = latin
        ? "abcdefghijklmnopqrstuvwxyz"        // English
        : "абвгдеёжзиклмнопрстуфхцчшщьыъэюя"; // Russian

      RandomStringGenerator generator = new RandomStringGenerator();

      string = generator.randomString(alphabet, length);
    }
  }

Intuitively the approach described in toCharArray() seems to be less effective as it allocates a copy of underlying char[] as of Java 8 and encodes byte[] into char[] as of Java 9 and newer. In practise however it's vice versa: toCharArray() performs much faster:

Java 8

                                 (latin)  (length)  Mode      Score     Error   Units
charAt                              true         5  avgt     21.051 ±   0.796   ns/op
charAt                              true        10  avgt     44.002 ±   2.324   ns/op
charAt                              true        50  avgt    221.068 ±   7.422   ns/op
charAt                              true       100  avgt    410.162 ±  13.441   ns/op

toCharArray                         true         5  avgt     16.819 ±   0.662   ns/op
toCharArray                         true        10  avgt     28.364 ±   0.663   ns/op
toCharArray                         true        50  avgt    110.910 ±   1.144   ns/op
toCharArray                         true       100  avgt    205.694 ±   1.669   ns/op

charAt:·gc.alloc.rate.norm          true         5  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          true        10  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          true        50  avgt     ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          true       100  avgt     ≈ 10⁻⁴              B/op

toCharArray:·gc.alloc.rate.norm     true         5  avgt     32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true        10  avgt     40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true        50  avgt    120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true       100  avgt    216.000 ±   0.001    B/op

charAt                              false        5  avgt     20.372 ±   0.406   ns/op
charAt                              false       10  avgt     39.962 ±   0.911   ns/op
charAt                              false       50  avgt    201.337 ±   3.752   ns/op
charAt                              false      100  avgt    410.530 ±  17.931   ns/op

toCharArray                         false        5  avgt     15.767 ±   0.606   ns/op
toCharArray                         false       10  avgt     26.258 ±   0.345   ns/op
toCharArray                         false       50  avgt    109.631 ±   1.064   ns/op
toCharArray                         false      100  avgt    205.815 ±   4.716   ns/op

charAt:·gc.alloc.rate.norm          false        5  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false       10  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false       50  avgt     ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          false      100  avgt     ≈ 10⁻⁴              B/op

toCharArray:·gc.alloc.rate.norm     false        5  avgt     32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       10  avgt     40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       50  avgt    120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false      100  avgt    216.000 ±   0.001    B/op


Java 11


                                  (latin)  (length)  Mode     Score     Error   Units
charAt                               true         5  avgt    22.035 ±   1.557   ns/op
charAt                               true        10  avgt    41.800 ±   1.572   ns/op
charAt                               true        50  avgt   227.180 ±  18.655   ns/op
charAt                               true       100  avgt   474.719 ±  29.782   ns/op

toCharArray                          true         5  avgt    17.091 ±   0.662   ns/op
toCharArray                          true        10  avgt    26.167 ±   0.220   ns/op
toCharArray                          true        50  avgt   127.876 ±   2.106   ns/op
toCharArray                          true       100  avgt   244.449 ±   9.330   ns/op

charAt:·gc.alloc.rate.norm           true         5  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm           true        10  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm           true        50  avgt    ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm           true       100  avgt    ≈ 10⁻⁴              B/op

toCharArray:·gc.alloc.rate.norm      true         5  avgt    32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true        10  avgt    40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true        50  avgt   120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true       100  avgt   216.000 ±   0.001    B/op

charAt                              false         5  avgt    22.215 ±   2.064   ns/op
charAt                              false        10  avgt    45.606 ±   2.567   ns/op
charAt                              false        50  avgt   204.577 ±  18.302   ns/op
charAt                              false       100  avgt   404.056 ±  10.203   ns/op

toCharArray                         false         5  avgt    17.055 ±   0.556   ns/op
toCharArray                         false        10  avgt    29.254 ±   2.616   ns/op
toCharArray                         false        50  avgt   123.610 ±   5.033   ns/op
toCharArray                         false       100  avgt   226.174 ±   6.396   ns/op

charAt:·gc.alloc.rate.norm          false         5  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false        10  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false        50  avgt    ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          false       100  avgt    ≈ 10⁻⁴              B/op

toCharArray:·gc.alloc.rate.norm     false         5  avgt    32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false        10  avgt    40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false        50  avgt   120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       100  avgt   216.000 ±   0.001    B/op

First I was thinking that reason here is the same as descrbed in Nitsan Wakart's article "Volatile read surprise". However doing profiling with perfasm I see that the hottest spots in the code are not related to char[]/byte[] field access:

           ╭│     0x00007fa638407dd9: jmp    0x00007fa638407e4c
           ││     0x00007fa638407ddb: nopl   0x0(%rax,%rax,1)
  4.96%    ││  ↗  0x00007fa638407de0: shl    $0x3,%r11
  0.01%    ││  │  0x00007fa638407de4: movzwl 0x10(%r11,%r13,2),%edx  ;*invokevirtual charAt {reexecute=0 rethrow=0 return_oop=0}
           ││  │                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           ││  │                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.58%    ││ ↗│  0x00007fa638407dea: mov    %rsi,0x18(%rsp)
  1.87%    ││ ││  0x00007fa638407def: mov    %r8d,0x14(%rsp)
  4.18%    ││ ││  0x00007fa638407df4: mov    %edi,0x10(%rsp)
  0.04%    ││ ││  0x00007fa638407df8: mov    %rbx,0x8(%rsp)
  1.29%    ││ ││  0x00007fa638407dfd: mov    %r10,(%rsp)
  1.83%    ││ ││  0x00007fa638407e01: mov    %r9,0x70(%rsp)
  4.32%    ││ ││  0x00007fa638407e06: mov    %rax,0x60(%rsp)
  0.05%    ││ ││  0x00007fa638407e0b: mov    %r9,%rsi
  1.27%    ││ ││  0x00007fa638407e0e: nop
  1.88%    ││ ││  0x00007fa638407e0f: callq  0x00007fa630926e00  ; ImmutableOopMap{[96]=Oop [104]=Oop [112]=Oop [120]=Oop [0]=Oop [16]=NarrowOop [24]=Oop }
           ││ ││                                                ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
           ││ ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@28 (line 35)
           ││ ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
           ││ ││                                                ;   {optimized virtual_call}
  5.71%    ││ ││  0x00007fa638407e14: inc    %ebp               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
           ││ ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@31 (line 34)
           ││ ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.05%    ││ ││  0x00007fa638407e16: cmp    0x14(%rsp),%ebp
  0.00%    │╰ ││  0x00007fa638407e1a: jge    0x00007fa638407d87  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
           │  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@18 (line 34)
           │  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.05%    │  ││  0x00007fa638407e20: mov    0x10(%rsp),%edi
  4.24%    │  ││  0x00007fa638407e24: movsbl 0x14(%r12,%rdi,8),%ecx  ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
           │  ││                                                ; - java.lang.String::isLatin1@7 (line 3266)
           │  ││                                                ; - java.lang.String::charAt@1 (line 692)
           │  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           │  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.86%    │  ││  0x00007fa638407e2a: mov    0xc(%r12,%rdi,8),%r11d  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
           │  ││                                                ; - java.lang.String::charAt@8 (line 693)
           │  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           │  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  1.67%    │  ││  0x00007fa638407e2f: mov    0x60(%rsp),%rax
  1.71%    │  ││  0x00007fa638407e34: mov    0x70(%rsp),%r9
  3.92%    │  ││  0x00007fa638407e39: mov    (%rsp),%r10
  0.20%    │  ││  0x00007fa638407e3d: mov    0x8(%rsp),%rbx
  1.44%    │  ││  0x00007fa638407e42: mov    0x14(%rsp),%r8d
  1.70%    │  ││  0x00007fa638407e47: mov    0x18(%rsp),%rsi    ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
           │  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@21 (line 35)
           │  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.93%    ↘  ││  0x00007fa638407e4c: movslq %ebp,%r13          ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
              ││                                                ; - java.lang.StringUTF16::charAt@7 (line 1268)
              ││                                                ; - java.lang.String::charAt@21 (line 695)
              ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
              ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.23%       ││  0x00007fa638407e4f: test   %ecx,%ecx
  0.00%      ╭││  0x00007fa638407e51: jne    0x00007fa638407e6b  ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
             │││                                                ; - java.lang.String::charAt@4 (line 692)
             │││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
             │││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
             │││  0x00007fa638407e53: mov    0xc(%r12,%r11,8),%edx  ; implicit exception: dispatches to 0x00007fa638407fbc
             │││  0x00007fa638407e58: cmp    %edx,%ebp
             │││  0x00007fa638407e5a: jae    0x00007fa638407eb0
             │││  0x00007fa638407e5c: shl    $0x3,%r11
             │││  0x00007fa638407e60: movzbl 0x10(%r11,%r13,1),%edx  ;*iand {reexecute=0 rethrow=0 return_oop=0}
             │││                                                ; - java.lang.StringLatin1::charAt@25 (line 49)
             │││                                                ; - java.lang.String::charAt@12 (line 693)
             │││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
             │││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
             │╰│  0x00007fa638407e66: jmpq   0x00007fa638407dea
  1.52%      ↘ │  0x00007fa638407e6b: mov    0xc(%r12,%r11,8),%ecx  ; implicit exception: dispatches to 0x00007fa638407fb0
  5.99%        │  0x00007fa638407e70: sar    %ecx               ;*ishr {reexecute=0 rethrow=0 return_oop=0}
               │                                                ; - java.lang.StringUTF16::length@3 (line 74)
               │                                                ; - java.lang.StringUTF16::checkIndex@2 (line 1470)
               │                                                ; - java.lang.StringUTF16::charAt@2 (line 1267)
               │                                                ; - java.lang.String::charAt@21 (line 695)
               │                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
               │                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  5.51%        │  0x00007fa638407e72: cmp    %ecx,%ebp
  0.01%        ╰  0x00007fa638407e74: jb     0x00007fa638407de0  ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}

It looks like the hottest place is iinc (increment of loop index) and ishr arythmetic shift in StringUTF16.length() which is rather conter-intuitive to me.

Also using perfnorm profiler I see that toCharArray() has less cycles, instructions and load-misses comparing to chatAt():

Benchmark                          Mode  Cnt     Score   Error  Units

charAt:L1-dcache-loads             avgt       2104.816           #/op
charAt:L1-dcache-stores            avgt       1200.878           #/op
charAt:branches                    avgt        603.754           #/op
charAt:cycles                      avgt       1461.282           #/op
charAt:dTLB-loads                  avgt       2105.253           #/op
charAt:dTLB-stores                 avgt       1200.909           #/op
charAt:instructions                avgt       4716.775           #/op

toCharArray:L1-dcache-loads        avgt       1026.341           #/op
toCharArray:L1-dcache-stores       avgt        416.997           #/op
toCharArray:branches               avgt        419.265           #/op
toCharArray:cycles                 avgt        820.521           #/op
toCharArray:dTLB-loads             avgt       1026.506           #/op
toCharArray:dTLB-stores            avgt        417.591           #/op
toCharArray:instructions           avgt       2409.806           #/op

Could someone help to interpret this and explain such significant difference?

like image 387
Sergey Tsypanov Avatar asked Aug 10 '20 10:08

Sergey Tsypanov


1 Answers

As @apangin mentioned in his comment

The problem is that BlackHole.consume is called inside the loop. Being a non-inlined black box method, it prevents from optimizing the code around the call, in particular, caching String fields.

like image 125
Sergey Tsypanov Avatar answered Sep 29 '22 14:09

Sergey Tsypanov