Looking into UTF8 decoding performance, I noticed the performance of protobuf's UnsafeProcessor::decodeUtf8
is better than String(byte[] bytes, int offset, int length, Charset charset)
for the following non ascii string: "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen"
.
I tried to figure out why, so I copied the relevant code in String
and replaced the array accesses with unsafe array accesses, same as UnsafeProcessor::decodeUtf8
.
Here are the JMH benchmark results:
Benchmark Mode Cnt Score Error Units
StringBenchmark.safeDecoding avgt 10 127.107 ± 3.642 ns/op
StringBenchmark.unsafeDecoding avgt 10 100.915 ± 4.090 ns/op
I assume the difference is due to missing bounds checking elimination which I expected to kick in, especially since there is an explicit bounds check in the form of a call to checkBoundsOffCount(offset, length, bytes.length)
in the beginning of String(byte[] bytes, int offset, int length, Charset charset)
.
Is the issue really a missing bounds check elimination?
Here's the code I benchmarked using OpenJDK 17 & JMH. Note that this is only part of the String(byte[] bytes, int offset, int length, Charset charset)
constructor code, and works correctly only for this specific German String.
The static methods were copied from String
.
Look for the // the unsafe version:
comments that indicate where I replaced the safe access with unsafe.
private static byte[] safeDecode(byte[] bytes, int offset, int length) {
checkBoundsOffCount(offset, length, bytes.length);
int sl = offset + length;
int dp = 0;
byte[] dst = new byte[length];
while (offset < sl) {
int b1 = bytes[offset];
// the unsafe version:
// int b1 = UnsafeUtil.getByte(bytes, offset);
if (b1 >= 0) {
dst[dp++] = (byte)b1;
offset++;
continue;
}
if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
offset + 1 < sl) {
// the unsafe version:
// int b2 = UnsafeUtil.getByte(bytes, offset + 1);
int b2 = bytes[offset + 1];
if (!isNotContinuation(b2)) {
dst[dp++] = (byte)decode2(b1, b2);
offset += 2;
continue;
}
}
// anything not a latin1, including the repl
// we have to go with the utf16
break;
}
if (offset == sl) {
if (dp != dst.length) {
dst = Arrays.copyOf(dst, dp);
}
return dst;
}
return dst;
}
Apparently if I change the while loop condition from offset < sl
to 0 <= offset && offset < sl
I get similar performance in both versions:
Benchmark Mode Cnt Score Error Units
StringBenchmark.safeDecoding avgt 10 100.802 ± 13.147 ns/op
StringBenchmark.unsafeDecoding avgt 10 102.774 ± 3.893 ns/op
This question was picked up by HotSpot developers as https://bugs.openjdk.java.net/browse/JDK-8278518.
Optimizing this code ended up giving a 2.5x boost to decoding the above Latin1 string.
This C2 optimization closes the unbelievable more than 7x gap between commonBranchFirst
and commonBranchSecond
in the below benchmark and will land in Java 19.
Benchmark Mode Cnt Score Error Units
LoopBenchmark.commonBranchFirst avgt 25 1737.111 ± 56.526 ns/op
LoopBenchmark.commonBranchSecond avgt 25 232.798 ± 12.676 ns/op
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LoopBenchmark {
private final boolean[] mostlyTrue = new boolean[1000];
@Setup
public void setup() {
for (int i = 0; i < mostlyTrue.length; i++) {
mostlyTrue[i] = i % 100 > 0;
}
}
@Benchmark
public int commonBranchFirst() {
int i = 0;
while (i < mostlyTrue.length) {
if (mostlyTrue[i]) {
i++;
} else {
i += 2;
}
}
return i;
}
@Benchmark
public int commonBranchSecond() {
int i = 0;
while (i < mostlyTrue.length) {
if (!mostlyTrue[i]) {
i += 2;
} else {
i++;
}
}
return i;
}
}
In computer science, bounds-checking elimination is a compiler optimization useful in programming languages or runtime systems that enforce bounds checking, the practice of checking every index into an array to verify that the index is within the defined valid range of indexes.
One technique for bounds-checking elimination is to use a typed static single assignment form representation and for each array to create a new type representing a safe index for that particular array.
It is usually used to ensure that a number fits into a given type (range checking), or that a variable being used as an array index is within the bounds of the array (index checking). A failed bounds check usually results in the generation of some sort of exception signal.
However, it may still be possible for the compiler or runtime to perform proper bounds-checking elimination in this case. One technique for bounds-checking elimination is to use a typed static single assignment form representation and for each array to create a new type representing a safe index for that particular array.
To measure the branch you are interested in and particularly the scenario when while
loop becomes hot, I've used the following benchmark:
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class StringConstructorBenchmark {
private byte[] array;
@Setup
public void setup() {
String str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. Я";
array = str.getBytes(StandardCharsets.UTF_8);
}
@Benchmark
public String newString() {
return new String(array, 0, array.length, StandardCharsets.UTF_8);
}
}
And indeed, with modified constructor it does give significant improvement:
//baseline
Benchmark Mode Cnt Score Error Units
StringConstructorBenchmark.newString avgt 50 173,092 ± 3,048 ns/op
//patched
Benchmark Mode Cnt Score Error Units
StringConstructorBenchmark.newString avgt 50 126,908 ± 2,355 ns/op
This is likely to be a HotSpot issue: optimizing compiler for some reason failed to eliminate array bounds check within while
-loop. I guess the reason is that offset
is modified within the loop:
while (offset < sl) {
int b1 = bytes[offset];
if (b1 >= 0) {
dst[dp++] = (byte)b1;
offset++; // <---
continue;
}
if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
offset + 1 < sl) {
int b2 = bytes[offset + 1];
if (!isNotContinuation(b2)) {
dst[dp++] = (byte)decode2(b1, b2);
offset += 2;
continue;
}
}
// anything not a latin1, including the repl
// we have to go with the utf16
break;
}
Also I've looked into the code via LinuxPerfAsmProfiler
, here's the link for baseline https://gist.github.com/stsypanov/d2524f98477d633fb1d4a2510fedeea6 and this one is for patched constructor https://gist.github.com/stsypanov/16c787e4f9fa3dd122522f16331b68b7
What should one pay attention to? Let's find the code corresponding int b1 = bytes[offset];
(line 538). In baseline we have this:
3.62% ││ │ 0x00007fed70eb4c1c: mov %ebx,%ecx
2.29% ││ │ 0x00007fed70eb4c1e: mov %edx,%r9d
2.22% ││ │ 0x00007fed70eb4c21: mov (%rsp),%r8 ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - java.lang.String::<init>@107 (line 537)
2.32% ↘│ │ 0x00007fed70eb4c25: cmp %r13d,%ecx
│ │ 0x00007fed70eb4c28: jge 0x00007fed70eb5388 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - java.lang.String::<init>@110 (line 537)
3.05% │ │ 0x00007fed70eb4c2e: cmp 0x8(%rsp),%ecx
│ │ 0x00007fed70eb4c32: jae 0x00007fed70eb5319
2.38% │ │ 0x00007fed70eb4c38: mov %r8,(%rsp)
2.64% │ │ 0x00007fed70eb4c3c: movslq %ecx,%r8
2.46% │ │ 0x00007fed70eb4c3f: mov %rax,%rbx
3.44% │ │ 0x00007fed70eb4c42: sub %r8,%rbx
2.62% │ │ 0x00007fed70eb4c45: add $0x1,%rbx
2.64% │ │ 0x00007fed70eb4c49: and $0xfffffffffffffffe,%rbx
2.30% │ │ 0x00007fed70eb4c4d: mov %ebx,%r8d
3.08% │ │ 0x00007fed70eb4c50: add %ecx,%r8d
2.55% │ │ 0x00007fed70eb4c53: movslq %r8d,%r8
2.45% │ │ 0x00007fed70eb4c56: add $0xfffffffffffffffe,%r8
2.13% │ │ 0x00007fed70eb4c5a: cmp (%rsp),%r8
│ │ 0x00007fed70eb4c5e: jae 0x00007fed70eb5319
3.36% │ │ 0x00007fed70eb4c64: mov %ecx,%edi ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - java.lang.String::<init>@113 (line 538)
2.86% │ ↗│ 0x00007fed70eb4c66: movsbl 0x10(%r14,%rdi,1),%r8d ;*baload {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::<init>@115 (line 538)
2.48% │ ││ 0x00007fed70eb4c6c: mov %r9d,%edx
2.26% │ ││ 0x00007fed70eb4c6f: inc %edx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::<init>@127 (line 540)
3.28% │ ││ 0x00007fed70eb4c71: mov %edi,%ebx
2.44% │ ││ 0x00007fed70eb4c73: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::<init>@134 (line 541)
2.35% │ ││ 0x00007fed70eb4c75: test %r8d,%r8d
╰ ││ 0x00007fed70eb4c78: jge 0x00007fed70eb4c04 ;*iflt {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.String::<init>@120 (line 539)
and in patched code the corresponding part is
17.28% ││ 0x00007f6b88eb6061: mov %edx,%r10d ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.String::<init>@107 (line 537)
0.11% ↘│ 0x00007f6b88eb6064: test %r10d,%r10d
│ 0x00007f6b88eb6067: jl 0x00007f6b88eb669c ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@108 (line 537)
0.39% │ 0x00007f6b88eb606d: cmp %r13d,%r10d
│ 0x00007f6b88eb6070: jge 0x00007f6b88eb66d0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@114 (line 537)
0.66% │ 0x00007f6b88eb6076: mov %ebx,%r9d
13.70% │ 0x00007f6b88eb6079: cmp 0x8(%rsp),%r10d
0.01% │ 0x00007f6b88eb607e: jae 0x00007f6b88eb6671
0.14% │ 0x00007f6b88eb6084: movsbl 0x10(%r14,%r10,1),%edi ;*baload {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@119 (line 538)
0.37% │ 0x00007f6b88eb608a: mov %r9d,%ebx
0.99% │ 0x00007f6b88eb608d: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@131 (line 540)
12.88% │ 0x00007f6b88eb608f: movslq %r9d,%rsi ;*bastore {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@196 (line 548)
0.17% │ 0x00007f6b88eb6092: mov %r10d,%edx
0.39% │ 0x00007f6b88eb6095: inc %edx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@138 (line 541)
0.96% │ 0x00007f6b88eb6097: test %edi,%edi
0.02% │ 0x00007f6b88eb6099: jl 0x00007f6b88eb60dc ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@124 (line 539)
In baseline between if_icmpge
and aload_1
byte-code instructions we have bounds check, but we don't have one in patched code.
So your original assumptions is correct: it is about missing bounds check elimination.
UPD I must correct my answer: it turned out that bounds check is still there:
13.70% │ 0x00007f6b88eb6079: cmp 0x8(%rsp),%r10d
0.01% │ 0x00007f6b88eb607e: jae 0x00007f6b88eb6671
and the code I've pointed out is something that the compiler introduces, but it does nothing. The issue itself is still about bounds check as its explicit declaration solves the issue ad hoc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With