Some byte arrays decoded with new String(byte[], "UTF-8") return different results on JDK 1.7 and JDK 1.8
import java.util.Arrays;

public class Utf8DecodeTest {
    public static void main(String[] args) throws Exception {
        // Decode with UTF-8, then re-encode and print the result.
        byte[] bytes1 = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
        String str1 = new String(bytes1, "UTF-8");
        System.out.println(str1.length());
        byte[] out1 = str1.getBytes("UTF-8");
        System.out.println(out1.length);
        System.out.println(Arrays.toString(out1));

        byte[] bytes2 = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30};
        String str2 = new String(bytes2, "UTF-8");
        System.out.println(str2.length());
        byte[] out2 = str2.getBytes("UTF-8");
        System.out.println(out2.length);
        System.out.println(Arrays.toString(out2));
    }
}
Decoding bytes2 with new String(byte[], "UTF-8") gives a different result (str2) on JDK 7 than on JDK 8, but bytes1 decodes identically on both. What is special about bytes2?
Test the "ISO-8859-1" code, the result of bytes2 is the same in jdk1.8!
jdk1.7.0_80:
15
27
[55, 93, 97, -17, -65, -67, 4, 8, 29, 26, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 88]
15
31
[65, -17, -65, -67, -17, -65, -67, 73, 32, 68, 49, 73, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 11, -17, -65, -67]
jdk1.8.0_201:
15
27
[55, 93, 97, -17, -65, -67, 4, 8, 29, 26, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 88]
16
34
[65, -17, -65, -67, -17, -65, -67, 73, 32, 68, 49, 73, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 11, -17, -65, -67, -17, -65, -67]
We recently migrated our application from JDK 7 to JDK 8. After the change, we ran into a problem with the snippet of code above. The byte array may contain invalid UTF-8 byte sequences, and the same byte array, upon UTF-8 decoding, results in two different strings on Java 7 and Java 8.
According to the answer to this SO post, Java 8 "fixes" an error in Java 7 and replaces invalid UTF-8 byte sequences with a replacement character, which is in accordance with the UTF-8 specification. But we would like to stick with Java 7's version of the decoded string.
Starting with JDK 8, the NIO UTF-8 decoder handles ill-formed input more strictly than JDK 7 did. Note, however, that new String(byte[], "UTF-8") never throws on bad input; it silently substitutes the replacement character U+FFFD, which is exactly what both of your outputs show. An exception is thrown only when you decode through a CharsetDecoder whose error action is CodingErrorAction.REPORT, in which case the program cannot proceed past the malformed data.
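To make decoding fail loudly instead of silently substituting, you can use the CharsetDecoder API directly; a minimal sketch (variable names are illustrative, and bytes2 is the second array from the question). A freshly created decoder's error action is already CodingErrorAction.REPORT, so no extra configuration is needed:

java.nio.charset.CharsetDecoder strict = java.nio.charset.StandardCharsets.UTF_8.newDecoder();
try {
    strict.decode(java.nio.ByteBuffer.wrap(bytes2));   // REPORT is the default error action
    System.out.println("valid UTF-8");
} catch (java.nio.charset.CharacterCodingException e) {
    System.out.println("not valid UTF-8: " + e);       // a MalformedInputException for bytes2
}

This rejects bytes2 on JDK 7 and JDK 8 alike, which is one way to detect such data before it silently turns into U+FFFD characters.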
Short answer:
In the second byte array, the last 2 bytes, [-32, -30] (0b11100000_11100010), are decoded:
By JDK 7: into [-17, -65, -67], the UTF-8 encoding of one 0xFFFD (the Unicode replacement character),
By JDK 8: into [-17, -65, -67, -17, -65, -67], i.e. two 0xFFFD characters, as the snippet below demonstrates.
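A minimal sketch that reproduces just this difference (fully qualified names so it can be pasted anywhere; the printed values depend on which JDK runs it):

String s = new String(new byte[]{-32, -30}, java.nio.charset.StandardCharsets.UTF_8);
System.out.println(s.length());             // 1 on JDK 7, 2 on JDK 8
for (char c : s.toCharArray()) {
    System.out.printf("U+%04X%n", (int) c); // every char is U+FFFD
}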
Long answer:
Some of the byte sequences in your arrays are not valid UTF-8. Let's consider this code:
// bytes1 from the question, printed byte by byte as (unpadded) binary
byte[] bb = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
for (byte b : bb) System.out.println(Integer.toBinaryString(b & 0xff));
It will print (I added leading underscores manually for readability):
__110111
_1011101
_1100001
11110011
_____100
____1000
___11101
___11010
10111100
11111100
11100110
10100010
11011011
__100000
11010111
_1011000
As you can read in the UTF-8 Wikipedia article, a UTF-8 encoded string uses the following byte patterns (a sketch classifying your bytes against them follows the list):
0xxxxxxx -- for ASCII characters
110xxxxx 10xxxxxx -- for 0x0080 to 0x07ff
1110xxxx 10xxxxxx 10xxxxxx -- for 0x0800 to 0xFFFF
... and so on
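Here is a minimal sketch that classifies each byte of the first array by its lead bits (it checks only the per-byte pattern, not whether whole sequences are well-formed):

byte[] bb = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
for (byte b : bb) {
    int u = b & 0xff;
    String kind = u < 0x80 ? "ASCII (0xxxxxxx)"
            : u < 0xC0 ? "continuation (10xxxxxx)"
            : u < 0xE0 ? "2-byte lead (110xxxxx)"
            : u < 0xF0 ? "3-byte lead (1110xxxx)"
            : u < 0xF8 ? "4-byte lead (11110xxx)"
            : "never valid in UTF-8";
    System.out.printf("%8s  %s%n", Integer.toBinaryString(u), kind);
}

For example, -13 (0b11110011) announces a 4-byte sequence, but the byte after it (0b00000100) is not a 10xxxxxx continuation byte, so that sequence is ill-formed; likewise -4 (0b11111100) can never appear in valid UTF-8 at all.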
So each byte sequence that doesn't follow this encoding scheme is replaced during decoding by a single character that re-encodes as 3 bytes:
[-17, -65, -67]
In binary: 11101111 10111111 10111101
The code point bits (the x bits of the 1110xxxx 10xxxxxx 10xxxxxx pattern) are 0b11111111_11111101,
which is 0xFFFD, the Unicode REPLACEMENT CHARACTER (you can verify this below).
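A quick check of that encoding (fully qualified names so it needs no imports):

byte[] rep = "\uFFFD".getBytes(java.nio.charset.StandardCharsets.UTF_8);
System.out.println(java.util.Arrays.toString(rep)); // prints [-17, -65, -67]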
The only difference between the arrays printed by your code is in how the following bytes are processed: the last 2 bytes of your second array.
[-32, -30] is 0b11100000_11100010, which is not valid UTF-8: 0b11100000 announces a 3-byte sequence, but 0b11100010 is not a 10xxxxxx continuation byte.
JDK 7 generated a single 0xFFFD character for this sequence.
JDK 8 generated two 0xFFFD characters, one per offending byte.
The RFC 3629 standard gives no strict rule for how many replacement characters an invalid sequence should produce. The Unicode standard's recommended practice is to replace each maximal subpart of an ill-formed sequence with one U+FFFD, and JDK 8 appears to follow that recommendation (here, one 0xFFFD per offending byte), which seems to be the more correct behavior.
A separate question is why you are trying to parse such raw non-UTF-8 bytes as UTF-8 characters at all, when you probably should not be doing that.
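If the goal is a String form of arbitrary binary data that round-trips losslessly on any JDK, Base64 is the conventional choice rather than pretending the bytes are UTF-8; a minimal sketch using java.util.Base64 (available since JDK 8):

byte[] raw = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30}; // bytes2 from the question
String b64 = java.util.Base64.getEncoder().encodeToString(raw);  // printable ASCII form
byte[] back = java.util.Base64.getDecoder().decode(b64);         // exact original bytes
System.out.println(java.util.Arrays.equals(raw, back));          // true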