Some byte arrays decoded with new String(byte[], "UTF-8") return different results on JDK 1.7 and JDK 1.8
import java.util.Arrays;

public class Utf8DecodeTest {
    public static void main(String[] args) throws Exception {
        // Decode with UTF-8, then re-encode and print the result.
        byte[] bytes1 = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
        String str1 = new String(bytes1, "UTF-8");
        System.out.println(str1.length());
        byte[] out1 = str1.getBytes("UTF-8");
        System.out.println(out1.length);
        System.out.println(Arrays.toString(out1));

        byte[] bytes2 = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30};
        String str2 = new String(bytes2, "UTF-8");
        System.out.println(str2.length());
        byte[] out2 = str2.getBytes("UTF-8");
        System.out.println(out2.length);
        System.out.println(Arrays.toString(out2));
    }
}
Decoding bytes2 with new String(byte[], "UTF-8") gives a different result (str2) on JDK 7 than on JDK 8, but bytes1 decodes identically on both. What is special about bytes2?
Test the "ISO-8859-1" code, the result of bytes2 is the same in jdk1.8!
jdk1.7.0_80:
15
27
[55, 93, 97, -17, -65, -67, 4, 8, 29, 26, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 88]
15
31
[65, -17, -65, -67, -17, -65, -67, 73, 32, 68, 49, 73, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 11, -17, -65, -67]
jdk1.8.0_201:
15
27
[55, 93, 97, -17, -65, -67, 4, 8, 29, 26, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 88]
16
34
[65, -17, -65, -67, -17, -65, -67, 73, 32, 68, 49, 73, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 11, -17, -65, -67, -17, -65, -67]
We recently migrated our application from JDK 7 to JDK 8. After the change, we ran into a problem with the snippet of code above. The byte array may contain invalid UTF-8 byte sequences, and the same byte array, upon UTF-8 decoding, results in two different strings on Java 7 and Java 8.
According to the answer to this SO post, Java 8 "fixes" an error in Java 7 and replaces invalid UTF-8 byte sequences with a replacement character, which is in accordance with the UTF-8 specification. But we would like to stick with Java 7's version of the decoded string.
Starting with JDK 8, the NIO UTF-8 decoder handles ill-formed input more strictly than JDK 7 did. Note, however, that new String(byte[], "UTF-8") never throws on bad input; it silently substitutes the replacement character U+FFFD, which is exactly what both of your outputs show. An exception is thrown only when you decode through a CharsetDecoder whose error action is CodingErrorAction.REPORT, in which case the program cannot proceed past the malformed data.
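To make decoding fail loudly instead of silently substituting, you can use the CharsetDecoder API directly; a minimal sketch (variable names are illustrative, and bytes2 is the second array from the question). A freshly created decoder's error action is already CodingErrorAction.REPORT, so no extra configuration is needed:

java.nio.charset.CharsetDecoder strict = java.nio.charset.StandardCharsets.UTF_8.newDecoder();
try {
    strict.decode(java.nio.ByteBuffer.wrap(bytes2));   // REPORT is the default error action
    System.out.println("valid UTF-8");
} catch (java.nio.charset.CharacterCodingException e) {
    System.out.println("not valid UTF-8: " + e);       // a MalformedInputException for bytes2
}

This rejects bytes2 on JDK 7 and JDK 8 alike, which is one way to detect such data before it silently turns into U+FFFD characters.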
Short answer:
In the second byte array, the last 2 bytes, [-32, -30] (0b11100000_11100010), are decoded:
By JDK 7: into [-17, -65, -67], the UTF-8 encoding of one 0xFFFD (the Unicode replacement character),
By JDK 8: into [-17, -65, -67, -17, -65, -67], i.e. two 0xFFFD characters, as the snippet below demonstrates.
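A minimal sketch that reproduces just this difference (fully qualified names so it can be pasted anywhere; the printed values depend on which JDK runs it):

String s = new String(new byte[]{-32, -30}, java.nio.charset.StandardCharsets.UTF_8);
System.out.println(s.length());             // 1 on JDK 7, 2 on JDK 8
for (char c : s.toCharArray()) {
    System.out.printf("U+%04X%n", (int) c); // every char is U+FFFD
}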
Long answer:
Some of the byte sequences in your arrays are not valid UTF-8. Let's consider this code:
// bytes1 from the question, printed byte by byte as (unpadded) binary
byte[] bb = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
for (byte b : bb) System.out.println(Integer.toBinaryString(b & 0xff));
It will print (I added leading underscores manually for readability):
__110111
_1011101
_1100001
11110011
_____100
____1000
___11101
___11010
10111100
11111100
11100110
10100010
11011011
__100000
11010111
_1011000
As you can read in the UTF-8 Wikipedia article, a UTF-8 encoded string uses the following byte patterns (a sketch classifying your bytes against them follows the list):
0xxxxxxx -- for ASCII characters
110xxxxx 10xxxxxx -- for 0x0080 to 0x07ff
1110xxxx 10xxxxxx 10xxxxxx -- for 0x0800 to 0xFFFF
... and so on
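Here is a minimal sketch that classifies each byte of the first array by its lead bits (it checks only the per-byte pattern, not whether whole sequences are well-formed):

byte[] bb = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
for (byte b : bb) {
    int u = b & 0xff;
    String kind = u < 0x80 ? "ASCII (0xxxxxxx)"
            : u < 0xC0 ? "continuation (10xxxxxx)"
            : u < 0xE0 ? "2-byte lead (110xxxxx)"
            : u < 0xF0 ? "3-byte lead (1110xxxx)"
            : u < 0xF8 ? "4-byte lead (11110xxx)"
            : "never valid in UTF-8";
    System.out.printf("%8s  %s%n", Integer.toBinaryString(u), kind);
}

For example, -13 (0b11110011) announces a 4-byte sequence, but the byte after it (0b00000100) is not a 10xxxxxx continuation byte, so that sequence is ill-formed; likewise -4 (0b11111100) can never appear in valid UTF-8 at all.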
So each byte sequence that doesn't follow this encoding scheme is replaced during decoding by a single character that re-encodes as 3 bytes:
[-17, -65, -67]
In binary: 11101111 10111111 10111101
The code point bits (the x bits of the 1110xxxx 10xxxxxx 10xxxxxx pattern) are 0b11111111_11111101,
which is 0xFFFD, the Unicode REPLACEMENT CHARACTER (you can verify this below).
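A quick check of that encoding (fully qualified names so it needs no imports):

byte[] rep = "\uFFFD".getBytes(java.nio.charset.StandardCharsets.UTF_8);
System.out.println(java.util.Arrays.toString(rep)); // prints [-17, -65, -67]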
The only difference between the arrays printed by your code is in how the following bytes are processed: the last 2 bytes of your second array.
[-32, -30] is 0b11100000_11100010, which is not valid UTF-8: 0b11100000 announces a 3-byte sequence, but 0b11100010 is not a 10xxxxxx continuation byte.
JDK 7 generated a single 0xFFFD character for this sequence.
JDK 8 generated two 0xFFFD characters, one per offending byte.
The RFC 3629 standard gives no strict rule for how many replacement characters an invalid sequence should produce. The Unicode standard's recommended practice is to replace each maximal subpart of an ill-formed sequence with one U+FFFD, and JDK 8 appears to follow that recommendation (here, one 0xFFFD per offending byte), which seems to be the more correct behavior.
A separate question is why you are trying to parse such raw non-UTF-8 bytes as UTF-8 characters at all, when you probably should not be doing that.
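If the goal is a String form of arbitrary binary data that round-trips losslessly on any JDK, Base64 is the conventional choice rather than pretending the bytes are UTF-8; a minimal sketch using java.util.Base64 (available since JDK 8):

byte[] raw = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30}; // bytes2 from the question
String b64 = java.util.Base64.getEncoder().encodeToString(raw);  // printable ASCII form
byte[] back = java.util.Base64.getDecoder().decode(b64);         // exact original bytes
System.out.println(java.util.Arrays.equals(raw, back));          // true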