What happens under the hood when bytes are converted to a String in Java?

Tags: java, string, unicode, utf-8, byte

I have a problem when trying to convert bytes to String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.UTF_8).getBytes(Charsets.UTF_8);

The original bytes and the transferred bytes are not the same. They are, respectively:

[1, 2, -3]
[1, 2, -17, -65, -67]

I first thought this was due to the UTF-8 charset mapping for the negative value "-3", so I changed it to "-32". But the transferred array remains the same!

[1, 2, -32]
[1, 2, -17, -65, -67]

So I strongly want to know exactly what happens when I call new String(bytes) :)

asked May 28 '15 15:05

user1702713

1 Answer

Not all sequences of bytes are valid in UTF-8.

UTF-8 is a smart scheme with a variable number of bytes per code point, the form of every byte indicating how many other bytes follow for the same code point.

Refer to this table:

[image: table, https://i.stack.imgur.com/494lP.png]
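As a rough illustration of how the form of a byte announces the length of its sequence, here is a minimal sketch. The class and method names (Utf8LeadByte, announcedLength) are made up for this example, and it follows the original 1-to-6-byte layout rather than the current standard, which stops at 4 bytes.

```java
final class Utf8LeadByte {

    /**
     * Returns the sequence length announced by a lead byte under the
     * original 1-to-6-byte UTF-8 layout, or -1 if the byte cannot start
     * a sequence (a continuation byte, or 0xFE / 0xFF).
     */
    static int announcedLength(byte b) {
        int u = b & 0xFF;                   // Java bytes are signed; view as 0..255
        if ((u & 0x80) == 0x00) return 1;   // 0xxxxxxx: stands alone
        if ((u & 0xC0) == 0x80) return -1;  // 10xxxxxx: continuation byte
        if ((u & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((u & 0xF0) == 0xE0) return 3;   // 1110xxxx
        if ((u & 0xF8) == 0xF0) return 4;   // 11110xxx
        if ((u & 0xFC) == 0xF8) return 5;   // 111110xx (no longer legal)
        if ((u & 0xFE) == 0xFC) return 6;   // 1111110x (no longer legal)
        return -1;                          // 0xFE and 0xFF are never used
    }

    public static void main(String[] args) {
        // The bytes from the question, including the second attempt (-32).
        for (byte b : new byte[] {1, 2, -3, -32}) {
            System.out.printf("%4d -> 0x%02X -> %d%n", b, b & 0xFF, announcedLength(b));
        }
    }
}
```

With the bytes from the question, 1 and 2 announce a length of 1, -3 announces 6, and -32 announces 3.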

Now let's see how it applies to your {1, 2, -3}:

Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not contain the five continuation bytes that such a sequence requires.
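The hex and binary forms quoted above come from reading Java's signed bytes as unsigned values. A small sketch of that conversion, assuming a hypothetical helper class named ByteBits:

```java
final class ByteBits {

    /** Formats a signed Java byte as an 8-digit binary string. */
    static String bits(byte b) {
        // Mask with 0xFF so negative bytes are read as 128..255, not sign-extended.
        String s = Integer.toBinaryString(b & 0xFF);
        // Left-pad with zeros to a fixed width of 8 digits.
        return String.format("%8s", s).replace(' ', '0');
    }

    /** Formats a signed Java byte as two hex digits, e.g. 0xFD. */
    static String hex(byte b) {
        return String.format("0x%02X", b & 0xFF);
    }

    public static void main(String[] args) {
        for (byte b : new byte[] {1, 2, -3}) {
            System.out.println(b + " = " + hex(b) + " = " + bits(b));
        }
    }
}
```

Without the mask, -3 would be widened to the int -3 and print as 32 binary digits, which hides the 1111110x pattern that matters here.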

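Your UTF-8 is invalid. The Java UTF-8 decoder replaces the invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, U+FFFD is encoded as hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), which Java's signed bytes show as -17, -65, -67. The same thing happens with -32 (hex 0xE0, binary 11100000): it is the start byte of a 3-byte sequence, but no continuation bytes follow it, so it is replaced with U+FFFD as well.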
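To see the replacement happen, and to detect invalid input instead of silently replacing it, something like the following sketch could be used. It assumes the JDK's StandardCharsets.UTF_8 in place of the Charsets.UTF_8 constant from the question, and the class and method names (Utf8RoundTrip, roundTrip, isValidUtf8) are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTrip {

    /** Decodes leniently (malformed input becomes U+FFFD), then re-encodes. */
    static byte[] roundTrip(byte[] in) {
        return new String(in, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true only if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] in) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // throw, do not replace
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(in));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        // Lenient path, as in the question: prints [1, 2, -17, -65, -67]
        System.out.println(Arrays.toString(roundTrip(bytes)));
        // Strict path: prints false, because -3 cannot be decoded
        System.out.println(isValidUtf8(bytes));
    }
}
```

If the goal is to carry arbitrary binary data through a String, a check like isValidUtf8 makes the data loss visible up front. A round trip through UTF-8 only preserves byte arrays that were valid UTF-8 to begin with.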

answered Sep 18 '22 06:09

Denys Séguret
