
What happens under the hood when bytes are converted to a String in Java?

I have a problem when trying to convert bytes to String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.UTF_8).getBytes(Charsets.UTF_8);

and the original bytes are not the same as the transferred bytes, which are respectively

[1, 2, -3]
[1, 2, -17, -65, -67]

At first I thought this was due to how the UTF-8 charset maps the negative value -3, so I changed it to -32. But the transferred array stays the same!

[1, 2, -32]
[1, 2, -17, -65, -67] 

So I strongly want to know exactly what happens when I call new String(bytes) :)

user1702713 asked May 28 '15 15:05



1 Answer

Not all sequences of bytes are valid in UTF-8.

UTF-8 is a smart scheme with a variable number of bytes per code point: the form of the first byte of each sequence indicates how many continuation bytes follow for the same code point.

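Refer to this table of byte forms (the original UTF-8 design; the 5- and 6-byte forms are no longer allowed):

Bytes in sequence   First byte   Following bytes
1                   0xxxxxxx     (none)
2                   110xxxxx     10xxxxxx
3                   1110xxxx     10xxxxxx 10xxxxxx
4                   11110xxx     10xxxxxx 10xxxxxx 10xxxxxx
5                   111110xx     10xxxxxx (4 times)
6                   1111110x     10xxxxxx (5 times)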

Now let's see how it applies to your {1, 2, -3}:

Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not have such a sequence.
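The byte-by-byte reading above can be reproduced with a small sketch. The class name Utf8LeadByte and the helpers announcedLength and bits are hypothetical names made up for illustration; they are not part of the JDK. The sketch only inspects the bit pattern of a single byte under the original (up to 6-byte) UTF-8 design and does not check whether the rest of the sequence is valid.

```java
public class Utf8LeadByte {
    // Sequence length announced by the form of a byte, or 0 for a
    // continuation byte (10xxxxxx), or -1 for 0xFE/0xFF, which never
    // appear in UTF-8. Hypothetical helper, for illustration only.
    static int announcedLength(byte b) {
        int u = b & 0xFF;          // view the signed Java byte as 0..255
        if (u < 0x80) return 1;    // 0xxxxxxx: stands alone
        if (u < 0xC0) return 0;    // 10xxxxxx: continuation byte
        if (u < 0xE0) return 2;    // 110xxxxx
        if (u < 0xF0) return 3;    // 1110xxxx
        if (u < 0xF8) return 4;    // 11110xxx
        if (u < 0xFC) return 5;    // 111110xx (obsolete form)
        if (u < 0xFE) return 6;    // 1111110x (obsolete form)
        return -1;
    }

    // Eight-digit binary form of a byte, left-padded with zeros.
    static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        for (byte b : new byte[] {1, 2, -3, -32}) {
            System.out.println(b + " = " + bits(b)
                    + ", announced length " + announcedLength(b));
        }
    }
}
```

Both -3 and -32 announce a multi-byte sequence, but the array ends right after them, so neither sequence can be completed.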

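Your UTF-8 is invalid. The Java UTF-8 decoder replaces the invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, code point U+FFFD is encoded as hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), which Java shows as the signed bytes -17, -65, -67. The same happens with -32 (hex 0xE0, binary 11100000): it is the start byte of a 3-byte sequence, but no continuation bytes follow, so it is also replaced by U+FFFD.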
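A minimal sketch of both behaviours, using only JDK classes: new String(bytes, charset) substitutes U+FFFD silently, while a CharsetDecoder configured with CodingErrorAction.REPORT throws instead. The class name Utf8RoundTrip and the helpers roundTrip and isValidUtf8 are hypothetical names chosen for this example, and StandardCharsets.UTF_8 is used in place of the Guava Charsets.UTF_8 constant from the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Lenient path: malformed input is replaced by U+FFFD during decoding,
    // so the re-encoded bytes can differ from the input.
    static byte[] roundTrip(byte[] in) {
        String decoded = new String(in, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Strict path: the decoder reports malformed input as an exception
    // instead of substituting a replacement character.
    static boolean isValidUtf8(byte[] in) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(in));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        System.out.println(Arrays.toString(roundTrip(bytes)));
        System.out.println(isValidUtf8(bytes));
    }
}
```

If the bytes are arbitrary binary data rather than text, a round trip through String is lossy by design; checking with a strict decoder first, or keeping the data as byte[], avoids the silent substitution.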

Denys Séguret answered Sep 18 '22 06:09