I have a problem when trying to convert bytes to String in Java, with code like:
byte[] bytes = {1, 2, -3};
byte[] transferred = new String(bytes, Charsets.UTF_8).getBytes(Charsets.UTF_8);
and the original bytes are not the same as the transferred bytes, which are respectively
[1, 2, -3]
[1, 2, -17, -65, -67]
At first I thought it was due to the UTF-8 mapping of the negative byte -3, so I changed it to -32. But the transferred array comes out the same!
[1, 2, -32]
[1, 2, -17, -65, -67]
So I really want to know: what exactly happens when I call new String(bytes)? :)
A byte array must always be decoded with an explicit charset. The simplest way to convert a byte[] into a String is the String constructor that takes the bytes and the charset:

String string = new String(b, StandardCharsets.UTF_8); // decode with UTF-8

Relying on the platform default charset only works reliably for seven-bit ASCII data. For anything beyond ASCII you must specify the encoding, which defines how characters map to byte sequences. Bytes are binary data while a String is character data, so it is important to know the encoding the byte array was originally produced with: decoding with a different charset will not give the original string back.
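To illustrate the round-trip point: ISO-8859-1 maps every byte value 0x00–0xFF to a character, so it decodes and re-encodes arbitrary bytes losslessly, whereas UTF-8 does not. A minimal sketch (class name is just for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        // ISO-8859-1 assigns a character to every byte value,
        // so decode-then-encode returns exactly the original bytes
        String s = new String(bytes, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(bytes, back)); // true
    }
}
```

This is why ISO-8859-1 is sometimes used to smuggle raw bytes through String APIs, while UTF-8 silently replaces any invalid sequence.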
Not all sequences of bytes are valid in UTF-8.
UTF-8 is a smart scheme with a variable number of bytes per code point, the form of every byte indicating how many other bytes follow for the same code point.
Refer to this table of UTF-8 byte patterns (the x bits carry the code point):

    0xxxxxxx                              single-byte character (ASCII)
    10xxxxxx                              continuation byte
    110xxxxx 10xxxxxx                     start of a 2-byte sequence
    1110xxxx 10xxxxxx 10xxxxxx            start of a 3-byte sequence
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   start of a 4-byte sequence
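The lead-byte classification above can be sketched as a small helper that inspects the high bits of a byte (class and method names are just for illustration):

```java
public class Utf8Lead {
    // Classify a UTF-8 byte by its high bits, mirroring the table above
    static String classify(byte b) {
        int u = b & 0xFF; // treat the byte as unsigned 0-255
        if (u < 0x80) return "single-byte (0xxxxxxx)";
        if (u < 0xC0) return "continuation (10xxxxxx)";
        if (u < 0xE0) return "lead of 2-byte sequence (110xxxxx)";
        if (u < 0xF0) return "lead of 3-byte sequence (1110xxxx)";
        if (u < 0xF8) return "lead of 4-byte sequence (11110xxx)";
        return "invalid in current UTF-8 (0xF8-0xFF)";
    }

    public static void main(String[] args) {
        System.out.println(classify((byte) 1));   // single-byte (0xxxxxxx)
        System.out.println(classify((byte) -32)); // lead of 3-byte sequence (1110xxxx)
        System.out.println(classify((byte) -3));  // invalid in current UTF-8 (0xF8-0xFF)
    }
}
```

Note that -3 (0xFD) falls in the range that today's UTF-8 standard (RFC 3629) rejects outright, and -32 (0xE0) is a 3-byte lead that is only valid when followed by two continuation bytes.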
Now let's see how it applies to your {1, 2, -3}:
Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.
Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not contain such a sequence.
Your UTF-8 is invalid. The Java UTF-8 decoder replaces the invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, code point U+FFFD is encoded as hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), which as signed Java bytes is -17, -65, -67.
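Putting it together, a minimal demonstration of the replacement (using StandardCharsets instead of Guava's Charsets, otherwise equivalent to your code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Replacement {
    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        // The invalid byte -3 (0xFD) is decoded to U+FFFD,
        // which re-encodes as 0xEF 0xBF 0xBD (-17, -65, -67)
        String s = new String(bytes, StandardCharsets.UTF_8);
        byte[] transferred = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(transferred)); // [1, 2, -17, -65, -67]
    }
}
```

If you need a lossless round trip, either keep the data as byte[] or use a charset such as ISO-8859-1 that can represent every byte value.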