I read some documentation about the String.getBytes(Charset) method in Java.
It converts a String to a byte array (a byte can hold values from -2^7 to 2^7-1).
As far as I know, a character in UTF-8 is encoded with 1 to 4 bytes. What happens if the code of a character in UTF-8 is larger than 2^7-1?
I tried with
String s="Hélô";
then I got 'HÃ©lÃ´' with:
String sr=new String(s.getBytes("UTF-8"),Charset.forName("UTF-8"));
I want it to return the original value 'Hélô'.
Can anybody explain this? Thanks. (Sorry for my English)
As Jon already said, the reason is that you are using different encodings. In UTF-8, the characters é and ô are encoded as two bytes each.
ISO-8859-1: H  é    l  ô
bytes:      48 E9   6C F4
UTF-8:      H  é    l  ô
bytes:      48 C3A9 6C C3B4
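The byte values above can be reproduced with a short sketch like the following. The class name EncodingBytes and the helper toHex are made up for illustration, and the string is written with Unicode escapes so the result does not depend on the source file's encoding. It also touches on the original question about values above 2^7-1: Java bytes are signed, so the UTF-8 byte C3 is stored as -61, and masking with 0xFF recovers the unsigned value 195.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class EncodingBytes {

    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF converts the signed byte (-128..127) to 0..255.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "H\u00E9l\u00F4"; // "Hélô"

        // One byte per character in ISO-8859-1.
        System.out.println(toHex(s.getBytes(StandardCharsets.ISO_8859_1)));
        // Two bytes each for é and ô in UTF-8.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));

        // The first byte of é in UTF-8 is 0xC3, which is -61 as a signed byte.
        byte first = s.getBytes(StandardCharsets.UTF_8)[1];
        System.out.println(first + " -> " + (first & 0xFF));
    }
}
```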
Your example of the wrong string result is, in bytes, as follows:
UTF-8 bytes interpreted as ISO-8859-1
H  Ã  ©  l  Ã  ´
48 C3 A9 6C C3 B4
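As a rough sketch of that mismatch, the following encodes the string as UTF-8 and then decodes the same bytes once with ISO-8859-1 and once with UTF-8. The class and method names (MojibakeDemo, misdecode, roundTrip) are invented for this example. Note that what the console actually prints also depends on the console's own encoding, which may be where the garbling comes from in practice.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class MojibakeDemo {

    // Encodes as UTF-8 but decodes as ISO-8859-1: every byte becomes one character.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same charset, so the string survives unchanged.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "H\u00E9l\u00F4"; // "Hélô"

        String wrong = misdecode(s);
        String right = roundTrip(s);

        // 6 characters: H, U+00C3, U+00A9, l, U+00C3, U+00B4
        System.out.println(wrong.length() + " chars: " + wrong);
        // 4 characters, equal to the original
        System.out.println(right.length() + " chars: " + right + " equal=" + right.equals(s));
    }
}
```

Using the StandardCharsets constants instead of the string name "UTF-8" avoids the checked UnsupportedEncodingException that getBytes(String) declares.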