
UTF-8 in Java's String.getBytes(Charset)

I read some documentation about the String.getBytes(Charset) method in Java.

It converts a String to a byte array (a Java byte holds values from -2^7 to 2^7-1).

As far as I know, a character in UTF-8 is encoded with 1 to 4 bytes. What happens if the code of a character in UTF-8 is larger than 2^7-1?

I tried with

String s="Hélô";

then I got 'HÃ©lÃ´' with:

String sr=new String(s.getBytes("UTF-8"),Charset.forName("UTF-8"));

I want it to return the original value 'Hélô'.

Can anybody explain this? Thanks. (Sorry for my English)

Bành Thanh Sơn asked Mar 15 '23 04:03


1 Answer

As Jon already said, the reason is that you are using different encodings. In UTF-8, the characters é and ô are encoded as two bytes each, whereas in ISO-8859-1 each takes a single byte.

ISO-8859-1: H  é  l  ô
     bytes: 48 E9 6C F4

UTF-8     : H  é    l  ô
     bytes: 48 C3A9 6C C3B4
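
The byte values above can be checked with a small program. This is only a minimal sketch: the class name EncodingBytes and the helper toHex are hypothetical names made up for illustration, and the string is written with Unicode escapes so that the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Returns the bytes of `text` in `charset` as space-separated hex pairs.
    // Masking with 0xFF maps Java's signed byte (-128..127) to 0..255,
    // which is why bytes above 2^7-1 are not a problem for getBytes.
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "H\u00e9l\u00f4"; // "Hélô" written with Unicode escapes
        System.out.println(toHex(s, StandardCharsets.ISO_8859_1)); // 48 E9 6C F4
        System.out.println(toHex(s, StandardCharsets.UTF_8));      // 48 C3 A9 6C C3 B4
    }
}
```

A byte such as C3 is stored in Java as the negative value -61, but it is still the same eight bits, so multi-byte UTF-8 sequences fit in a byte[] without loss.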

Your example of the wrong string result looks as follows in bytes:

UTF-8 bytes interpreted as ISO-8859-1
H  Ã  ©  l  Ã  ´
48 C3 A9 6C C3 B4
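
The effect can be reproduced with a short sketch, given here under stated assumptions: MojibakeDemo, roundTrip and decodeWithWrongCharset are hypothetical names, and the program prints booleans and lengths rather than the strings themselves so that the console encoding cannot distort the result.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes and decodes with the same charset: the string survives intact.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Encodes as UTF-8 but decodes as ISO-8859-1: every single byte
    // becomes its own character, so each two-byte sequence turns into two chars.
    static String decodeWithWrongCharset(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "H\u00e9l\u00f4"; // "Hélô" written with Unicode escapes

        System.out.println(roundTrip(s).equals(s)); // true

        String garbled = decodeWithWrongCharset(s);
        System.out.println(garbled.length()); // 6
        // U+00C3 U+00A9 is "Ã©" and U+00C3 U+00B4 is "Ã´"
        System.out.println(garbled.equals("H\u00c3\u00a9l\u00c3\u00b4")); // true
    }
}
```

Since encoding and decoding with UTF-8 on both sides returns the original string, garbled output from such a round trip would have to come from another step. One possibility, stated as an assumption because the question does not show its build setup, is that the source file or the console uses a different encoding than the compiler or terminal expects.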
SubOptimal answered Mar 23 '23 00:03