UTF-8 in Java's String.getBytes(Charset)

Question

I read some documentation about the String.getBytes(Charset) method in Java.

It is used to convert a String to a byte array (a byte can hold values from -2^7 to 2^7-1).

As far as I know, each character in UTF-8 is encoded using 1 to 4 bytes. What happens if the code of a character in UTF-8 is larger than 2^7-1?

I tried with

String s = "Hélô";

and I got 'HÃ©lÃ´' with:

String sr = new String(s.getBytes("UTF-8"), Charset.forName("UTF-8"));

I want it to return the original value 'Hélô'.

Can anybody explain this? Thanks. (Sorry for my English)

SubOptimal · Accepted Answer

As Jon already said, the reason is that you are using different encodings. In UTF-8, the characters é and ô are each encoded as two bytes, whereas in ISO-8859-1 each takes a single byte.

ISO-8859-1: H  é  l  ô
     bytes: 48 E9 6C F4

UTF-8     : H  é     l  ô
     bytes: 48 C3 A9 6C C3 B4
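
The byte values above can be checked with a short program. This is only a minimal sketch: the class name EncodingBytes and the helper toHex are made up for illustration, and the string is written with \u escapes so the result does not depend on the encoding of the source file. It also shows what happens to values above 2^7-1: the byte array still holds them, but Java reads each byte as a signed number.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Hypothetical helper: format each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Java bytes are signed (-128..127); masking with 0xFF
            // recovers the unsigned value 0..255 for display.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "H\u00e9l\u00f4"; // "Hélô" written with Unicode escapes

        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(toHex(latin1)); // 48 E9 6C F4
        System.out.println(toHex(utf8));   // 48 C3 A9 6C C3 B4

        // The first byte of é in UTF-8 is 0xC3 (195), stored as a signed byte.
        System.out.println(utf8[1]);       // -61
    }
}
```

So a byte value above 2^7-1 is not lost; it just shows up as a negative number when printed as a Java byte.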

Your wrong string result corresponds to the following bytes:

UTF-8 bytes interpreted as ISO-8859-1
H  Ã  ©  l  Ã  ´
48 C3 A9 6C C3 B4
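
The mismatch can be reproduced on purpose. The following is a minimal sketch, not the asker's actual setup: the class MojibakeDemo and the helper roundTrip are hypothetical names, and the assumption is that the bytes are produced as UTF-8 and then decoded as ISO-8859-1 somewhere along the way. It prints booleans rather than the strings themselves so the output does not depend on the console encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encode as UTF-8, then decode with the given charset.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String s = "H\u00e9l\u00f4"; // "Hélô"

        // Decoding UTF-8 bytes as ISO-8859-1 turns each byte into one character.
        String garbled = roundTrip(s, StandardCharsets.ISO_8859_1);
        // \u00c3 = Ã, \u00a9 = ©, \u00b4 = ´
        System.out.println(garbled.equals("H\u00c3\u00a9l\u00c3\u00b4")); // true
        System.out.println(garbled.length());                             // 6

        // Decoding with the same charset used for encoding restores the text.
        String intact = roundTrip(s, StandardCharsets.UTF_8);
        System.out.println(intact.equals(s));                             // true
    }
}
```

Because encoding and decoding with the same charset returns the original string, the garbled output in the question presumably comes from a mismatch elsewhere, for example in how the source file is read by the compiler or how the console displays the result. That part is an assumption, since the question does not show it.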

Tags:

java

character-encoding

encoding

utf-8

Bành Thanh Sơn

