Issue with getBytes() for accented characters

Question

I'm trying to convert a string with special characters like É into a string with UTF-8 encoding. I tried doing this:

String str = "MARIE-HÉLÈNE";
byte[] sByte = str.getBytes("UTF-8");
str = new String(sByte,"UTF-8");

The problem is that "É".getBytes("UTF-8") returns 63, which becomes '?' when the bytes are converted back to a string. How can I fix this?

EDIT: I also noticed that this issue is not reproducible in Eclipse, probably because the text file encoding there is usually set to UTF-8.

I tried doing byte[] str = "MARIE-HÉLÈNE".getBytes("UTF-8") in http://www.javarepl.com/console.html and got the result byte[] str = [77, 65, 82, 73, 69, 45, 72, 63, 76, 63, 78, 69]

Takahiko Kawasaki · Accepted Answer

This kind of error happens when the compiler (javac) is not told the encoding of the source file. If your source file is encoded as UTF-8, compile it as follows.

javac -encoding UTF-8 E.java

If the source file is encoded as UTF-16 Big Endian instead, compile it as follows.

javac -encoding UTF-16BE E.java

I have confirmed that the program below correctly prints "0xC3 0x89", so there is no problem in your code.

public class E
{
    public static void main(String[] args) throws Exception
    {
        byte[] bytes = "É".getBytes("UTF-8");

        for (int i = 0; i < bytes.length; ++i)
        {
            System.out.format("0x%02X ", bytes[i]);
        }

        System.out.println();
    }
}
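
If changing the compiler flags is not an option, one possible workaround is to write the accented characters as Unicode escapes, so the source file contains only ASCII and the result no longer depends on how javac decodes it. The following is a minimal sketch of that idea, not part of the original program; the class name EscapedLiteral and the helper toHex are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EscapedLiteral
{
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String toHex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();

        for (byte b : bytes)
        {
            if (sb.length() > 0)
            {
                sb.append(' ');
            }

            // %02X on a byte prints its unsigned two-digit hex value.
            sb.append(String.format("%02X", b));
        }

        return sb.toString();
    }

    public static void main(String[] args)
    {
        // The two accented letters are written as Unicode escapes
        // (U+00C9 and U+00C8), so this file is pure ASCII.
        String str = "MARIE-H\u00C9L\u00C8NE";

        // StandardCharsets.UTF_8 avoids the checked
        // UnsupportedEncodingException thrown by getBytes(String).
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

        System.out.println(toHex(bytes));
    }
}
```

Compiled with any source encoding that is a superset of ASCII, this should print 4D 41 52 49 45 2D 48 C3 89 4C C3 88 4E 45.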

Andreas · Answer

"É".getBytes("UTF-8") returns a byte[] of 2 bytes: c3 89.

"MARIE-HÉLÈNE" is 4d 41 52 49 45 2d 48 c3 89 4c c3 88 4e 45.

4d 41 52 49 45 2d 48 c3 89 4c c3 88 4e 45
M  A  R  I  E  -  H  É     L  È     N  E

Converting the bytes back using new String(bytes,"UTF-8") will restore the original string.
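
As a rough illustration of the round trip, and of where a 63 can come from in general, here is a small sketch. It assumes the character actually reaches the JVM intact (it is written as a Unicode escape to guarantee that); the class and method names are invented for this example. A 63 is what getBytes produces when the target charset cannot represent a character, so seeing it suggests that a lossy conversion happened somewhere before or during encoding. Where exactly that happened in the question's setup is not something this sketch can show.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip
{
    // Encodes to UTF-8 and decodes again; should return an equal string.
    static String utf8RoundTrip(String s)
    {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encodes to US-ASCII, which cannot represent accented letters;
    // getBytes substitutes the charset's replacement byte, '?' (63).
    static byte[] lossyAscii(String s)
    {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args)
    {
        String e = "\u00C9"; // the letter E with acute accent, U+00C9

        // UTF-8 can represent the character, so nothing is lost.
        System.out.println(utf8RoundTrip(e).equals(e));

        // Java bytes are signed, so C3 89 prints as negative numbers.
        System.out.println(Arrays.toString(e.getBytes(StandardCharsets.UTF_8)));

        // US-ASCII cannot represent it, so the result is a single 63.
        System.out.println(Arrays.toString(lossyAscii(e)));
    }
}
```

This should print true, then [-61, -119], then [63].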

Tags:

java

encoding

utf-8

Shwetha Durgashankar
