Java UTF-8 strange behaviour

Question

I am trying to decode some UTF-8 strings in Java. These strings contain some combining Unicode characters, such as CC 88 (combining diaeresis, U+0308). The byte sequence seems OK, according to http://www.fileformat.info/info/unicode/char/0308/index.htm

But the output after conversion to String looks invalid. Any ideas?

byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
    int value = utf8[i] & 0xFF;
    System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
System.out.println(">" + new String(utf8, "UTF-8"));

Output:

    {{69cc88}}
    >i?

skaffman · Accepted Answer

The console you're writing to (e.g. the Windows console) may not support Unicode, and may mangle the characters. Console output is not a reliable representation of the data.
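If the terminal itself can display UTF-8, one thing worth trying is wrapping System.out in a PrintStream with an explicit encoding, so the JVM does not fall back to the platform default. The sketch below is only an illustration: the class name Utf8Console and the helper utf8Stream are hypothetical, and whether the result renders correctly still depends on the terminal's own encoding and font.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8Console {
    // Hypothetical helper: wraps any stream so that text is encoded as UTF-8
    // instead of the platform default encoding.
    static PrintStream utf8Stream(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, StandardCharsets.UTF_8);

        // The bytes sent to the console are now 69 CC 88; what is displayed
        // is still up to the terminal.
        PrintStream out = utf8Stream(System.out);
        out.println(">" + text);
    }
}
```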

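Try writing the output to a file instead, making sure the writer uses the correct encoding (FileWriter uses the platform default encoding unless you pass it a charset, which requires Java 11 or later; an OutputStreamWriter with an explicit UTF-8 charset works on older versions), then open the file in a Unicode-friendly editor.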

Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.
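As a rough sketch of the file-based check, assuming Java 7 or later: the decoded string is written with an explicit UTF-8 writer and the file's bytes are read back for comparison. The class name Utf8FileCheck, the helper writeDecoded, and the temporary file name are hypothetical choices for illustration only.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Utf8FileCheck {
    // Hypothetical helper: decodes the bytes as UTF-8, then writes the
    // resulting text to the target file, again encoded as UTF-8.
    static void writeDecoded(byte[] utf8, Path target) throws IOException {
        String text = new String(utf8, StandardCharsets.UTF_8);
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = { 105, -52, -120 };
        Path target = Files.createTempFile("utf8-check", ".txt");
        writeDecoded(utf8, target);

        // If decoding and encoding both worked, the file holds the same three bytes.
        byte[] written = Files.readAllBytes(target);
        System.out.println(Arrays.equals(utf8, written)); // prints "true"
        System.out.println(target); // open this file in a Unicode-aware editor
    }
}
```

The byte comparison confirms the round trip without relying on any display; opening the file in an editor then shows whether the glyph itself renders.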

Jon Skeet · Answer

The code is fine, but as skaffman said, your console probably doesn't support the appropriate character.

To test for sure, you need to print out the Unicode value of each character:

public class Test {
    public static void main(String[] args) throws Exception {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, "UTF-8");
        for (int i=0; i < text.length(); i++) {
            System.out.println(Integer.toHexString(text.charAt(i)));
        }
    }
}

This prints 69 and 308, which is correct (U+0069, U+0308).
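The loop above prints UTF-16 code units, which match the code points for these two characters. A variant that iterates over code points and formats them in U+XXXX notation might look like the following, assuming Java 8 or later; the class name CodePointDump and the helper describe are hypothetical. Because the output is pure ASCII, it reads the same on any console.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: renders each code point of the string as U+XXXX,
    // separated by spaces.
    static String describe(String text) {
        return text.codePoints()
                   .mapToObj(cp -> String.format("U+%04X", cp))
                   .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(describe(text)); // prints "U+0069 U+0308"
    }
}
```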

Tags: java, utf-8

Eric Nicolas
