Java UTF-8 strange behaviour

Tags: java, utf-8

I am trying to decode some UTF-8 strings in Java. These strings contain some combining Unicode characters, such as CC 88 (combining diaeresis, U+0308). The byte sequence seems OK, according to http://www.fileformat.info/info/unicode/char/0308/index.htm

But the output after conversion to a String looks wrong. Any idea?

byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
    int value = utf8[i] & 0xFF;
    System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
System.out.println(">" + new String(utf8, "UTF-8"));

Output:

    {{69cc88}}
    >i?

Eric Nicolas asked Aug 13 '09 13:08


2 Answers

The console you're printing to (e.g. the Windows console) may not support Unicode and may mangle the characters, so console output is not a reliable representation of the data.
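
If you want to keep printing to the console, one thing to try is wrapping System.out in a PrintStream that encodes as UTF-8. This is only a sketch: the names Utf8Console and utf8Stream are made up for illustration, and it only helps if the terminal itself interprets UTF-8 and has a font containing the combining character, which is an assumption about your environment.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class Utf8Console {
    // Wraps a stream so text is encoded as UTF-8 instead of the platform default.
    // (Hypothetical helper, not part of the original question.)
    static PrintStream utf8Stream(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Same three bytes as in the question: 'i' followed by U+0308.
        String text = new String(new byte[] { 105, -52, -120 }, StandardCharsets.UTF_8);
        PrintStream out = utf8Stream(System.out);
        out.println(">" + text);
    }
}
```

If the terminal still shows a question mark or garbage, the problem is on the display side, not in the decoded String.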

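Try writing the output to a file instead, making sure the writer uses the right encoding (a FileWriter created without a charset uses the platform default; an OutputStreamWriter lets you specify UTF-8 explicitly), then open the file in a Unicode-friendly editor.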

Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.
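
The file-based check could look roughly like the following. It is a minimal sketch, assuming Java 7 or later for java.nio.file; the class name WriteUtf8, the method roundTrip, and the temp-file name are invented for the example.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteUtf8 {
    // Decodes the bytes from the question, writes the text to the given file
    // as UTF-8, and returns the bytes that ended up on disk.
    static byte[] roundTrip(Path file) throws IOException {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, StandardCharsets.UTF_8);
        // The charset is passed explicitly, so the platform default is not used.
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(text);
        }
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-check", ".txt"); // hypothetical file name
        byte[] written = roundTrip(file);
        StringBuilder hex = new StringBuilder();
        for (byte b : written) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        System.out.println(file + " contains " + hex);
    }
}
```

Opening the resulting file in an editor set to UTF-8 should show the i with the diaeresis combined on top, provided the editor's font supports combining marks.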

skaffman answered Oct 31 '22 20:10


The code is fine, but as skaffman said, your console probably doesn't support the appropriate character.

To check for certain, print out the Unicode values of the characters:

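public class Test {
    public static void main(String[] args) throws Exception {
        // Same bytes as in the question: 0x69, 0xCC, 0x88
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, "UTF-8");
        // Print each UTF-16 code unit in hex; the output is plain ASCII,
        // so the console encoding cannot distort it.
        for (int i = 0; i < text.length(); i++) {
            System.out.println(Integer.toHexString(text.charAt(i)));
        }
    }
}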

This prints 69 and 308 (one per line), which is correct (U+0069, U+0308).
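
The same check can be written per code point rather than per char, which also covers characters outside the Basic Multilingual Plane. This is a sketch assuming Java 8 or later (for String.codePoints); CodePoints and describe are illustrative names, not anything from the standard library.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class CodePoints {
    // Formats each code point as U+XXXX, so the result is plain ASCII
    // and safe to print on any console.
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        byte[] utf8 = { 105, -52, -120 };
        String text = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(describe(text)); // prints "U+0069 U+0308"
    }
}
```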

Jon Skeet answered Oct 31 '22 19:10