I am trying to decode some UTF-8 strings in Java. These strings contain some combining unicode characters, such as CC 88 (combining diaresis). The character sequence seems ok, according to http://www.fileformat.info/info/unicode/char/0308/index.htm
But the output after conversion to String is invalid. Any idea ?
byte[] utf8 = { 105, -52, -120 };
System.out.print("{{");
for(int i = 0; i < utf8.length; ++i)
{
int value = utf8[i] & 0xFF;
System.out.print(Integer.toHexString(value));
}
System.out.println("}}");
System.out.println(">" + new String(utf8, "UTF-8"));
Output:
{{69cc88}} >i?
The console which you're outputting to (e.g. windows) may not support unicode, and may mangle the characters. The console output is not a good representation of the data.
Try writing the output to a file instead, making sure the encoding is correct on the FileWriter, then open the file in a unicode-friendly editor.
Alternatively, use a debugger to make sure the characters are what you expect. Just don't trust the console.
The code is fine, but as skaffman said your console probably doesn't support the appropriate character.
To test for sure, you need to print out the unicode values of the character:
public class Test {
public static void main(String[] args) throws Exception {
byte[] utf8 = { 105, -52, -120 };
String text = new String(utf8, "UTF-8");
for (int i=0; i < text.length(); i++) {
System.out.println(Integer.toHexString(text.charAt(i)));
}
}
}
This prints 69, 308 - which is correct (U+0069, U+0308).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With