
Convert ISO8859 String to UTF8? ÄÖÜ => ÃÃ why?

What's the problem with this code? I created an ISO-8859 String, so most of the ÄÖÜ characters come out as cryptic output. That's fine. But how do I convert them back to normal characters (UTF-8 or something)?

    String s = new String("Üü?öäABC".getBytes(), "ISO-8859-15");

    System.out.println(s);
    //ÃÃŒ?öÀABC => ok(?)
    System.out.println(new String(s.getBytes(), "ISO-8859-15"));
    //ÃÂÃÅ?öÃâ¬ABC => ok(?)
    System.out.println(new String(s.getBytes(), "UTF-8"));
    //ÃÃŒ?öÀABC => huh?

Lissy asked May 30 '11 10:05


3 Answers

A construct such as new String("Üü?öäABC".getBytes(), "ISO-8859-15"); is almost always an error.

What you're doing here is taking a String object, getting the corresponding byte[] in the platform default encoding and re-interpreting it as ISO-8859-15 to convert it back to a String.

If the platform default encoding happens to be ISO-8859-15 (or near enough to make no difference for this particular String, for example ISO-8859-1), then it is a no-op (i.e. it has no real effect).

In all other cases it will most likely destroy the String.
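A minimal sketch of those cases, with the platform default replaced by explicit charsets so the outcome does not depend on the machine. The class and the `reinterpret` helper are hypothetical names chosen for illustration, and the sketch assumes the JDK ships ISO-8859-15 (it is not one of the charsets every Java platform is required to support, although common JDKs include it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReinterpretDemo {
    // Hypothetical helper: encode with one charset, decode with another.
    // new String(s.getBytes(), "ISO-8859-15") does the same thing, with the
    // platform default as the first charset.
    static String reinterpret(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // "Üü?öäABC" written with escapes so the source file encoding cannot interfere.
        String original = "\u00DC\u00FC?\u00F6\u00E4ABC";
        Charset latin9 = Charset.forName("ISO-8859-15");

        // Same charset on both sides: a no-op.
        System.out.println(reinterpret(original, latin9, latin9).equals(original)); // true

        // ISO-8859-1 agrees with ISO-8859-15 for these characters: still a no-op.
        System.out.println(
                reinterpret(original, StandardCharsets.ISO_8859_1, latin9).equals(original)); // true

        // UTF-8 bytes read as ISO-8859-15: every umlaut turns into two characters.
        String broken = reinterpret(original, StandardCharsets.UTF_8, latin9);
        System.out.println(broken.length()); // 12 instead of 8
    }
}
```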

If you try to "fix" a String, then you're probably too late: if you have to use a specific encoding to read data, then you should use it at the point where binary data is converted to String data. For example if you read from an InputStream, you need to pass the correct encoding to the constructor of the InputStreamReader.
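As a rough illustration of decoding at the boundary, the sketch below passes the charset to the `InputStreamReader` constructor. The class name, the `readFirstLine` helper, and the in-memory byte array standing in for an ISO-8859-15 file are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class DecodeAtBoundary {
    // Hypothetical helper: reads the first line, decoding the bytes with the
    // given charset at the moment they are turned into characters.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a file known to be ISO-8859-15: the bytes for "Üü?öäABC".
        byte[] fileBytes = { (byte) 0xDC, (byte) 0xFC, 0x3F, (byte) 0xF6,
                (byte) 0xE4, 0x41, 0x42, 0x43 };
        String line = readFirstLine(new ByteArrayInputStream(fileBytes),
                Charset.forName("ISO-8859-15"));
        System.out.println(line.equals("\u00DC\u00FC?\u00F6\u00E4ABC")); // true
    }
}
```

With the charset supplied here, no later "repair" step is needed.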

Trying to fix the problem "after the fact" will be

  1. harder to do and
  2. often not even possible (because decoding a byte[] with the wrong encoding can be a destructive operation).
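To see why a wrong decode can be destructive, here is a small sketch that decodes ISO-8859-15 bytes as UTF-8. The class and helper names are made up for the example; the behaviour relied on is that `new String(byte[], Charset)` replaces malformed input with the replacement character U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyDecodeDemo {
    // Hypothetical helper: decodes bytes as UTF-8, then encodes the result as UTF-8 again.
    static byte[] decodeAndReencodeAsUtf8(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ISO-8859-15 bytes for "Üü?öäABC"; the four high bytes are not valid UTF-8 here.
        byte[] latin9Bytes = { (byte) 0xDC, (byte) 0xFC, 0x3F, (byte) 0xF6,
                (byte) 0xE4, 0x41, 0x42, 0x43 };

        // Malformed input is replaced with U+FFFD during decoding.
        String wronglyDecoded = new String(latin9Bytes, StandardCharsets.UTF_8);
        System.out.println(wronglyDecoded.indexOf('\uFFFD') >= 0); // true

        // The original byte values cannot be recovered from the decoded String.
        System.out.println(Arrays.equals(latin9Bytes, decodeAndReencodeAsUtf8(latin9Bytes))); // false
    }
}
```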

Joachim Sauer answered Sep 19 '22 14:09


I hope this will solve your problem.

    String readable = "äöüÄÖÜßáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙñÑ";

    try {
        String unreadable = new String(readable.getBytes("UTF-8"), "ISO-8859-15");
        // unreadable -> äöüÃÃÃÃáéíóúÃÃÃÃÃàèìòùÃÃÃÃÃñÃ
    } catch (UnsupportedEncodingException e) {
        // handle error
    }

And:

    String unreadable = "äöüÃÃÃÃáéíóúÃÃÃÃÃàèìòùÃÃÃÃÃñÃ";

    try {
        String readable = new String(unreadable.getBytes("ISO-8859-15"), "UTF-8");
        // readable -> äöüÄÖÜßáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙñÑ
    } catch (UnsupportedEncodingException e) {
        // ...
    }
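The same round trip as a self-contained sketch. Pasted mojibake literals tend to lose invisible control characters, so this version derives the unreadable string in code instead of hard-coding it. The class and method names are hypothetical, `StandardCharsets` (Java 7+) is used to avoid the checked `UnsupportedEncodingException`, and the JDK is assumed to provide ISO-8859-15. The repair works only because ISO-8859-15 assigns a character to every byte value, so nothing was lost when the UTF-8 bytes were misread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRoundTrip {
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    // Misreads UTF-8 bytes as ISO-8859-15, producing the "unreadable" form.
    static String garble(String readable) {
        return new String(readable.getBytes(StandardCharsets.UTF_8), LATIN9);
    }

    // Undoes garble(): recovers the original bytes, then decodes them as UTF-8.
    static String repair(String unreadable) {
        return new String(unreadable.getBytes(LATIN9), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "äöüÄÖÜß" written with escapes so the source file encoding cannot interfere.
        String readable = "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF";
        String unreadable = garble(readable);

        System.out.println(unreadable.length());                 // 14: two characters per original
        System.out.println(repair(unreadable).equals(readable)); // true
    }
}
```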

Jooce answered Sep 21 '22 14:09


    String s = new String("Üü?öäABC".getBytes(), "ISO-8859-15"); //bug

All this code does is corrupt data. It transcodes UTF-16 data to the system encoding (whatever that is), then takes those bytes, pretends they're valid ISO-8859-15, and transcodes them back to UTF-16.

Then how to convert an input String like "ÃÃŒ?öÀABC" to normal? (if I know that the string is from an ISO8859 file).

The correct way to perform this operation would be like this:

    // requires: import java.nio.charset.Charset;
    byte[] iso8859_15 = { (byte) 0xc3, (byte) 0xc3, (byte) 0xbc, 0x3f,
            (byte) 0xc3, (byte) 0xb6, (byte) 0xc3, (byte) 0xa4, 0x41, 0x42,
            0x43 };
    String utf16 = new String(iso8859_15, Charset.forName("ISO-8859-15"));

Strings in Java are always UTF-16. All other encodings must be represented using the byte type.
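One way to make this concrete is to encode the same String with several charsets and compare the byte counts. This is only an illustrative sketch; the class and the `encodedLength` helper are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingsAreBytes {
    // Hypothetical helper: number of bytes the String occupies in the given encoding.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00DC\u00FC"; // "Üü"

        // The String itself is a sequence of UTF-16 code units.
        System.out.println(text.length()); // 2

        // An encoding only comes into play once the String is turned into bytes.
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // 2
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));      // 4
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE));   // 4
    }
}
```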

Now, if you use System.out to output the resultant string, that might not appear correctly, but that is a different transcoding issue. For example, the Windows console default encoding doesn't match the system encoding. The encoding used by System.out must match the encoding of the device receiving the data. You should also take care to ensure that you are reading your source files with the same encoding your editor is using.
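If the receiving device's encoding is known, one option is to wrap the output stream in a `PrintStream` with an explicit charset. The sketch below assumes the terminal expects UTF-8, which is an assumption about the environment and not something the code can check; the class and the `printStreamFor` helper are hypothetical names.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitOutputEncoding {
    // Hypothetical helper: a PrintStream that encodes characters with the named charset.
    static PrintStream printStreamFor(OutputStream out, String charsetName) {
        try {
            return new PrintStream(out, true, charsetName); // true = autoflush
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the console receiving this output decodes it as UTF-8.
        PrintStream utf8Out = printStreamFor(System.out, "UTF-8");
        // "Üü?öäABC" written with escapes so the source file encoding cannot interfere.
        utf8Out.println("\u00DC\u00FC?\u00F6\u00E4ABC");
    }
}
```

Writing the literal with `\u` escapes also sidesteps the source-file encoding concern mentioned above.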

To understand how treatment of character data varies between languages, read this.

McDowell answered Sep 21 '22 14:09