Convert ISO8859 String to UTF8? ÄÖÜ => ÃÃ why?

Question

What's the problem with this code? I made an ISO8859 String, so most of the ÄÖÜ come out as cryptic output. That's fine. But how do I convert them back to normal characters (UTF-8 or something)?

    String s = new String("Üü?öäABC".getBytes(), "ISO-8859-15");

    System.out.println(s);
    //ÃÃŒ?Ã¶Ã€ABC => ok(?)
    System.out.println(new String(s.getBytes(), "ISO-8859-15"));
    //ÃÂÃÅ?ÃÂ¶Ãâ¬ABC => ok(?)
    System.out.println(new String(s.getBytes(), "UTF-8"));
    //ÃÃŒ?Ã¶Ã€ABC => huh?

Joachim Sauer · Accepted Answer

A construct such as new String("Üü?öäABC".getBytes(), "ISO-8859-15"); is almost always an error.

What you're doing here is taking a String object, getting the corresponding byte[] in the platform default encoding and re-interpreting it as ISO-8859-15 to convert it back to a String.

If the platform default encoding happens to be ISO-8859-15 (or near enough to make no difference for this particular String, for example ISO-8859-1), then it is a no-op (i.e. it has no real effect).

In all other cases it will most likely destroy the String.
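The three cases above can be reproduced without depending on the machine's real default encoding. This is only a sketch: the class `ReinterpretDemo` and the helper `reinterpret` are hypothetical names, and the "platform default" is passed in explicitly as an assumption. The literal is written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReinterpretDemo {
    // Mimics new String(s.getBytes(), target) with an explicit stand-in
    // for the platform default encoding (hypothetical helper).
    static String reinterpret(String s, Charset assumedDefault, Charset target) {
        return new String(s.getBytes(assumedDefault), target);
    }

    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15");
        String original = "\u00DC\u00FC?\u00F6\u00E4ABC"; // "Üü?öäABC"

        // Default is ISO-8859-15: encode and decode cancel out.
        System.out.println(original.equals(reinterpret(original, latin9, latin9))); // true

        // Default is ISO-8859-1: near enough for these particular characters.
        System.out.println(original.equals(
                reinterpret(original, StandardCharsets.ISO_8859_1, latin9))); // true

        // Default is UTF-8: every umlaut becomes two wrong characters.
        System.out.println(original.equals(
                reinterpret(original, StandardCharsets.UTF_8, latin9))); // false
    }
}
```

With UTF-8 as the assumed default, the 8-character input comes back as 13 characters, which is the garbled output shown in the question.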

If you are trying to "fix" a String, then you're probably too late: if you have to use a specific encoding to read data, then you should use it at the point where binary data is converted to String data. For example, if you read from an InputStream, you need to pass the correct encoding to the constructor of the InputStreamReader.

Trying to fix the problem "after the fact" will be harder to do and often not even possible (because decoding a byte[] with the wrong encoding can be a destructive operation).
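The advice about passing the encoding to the InputStreamReader could look roughly like the following. It is a minimal sketch, not the only way to do it: `DecodeAtSource` and `readAll` are hypothetical names, and a ByteArrayInputStream stands in for whatever file or socket the data really comes from.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class DecodeAtSource {
    // Decodes the stream with the charset the data was written in,
    // at the point where bytes become characters (hypothetical helper).
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "Üü?öäABC" as it would be stored in an ISO-8859-15 file.
        byte[] latin9Bytes = { (byte) 0xDC, (byte) 0xFC, 0x3F, (byte) 0xF6,
                               (byte) 0xE4, 0x41, 0x42, 0x43 };
        String text = readAll(new ByteArrayInputStream(latin9Bytes),
                              Charset.forName("ISO-8859-15"));
        System.out.println(text.equals("\u00DC\u00FC?\u00F6\u00E4ABC")); // true
    }
}
```

Because the charset is supplied where the bytes are read, no later repair step is needed.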

Jooce · Answer

I hope this will solve your problem.

String readable = "äöüÄÖÜßáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙñÑ";

try {
    String unreadable = new String(readable.getBytes("UTF-8"), "ISO-8859-15");
    // unreadable -> Ã¤Ã¶Ã¼ÃÃÃÃÃ¡Ã©ÃÃ³ÃºÃÃÃÃÃÃ Ã¨Ã¬Ã²Ã¹ÃÃÃÃÃÃ±Ã
} catch (UnsupportedEncodingException e) {
    // handle error
}

And:

String unreadable = "Ã¤Ã¶Ã¼ÃÃÃÃÃ¡Ã©ÃÃ³ÃºÃÃÃÃÃÃ Ã¨Ã¬Ã²Ã¹ÃÃÃÃÃÃ±Ã";

try {
    String readable = new String(unreadable.getBytes("ISO-8859-15"), "UTF-8");
    // readable -> äöüÄÖÜßáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙñÑ
} catch (UnsupportedEncodingException e) {
    // ...
}
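The two snippets above can be combined into one round trip. The sketch below uses Charset objects instead of charset names, which avoids the checked UnsupportedEncodingException; `RoundTripDemo`, `garble` and `repair` are hypothetical names. It assumes the garbled String still contains every character produced by the wrong decoding, including invisible control characters. Text copied from a console or web page has often lost those, and then the repair no longer works.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    // UTF-8 bytes wrongly decoded as ISO-8859-15 (hypothetical helper).
    static String garble(String readable) {
        return new String(readable.getBytes(StandardCharsets.UTF_8), LATIN9);
    }

    // Undo the wrong decoding: recover the bytes, then decode as UTF-8.
    static String repair(String unreadable) {
        return new String(unreadable.getBytes(LATIN9), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String readable = "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF"; // "äöüÄÖÜß"
        String unreadable = garble(readable);

        System.out.println(unreadable.length());                 // 14
        System.out.println(readable.equals(repair(unreadable))); // true
    }
}
```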

McDowell · Answer

String s = new String("Üü?öäABC".getBytes(), "ISO-8859-15"); //bug

All this code does is corrupt data. It transcodes UTF-16 data to the system encoding (whatever that is) and then takes those bytes, pretends they're valid ISO-8859-15 and transcodes them to UTF-16.

Then how to convert an input String like "ÃÃŒ?Ã¶Ã€ABC" to normal? (if I know that the string is from an ISO8859 file).

The correct way to perform this operation would be like this:

// requires: import java.nio.charset.Charset;
byte[] iso859_15 = { (byte) 0xc3, (byte) 0xc3, (byte) 0xbc, 0x3f,
    (byte) 0xc3, (byte) 0xb6, (byte) 0xc3, (byte) 0xa4, 0x41, 0x42,
    0x43 };
String utf16 = new String(iso859_15, Charset.forName("ISO-8859-15"));

Strings in Java are always UTF-16. All other encodings must be represented using the byte type.
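Since a String is always UTF-16, "converting ISO-8859-15 to UTF-8" in practice means going from one byte[] to another, with a String in between. A minimal sketch of that idea follows; `Transcode` and `latin9ToUtf8` are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Transcode {
    // Decode ISO-8859-15 bytes into a (UTF-16) String, then encode
    // that String as UTF-8 bytes (hypothetical helper).
    static byte[] latin9ToUtf8(byte[] latin9) {
        String text = new String(latin9, Charset.forName("ISO-8859-15"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin9 = { (byte) 0xDC, 0x41 }; // "ÜA" in ISO-8859-15
        byte[] utf8 = latin9ToUtf8(latin9);
        // U+00DC takes two bytes in UTF-8: 0xC3 0x9C
        System.out.println(Arrays.toString(utf8)); // [-61, -100, 65]
    }
}
```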

Now, if you use System.out to output the resultant string, that might not appear correctly, but that is a different transcoding issue. For example, the Windows console default encoding doesn't match the system encoding. The encoding used by System.out must match the encoding of the device receiving the data. You should also take care to ensure that you are reading your source files with the same encoding your editor is using.
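One way to control the output side is to wrap System.out in a PrintStream with an explicit encoding. The sketch below assumes the receiving terminal expects UTF-8, which may not be true on your system. `ConsoleEncoding` and `printed` are hypothetical names, and `printed` exists only to show which bytes a given encoding would send to the device.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ConsoleEncoding {
    // Returns the bytes a PrintStream with the given encoding would
    // write for this text (hypothetical helper for inspection).
    static byte[] printed(String text, String encodingName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, encodingName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the console receiving the data decodes UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("\u00DC\u00FC?\u00F6\u00E4ABC");

        // The same character needs different bytes per encoding.
        System.out.println(printed("\u00DC", "UTF-8").length);       // 2
        System.out.println(printed("\u00DC", "ISO-8859-15").length); // 1
    }
}
```

If the console uses a different encoding than the one chosen here, the output will look garbled even though the String itself is correct.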

To understand how treatment of character data varies between languages, read this.

Tags: java, string, character-encoding, unicode

Lissy
