
getBytes() With UTF-8 Doesn't Work for Upper-Case German Umlauts

For development I'm using ResourceBundle to read a UTF-8 encoded properties file directly from the resources directory in my IDE (I set the encoding in Eclipse's file properties for that file; native2ascii is applied on the way to production), e.g.:

menu.file.open.label=&Öffnen...
label.btn.add.name=&Hinzufügen
label.btn.remove.name=&Löschen

Since that causes character-encoding issues with non-ASCII characters, I thought I'd be happy with:

ResourceBundle resourceBundle = ResourceBundle.getBundle("messages", Locale.getDefault());
String value = resourceBundle.getString(key);
value = new String(value.getBytes(), "UTF-8");

Well, it works nicely for lower-case German umlauts, but not for the upper-case ones, and the ß doesn't work either. Here's the value read with getString(key), followed by the value after the conversion with new String(value.getBytes(), "UTF-8"):

&LÃ¶schen => &Löschen
&HinzufÃ¼gen => &Hinzufügen

&Ã?ber => &??ber
&SchlieÃ?en => &Schlie??en
&Ã?ffnen... => &??ffnen...

The last three should be:

&Ã?ber => &Über
&SchlieÃ?en => &Schließen
&Ã?ffnen... => &Öffnen...

I guess that I'm not too far away from the truth, but what am I missing here?

Google found something similar, but that remained unanswered.

EDIT: a little more code

sjngm asked Sep 03 '12 19:09


3 Answers

The problem is you're calling String.getBytes() without specifying an encoding - which will use the default platform encoding. You're then using the binary result of that operation as if it were in UTF-8.

If you use UTF-8 in both directions, it'll be fine:

// Should be a round-trip
value = new String(value.getBytes("UTF-8"), "UTF-8");

... but if you were trying to use this to read a UTF-8-encoded property file without telling the code which is performing the initial read, that won't work.
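To see why the conversion in the question helps lower-case umlauts but not upper-case ones, here is a minimal sketch. It rests on two assumptions that are not confirmed by the question: that the bundle loader decoded the UTF-8 file as ISO-8859-1, and that the platform default encoding was windows-1252. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Assumption: the loader decoded the UTF-8 bytes of the file as ISO-8859-1.
    static String misread(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The conversion from the question, with the platform default
    // assumed to be windows-1252 (named explicitly so the demo is portable).
    static String brokenFix(String misread) {
        Charset assumedDefault = Charset.forName("windows-1252");
        return new String(misread.getBytes(assumedDefault), StandardCharsets.UTF_8);
    }

    // Exact reversal of the assumed misread: ISO-8859-1 maps every
    // char up to U+00FF back to the byte it came from.
    static String reverseMisread(String misread) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of the compiler's encoding.
        String[] words = {"L\u00F6schen", "\u00DCber", "Schlie\u00DFen"};
        for (String word : words) {
            String m = misread(word);
            System.out.println(word
                    + " brokenFix ok: " + word.equals(brokenFix(m))
                    + ", reverseMisread ok: " + word.equals(reverseMisread(m)));
        }
    }
}
```

Under these assumptions, the second UTF-8 byte of ö and ü (0xB6, 0xBC) decodes to a character that windows-1252 can encode back to the same byte, so the data survives by accident. The second byte of Ö, Ü and ß (0x96, 0x9C, 0x9F) decodes to a control character that windows-1252 cannot encode, so getBytes() substitutes a question mark and the original byte is gone. Reversing with ISO-8859-1 only works if the assumed misread is what actually happened; it is still a workaround, not a replacement for loading the bundle correctly.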

The code you've presented is basically always the wrong approach. Your "Since that causes issues with the character encoding" suggests that you'd already run across an earlier problem - so I'd go back to that, instead of trying to apply a broken fix. If you've already lost data when constructing the ResourceBundle, it's too late to go back later... you need to make sure the ResourceBundle itself is loaded correctly.
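One way to load the bundle correctly during development is to decode the file yourself and hand a Reader to PropertyResourceBundle (a constructor available since Java 6). This is only a sketch: the class name Utf8BundleLoader is invented here, and the in-memory byte array stands in for the real properties file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class Utf8BundleLoader {
    // Decodes the stream as UTF-8 before the bundle parses it,
    // so no re-encoding of the values is needed afterwards.
    static ResourceBundle load(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return new PropertyResourceBundle(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 encoded properties file on disk.
        byte[] file = "menu.file.open.label=&\u00D6ffnen...\n".getBytes(StandardCharsets.UTF_8);
        ResourceBundle bundle = load(new ByteArrayInputStream(file));
        System.out.println(bundle.getString("menu.file.open.label"));
    }
}
```

In an application the stream would typically come from something like getResourceAsStream on the properties file. Note that this bypasses the locale fallback that ResourceBundle.getBundle performs, so it suits a single development-time file better than a full set of localized bundles.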

Please tell us exactly what problems you had with the ResourceBundle, and we can see if we can fix the root cause.

EDIT: It's not clear how you're running native2ascii. The fix may be as simple as changing to use:

native2ascii -encoding UTF-8 input.properties output.properties
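
As a rough illustration of why the converted file works in production: native2ascii replaces non-ASCII characters with Unicode escapes, and Properties.load(InputStream) decodes those escapes while reading the bytes as ISO-8859-1. The sketch below assumes the converted line looks like the string in main; the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class EscapedPropertiesDemo {
    // Loads properties from ASCII-only content; load(InputStream) reads the
    // bytes as ISO-8859-1 and turns Unicode escapes back into characters.
    static Properties loadEscaped(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileContent.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape sequence literal in the file content.
        String converted = "menu.file.open.label=&\\u00d6ffnen...\n";
        Properties props = loadEscaped(converted);
        System.out.println(props.getProperty("menu.file.open.label"));
    }
}
```

Because the converted file contains only ASCII bytes, it reads the same under ISO-8859-1, UTF-8 and windows-1252, which is why it is not sensitive to the platform default.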

Jon Skeet answered Sep 18 '22 12:09


Some notes:

  • A Java String is always UTF-16 internally. If its contents are wrong, it was corrupted when it was decoded, and it is too late to fix.
  • new String(value.getBytes(), "UTF-8"); - this code will (at best) do nothing on a system that uses UTF-8 as the default encoding; otherwise it will corrupt the string.
  • .properties files must be ISO 8859-1 (the Properties type supports other formats and encodings, but I don't know how you would tell ResourceBundle that.)
  • System.out can introduce its own transcoding bugs (the PrintStream encodes UTF-16 strings to the default encoding; the receiving device must decode the bytes using the same encoding.)
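
The last note can be made concrete with a small sketch that captures what a PrintStream actually writes for the same String under different encodings. The class and method names are invented for the example; a ByteArrayOutputStream stands in for the console.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintStreamEncodingDemo {
    // Returns the bytes a PrintStream emits for the text in the named encoding.
    static byte[] printAs(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00DCber";
        for (String cs : new String[] {"UTF-8", "ISO-8859-1", "US-ASCII"}) {
            byte[] bytes = printAs(text, cs);
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(cs + ": " + hex.toString().trim());
        }
    }
}
```

The String is identical in every case; only the bytes differ. If the console decodes with a different encoding than the PrintStream used, a correct String can still be displayed incorrectly, so printed output alone does not prove the String is corrupt.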

I suspect you are trying to fix your problems in the wrong place.

McDowell answered Sep 21 '22 12:09


You are encoding the text with a different encoding to the one you are decoding with.

Try instead using the same character set for encoding and decoding.

value = new String(value.getBytes("UTF-8"), "UTF-8");

String s = "ßßßßß";
s += s.toUpperCase();
s = new String(s.getBytes("UTF-8"), "UTF-8");
System.out.println(s);

prints

ßßßßßSSSSSSSSSS

Peter Lawrey answered Sep 20 '22 12:09