Some legacy code relies on the platform's default charset for translations. For Windows and Linux installations in the "western world" I know what that means. But thinking about Russian or Asian platforms I am totally unsure what their platform's default charset is (just UTF-16?).
Therefore I would like to know what I would get when executing the following code line:
System.out.println("Default Charset=" + Charset.defaultCharset());
PS:
I don't want to discuss the problems of charsets and their difference to Unicode here. I just want to collect what operating systems will result in what specific charset. Please post only concrete values!
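For reference, here is the quoted line as a minimal complete program (the class name `DefaultCharsetCheck` is just illustrative):

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // Prints the JVM's default charset, e.g. "UTF-8" or "windows-1252"
        System.out.println("Default Charset=" + Charset.defaultCharset());
    }
}
```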
The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.
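To illustrate with a small example of my own: since a char is one UTF-16 code unit, a character outside the Basic Multilingual Plane occupies two chars (a surrogate pair):

```java
public class Utf16Units {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600 GRINNING FACE as a surrogate pair
        System.out.println(s.length());                      // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 Unicode code point
    }
}
```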
The default character encoding for Android is UTF-8, as specified in the Javadoc of the Charset class.
That's a user-specific setting. On many modern Linux systems it's UTF-8. On older Macs it's MacRoman (recent JVMs on macOS default to UTF-8). On Windows in the US and Western Europe it's typically Cp1252 (windows-1252); in Central and Eastern Europe it's Cp1250. In China you often find GBK or GB18030 for simplified Chinese, and Big5 for traditional Chinese.
But that's only the system default, and each user can change it at any time. Which suggests a solution: set the encoding explicitly when you start your app, using the system property file.encoding.
See this answer for how to do that. I suggest putting it into a small script that starts your app, so the user's default setting isn't relied on.
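A sketch of that approach (the class name is illustrative): pass -Dfile.encoding on the java command line, before the JVM starts, because the default charset is fixed at startup:

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Launch with an explicit encoding, e.g.:
    //   java -Dfile.encoding=UTF-8 EncodingReport
    public static void main(String[] args) {
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
    }
}
```

Note that calling System.setProperty("file.encoding", ...) at runtime is too late: Charset.defaultCharset() reads the value once during JVM startup.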
For Windows and Linux installations in the "western world" I know what that means.
Probably not as well as you think.
But thinking about Russian or Asian platforms I am totally unsure what their platform's default charset is
Usually it's whatever encoding is historically used in their country.
(just UTF-16?).
Most definitely not. Computer usage spread widely before the Unicode standard existed, and each language area developed one or more encodings that could support its language. Those who needed less than 128 characters outside ASCII typically developed an "extended ASCII", many of which were eventually standardized as ISO-8859, while others developed two-byte encodings, often several competing ones. For example, in Japan, emails typically use JIS, but webpages use Shift-JIS, and some applications use EUC-JP. Any of these might be encountered as the platform default encoding in Java.
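As a sketch of how these Japanese encodings differ (assuming the JDK's extended charsets Shift_JIS, ISO-2022-JP, and EUC-JP are available, as they are in a standard JDK):

```java
public class JapaneseEncodings {
    public static void main(String[] args) throws Exception {
        String text = "文字"; // "characters" in Japanese
        byte[] sjis = text.getBytes("Shift_JIS");
        byte[] jis  = text.getBytes("ISO-2022-JP"); // the "JIS" used in email
        System.out.println("Shift_JIS:   " + sjis.length + " bytes");
        System.out.println("ISO-2022-JP: " + jis.length + " bytes (includes escape sequences)");
        // Round-tripping with the right charset restores the text...
        System.out.println(new String(sjis, "Shift_JIS").equals(text)); // true
        // ...but decoding the same bytes with the wrong charset produces mojibake
        System.out.println(new String(sjis, "EUC-JP").equals(text));    // false
    }
}
```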
It's all a huge mess, which is exactly why Unicode was developed. But the mess has not yet disappeared; we still have to deal with it, and we should make no assumptions about the encoding of a given bunch of bytes that are to be interpreted as text. There Ain't No Such Thing as Plain Text.