How does the platform default character encoding affect cross-platform performance?

I have read that it's a bad idea to use the platform default character encoding, for example when reading a text file and importing the text into arrays. Could you explain how that could affect cross-platform performance, and how to get past that problem? Is there an encoding that should be used for cross-platform applications? Thanks

Giannis asked Apr 07 '11 13:04


3 Answers

It's not about performance, but about reading and displaying text with the correct encoding. There are a number of ways to cope with the problem:

  • set the JVM option -Dfile.encoding=UTF-8
  • always use the overloads that take a character encoding parameter. String, the Reader and Writer implementations, and other classes provide them.

I think the latter is a must. If you always set the JVM option, it will work, but if you forget to set it at some point, there will be unexpected failures in random places.
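A minimal sketch of the second option, using only JDK classes. The class name `ExplicitCharset` and its helper methods are made up for illustration. Every conversion between bytes and characters names UTF-8, so the result should not depend on `file.encoding`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
public class ExplicitCharset {

    // Characters -> bytes through a Writer that is told which charset to use.
    public static byte[] write(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Bytes -> characters through a Reader that is told which charset to use.
    public static String read(byte[] data) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "äöü €", written with escapes so the source file's own encoding doesn't matter
        String original = "\u00e4\u00f6\u00fc \u20ac";
        byte[] viaWriter = write(original);
        // The String overloads that take a charset
        byte[] viaString = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(viaWriter, viaString));
        System.out.println(original.equals(read(viaWriter)));
        System.out.println(original.equals(new String(viaString, StandardCharsets.UTF_8)));
    }
}
```

The same idea applies to file I/O: pass the charset to whatever constructor or factory method creates the reader or writer instead of relying on the no-argument variant.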

As for your other question: stick to UTF-8.

See also this question.

Bozho answered Nov 05 '22 11:11


Usually it's no problem if the files you read and write are not exchanged between platforms. But if you have, e.g., a configuration file created on Windows (Windows-1252, similar to ISO-8859-1) and then start your app on a recent Linux (UTF-8), nearly all characters above 127 (such as the German umlauts ä, ö, ü or the € sign) will be read incorrectly.

In this case, pick one encoding, always specify it explicitly, and stick with it. If your files contain only plain ASCII (no extended Latin characters!), you won't run into this problem.
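To see the effect without two machines, here is a small sketch that encodes a string with one charset and decodes it with another, which is roughly what happens when the writer and the reader each rely on a different platform default. `MismatchDemo` and `transcodeWrongly` are names invented for this example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class MismatchDemo {

    public static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Encodes with one charset and decodes the resulting bytes with another.
    public static String transcodeWrongly(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String value = "f\u00fcr"; // "für"

        // Written as Windows-1252, read as UTF-8: the byte FC is not valid
        // UTF-8, so the decoder substitutes the replacement character U+FFFD.
        System.out.println(transcodeWrongly(value, WINDOWS_1252, StandardCharsets.UTF_8));

        // Written as UTF-8, read as Windows-1252: the two bytes of "ü"
        // turn into two separate characters.
        System.out.println(transcodeWrongly(value, StandardCharsets.UTF_8, WINDOWS_1252));

        // Same charset on both sides: the text survives.
        System.out.println(transcodeWrongly(value, StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Plain ASCII is encoded identically by both charsets, so it survives too.
        System.out.println(transcodeWrongly("plain", WINDOWS_1252, StandardCharsets.UTF_8));
    }
}
```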

Daniel answered Nov 05 '22 10:11


The default encoding varies from OS to OS and even between users on the same machine in the case of some multilingual installs. This means that the bytes an application writes for the same character data will vary, and the data will be unreadable or appear corrupt if it is read back using a different default encoding. The Euro character (€) will encode as the byte 80 under windows-1252, as A4 under ISO-8859-15 and as the bytes E2 82 AC under UTF-8.
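The byte values for the Euro sign can be checked with a few lines of Java. This is only a sketch; `EuroBytes` and `hex` are names chosen for the example:

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of any library.
public class EuroBytes {

    // Returns the encoded form of the text as space-separated uppercase hex bytes.
    public static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // the Euro sign
        for (String name : new String[] {"windows-1252", "ISO-8859-15", "UTF-8"}) {
            System.out.println(name + ": " + hex(euro, name));
        }
        // windows-1252: 80
        // ISO-8859-15: A4
        // UTF-8: E2 82 AC
    }
}
```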

Legacy encodings can cause data loss since many of them only support a narrow range of code points.
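As an illustration of that data loss, ISO-8859-1 has no Euro sign, so encoding one silently substitutes a question mark. A rough sketch, with `LossyEncoding` and its methods being illustrative names only:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class LossyEncoding {

    // Encodes to ISO-8859-1 and decodes again; unmappable characters are lost.
    public static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reports up front whether every character can be represented in ISO-8859-1.
    public static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String price = "5 \u20ac"; // "5 €"
        System.out.println(fitsLatin1(price));      // false
        System.out.println(roundTripLatin1(price)); // 5 ?
        System.out.println(fitsLatin1("5 EUR"));    // true
    }
}
```

`String.getBytes` replaces unmappable characters without complaint, so checking with `canEncode` first (or using a `CharsetEncoder` configured to report errors) is one way to notice the loss before it happens.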

The only supported way to change the default encoding is to change it in the operating system.
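If you want to see which default a given JVM actually picked up, something like the following sketch prints it. `DefaultCharsetInfo` and `describe` are made-up names, and the output will differ from machine to machine:

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of any library.
public class DefaultCharsetInfo {

    // Describes the charset the JVM uses when no charset is passed explicitly.
    public static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```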

It is generally better to be explicit in choosing encodings and to prefer a lossless Unicode encoding (usually UTF-8). The decision to make "ANSI" encodings the default on Windows, for example, made more sense when supporting Windows 95.

McDowell answered Nov 05 '22 10:11
