I recently realized that I don't fully understand Java's string encoding process.
Consider the following code:
public class Main
{
    public static void main(String[] args)
    {
        System.out.println(java.nio.charset.Charset.defaultCharset().name());
        System.out.println("ack char: ^"); /* where ^ = 0x06, the ack char */
    }
}
Since the control characters are interpreted differently between windows-1252 and ISO-8859-1, I chose the ack
char for testing.
I now compile it with different file encodings: UTF-8, windows-1252, and ISO-8859-1. They all compile to the exact same class file, byte for byte, as verified by md5sum.
I then run the program:
$ java Main | hexdump -C
00000000 55 54 46 2d 38 0a 61 63 6b 20 63 68 61 72 3a 20 |UTF-8.ack char: |
00000010 06 0a |..|
00000012
$ java -Dfile.encoding=iso-8859-1 Main | hexdump -C
00000000 49 53 4f 2d 38 38 35 39 2d 31 0a 61 63 6b 20 63 |ISO-8859-1.ack c|
00000010 68 61 72 3a 20 06 0a |har: ..|
00000017
$ java -Dfile.encoding=windows-1252 Main | hexdump -C
00000000 77 69 6e 64 6f 77 73 2d 31 32 35 32 0a 61 63 6b |windows-1252.ack|
00000010 20 63 68 61 72 3a 20 06 0a | char: ..|
00000019
It correctly outputs the 0x06 byte no matter which encoding is being used — the same 0x06 that a windows-1252 code page would interpret as the printable [ACK] character.
That leads me to a few questions:
Java always stores strings internally as UTF-16. The constructor String(byte[], Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset).
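A small sketch of what "stored as UTF-16" means in practice (the sample string is just illustrative):

```java
public class Utf16Demo {
    public static void main(String[] args) {
        String s = "A😀"; // 'A' (U+0041) plus GRINNING FACE (U+1F600)
        // length() counts 16-bit code units, not characters:
        System.out.println(s.length());                       // 3: one unit for 'A', a surrogate pair for the emoji
        // codePointCount() counts actual Unicode code points:
        System.out.println(s.codePointCount(0, s.length()));  // 2
    }
}
```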
Strings are immutable in Java, which means we cannot change a String's character encoding in place. To achieve re-encoding, we need to get the bytes of the String in some charset and then construct a new String from them with the desired encoding.
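A minimal sketch of that copy-and-reconstruct round trip (the variable names are made up; the charsets are the standard constants):

```java
import java.nio.charset.StandardCharsets;

public class ReencodeDemo {
    public static void main(String[] args) {
        String original = "café";
        // Extract the string's content as bytes in a given encoding...
        byte[] utf8Bytes   = original.getBytes(StandardCharsets.UTF_8);      // 5 bytes: 'é' takes two
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1); // 4 bytes: 'é' is 0xE9
        // ...and build a new (immutable) String back, naming the charset explicitly:
        String roundTrip = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(original));                      // true
        System.out.println(utf8Bytes.length + " vs " + latin1Bytes.length);  // 5 vs 4
    }
}
```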
UTF-8 is a variable-width character encoding that uses between one and four eight-bit bytes to represent all valid Unicode code points. A code point can represent a single character, but may also have other meanings, such as formatting.
In Java, when we deal with a String, it is sometimes necessary to encode it in a specific character set. Encoding is the process of converting data from one representation to another.
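The "one to four bytes" claim can be checked directly with getBytes; each sample below falls in a different UTF-8 width class:

```java
import java.nio.charset.StandardCharsets;

public class Utf8WidthDemo {
    public static void main(String[] args) {
        // One code point from each UTF-8 width class:
        String[] samples = { "A",    // U+0041, 1 byte
                             "é",    // U+00E9, 2 bytes
                             "€",    // U+20AC, 3 bytes
                             "😀" }; // U+1F600, 4 bytes
        for (String s : samples) {
            System.out.println(s + " -> " + s.getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        }
    }
}
```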
Your source code is compiled using the character set given to the compiler (javac -encoding ...); otherwise, the platform encoding is assumed. System.out is a PrintStream that will transform your strings from UTF-16 to bytes in the system encoding prior to writing them to stdout. Note: -Dfile.encoding overrides the JVM's default charset, which is why the experiment above prints a different charset name each time.
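Rather than relying on -Dfile.encoding, you can wrap stdout in a PrintStream with an explicit charset; a minimal sketch (the class name is made up):

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitStdout {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Bypass the platform default: this PrintStream always encodes as UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("ack char: \u0006"); // emits the 0x06 byte regardless of locale
    }
}
```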
A summary of "what to know" about string encodings in Java:

- A String instance, in memory, is a sequence of 16-bit "code units", which Java handles as char values. Conceptually, those code units encode a sequence of "code points", where a code point is "the number attributed to a given character as per the Unicode standard". Code points range from 0 to a bit more than one million, although only 100 thousand or so have been defined so far. Code points from 0 to 65535 are encoded into a single code unit, while other code points use two code units. This process is called UTF-16 (UCS-2 is the older variant limited to the first 65536 code points). There are a few subtleties: some code points are invalid (e.g. 65535), and there is a range of 2048 code points in the first 65536 (the "surrogates") reserved precisely for the encoding of the other code points.
- When you print a String with System.out.println(), the JVM will convert the string into something suitable for wherever those characters go, which often means converting them to bytes using a charset which depends on the current locale (or on what the JVM guessed of the current locale).
- The Java compiler (javac) accepts a command-line flag (-encoding) which can be used to override that default choice.
- While String instances do not depend on any kind of encoding as long as they remain in RAM, some of the operations you may want to perform on strings are locale-dependent. This is not a question of encoding; but a locale also defines a "language", and it so happens that the notions of uppercase and lowercase depend on the language which is used. The Usual Suspect is calling "unicode".toUpperCase(): this yields "UNICODE", except if the current locale is Turkish, in which case you get "UNİCODE" (the "I" has a dot). The basic assumption here is that if the current locale is Turkish then the data the application is managing is probably Turkish text; personally, I find this assumption at best questionable. But so it is.
- In practical terms, you should specify encodings explicitly in your code, at least most of the time. Do not call String.getBytes(); call String.getBytes("UTF-8"). Use of the default, locale-dependent encoding is fine when it is applied to data exchanged with the user, such as a configuration file or a message to display immediately; but elsewhere, avoid locale-dependent methods whenever possible.
Among other locale-dependent parts of Java, there are calendars. There is the whole time zone business, which depends on the time zone, which should relate to the geographical position of the computer (and this is not part of the "locale" stricto sensu...). Also, countless Java applications mysteriously fail when run in Bangkok, because in a Thai locale Java defaults to the Buddhist calendar, according to which the current year is 2553.
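A sketch of the calendar effect; whether you actually get the Buddhist calendar depends on your JDK's locale data, so the code only prints the two years side by side:

```java
import java.util.Calendar;
import java.util.Locale;

public class ThaiCalendarDemo {
    public static void main(String[] args) {
        int usYear = Calendar.getInstance(Locale.US).get(Calendar.YEAR);
        // In a Thai locale the default calendar may be Buddhist
        // (Gregorian year + 543), depending on the JDK's locale data:
        int thYear = Calendar.getInstance(new Locale("th", "TH")).get(Calendar.YEAR);
        System.out.println(usYear + " vs " + thYear);
    }
}
```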
As a rule of thumb, assume that the World is vast (it is!) and keep things generic: do not do anything which depends on a charset until the very last moment, when I/O must actually be performed.