
UTF-8 and UTF-16 in Java

I really expected the byte data below to look different, but in fact they are the same. According to Wikipedia (http://en.wikipedia.org/wiki/UTF-8#Examples), the two encodings produce different bytes, so why does Java print them out as the same?

    String a = "€";
    byte[] utf16 = a.getBytes(); //Java default UTF-16
    byte[] utf8 = null;

    try {
        utf8 = a.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        throw new RuntimeException(e);
    }

    for (int i = 0; i < utf16.length; i++) {
        System.out.println("utf16 = " + utf16[i]);
    }

    for (int i = 0; i < utf8.length; i++) {
        System.out.println("utf8 = " + utf8[i]);
    }
asked Oct 18 '12 by Sam YC

3 Answers

Although Java holds characters internally as UTF-16, when you convert a string to bytes using String.getBytes(), each character is converted using the platform's default encoding, which could be something like windows-1252. The results I'm getting are:

utf16 = -30
utf16 = -126
utf16 = -84
utf8 = -30
utf8 = -126
utf8 = -84

This indicates that the default encoding is "UTF-8" on my system.

Also note that the documentation for String.getBytes() has this comment: "The behavior of this method when this string cannot be encoded in the default charset is unspecified."

Generally, though, you'll avoid confusion if you always specify an encoding, as you do with a.getBytes("UTF-8").
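For instance, here is a minimal sketch (class name illustrative) that encodes the euro sign with two explicit charsets via java.nio.charset.StandardCharsets (Java 7+), so the result no longer depends on the platform default and no checked exception needs handling:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EuroBytes {
    public static void main(String[] args) {
        // U+20AC written as an escape, so the source-file encoding can't mangle it
        String a = "\u20AC";
        byte[] utf8 = a.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = a.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(utf8));  // [-30, -126, -84], i.e. E2 82 AC
        System.out.println(Arrays.toString(utf16)); // [32, -84], i.e. 20 AC
    }
}
```

With the Charset overloads there is no UnsupportedEncodingException to catch, unlike the string-named `getBytes("UTF-8")` variant.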

Also, another thing that can cause confusion is including Unicode characters directly in your source file: String a = "€";. That euro symbol has to be encoded to be stored as one or more bytes in a file. When Java compiles your program, it sees those bytes and decodes them back into the euro symbol. You hope. You have to be sure that the software that saves the euro symbol into the file (Notepad, Eclipse, etc.) encodes it the same way Java expects when it reads it back in. UTF-8 is becoming more popular, but it is not universal, and many editors will not write files in UTF-8.
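To see the kind of damage such a mismatch does, here is a small sketch (class name illustrative) that decodes UTF-8 bytes as if they were ISO-8859-1, roughly what happens when the compiler's assumed encoding differs from the editor's:

```java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        // The three UTF-8 bytes of the euro sign: E2 82 AC
        byte[] utf8Euro = "\u20AC".getBytes(StandardCharsets.UTF_8);
        // Decoding them with the wrong charset yields three junk characters
        String wrong = new String(utf8Euro, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 3, instead of a single euro character
    }
}
```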

answered Sep 18 '22 by Adrian Pronk

One curiosity: I wonder how the JVM knows the original default charset ...

The mechanism that the JVM uses to determine the initial default charset is platform specific. On UNIX / UNIX-like systems, it is determined by the LANG and LC_* environment variables; see man locale.
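You can ask the running JVM what it settled on; a small sketch (class name illustrative):

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // Derived from the platform locale (LANG / LC_* on Unix-like systems)
        System.out.println(Charset.defaultCharset());
        // The related system property, captured at JVM startup
        System.out.println(System.getProperty("file.encoding"));
    }
}
```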


Ermmm.. This command is used to check what the default charset is on a specific OS?

That is correct. But I mentioned it because the manual entry describes how the default encoding is determined by the environment variables.

In retrospect, this may not have been what you meant by your original comment, but this IS how the platform default encoding is specified. (And the concept of a "default character set" for an individual file is meaningless; see below.)

What if, let's say, I have 10 Java source files, half of them saved as UTF-8 and the rest saved as UTF-16? After compiling, I move them (the class files) onto another OS platform; now how does the JVM know their default encoding? Will the default charset information be included in the Java class file?

That is a rather confused set of questions:

  1. A text file doesn't have a default character set. It has a character set / encoding.

  2. A non-text file doesn't have a character encoding at all. The concept is meaningless.

  3. There's no 100% reliable way to determine what a text file's character encoding is.

  4. If you don't tell the java compiler what the file's encoding is, it will assume that it is the platform's default encoding. The compiler doesn't try to second guess you. If you get the encoding incorrect, the compiler may or may not even notice your mistake.

  5. Bytecode (".class") files are binary files (see 2).

  6. When Character and String literals are compiled into a ".class" file, they are now represented in a way that is not affected by the platform default encoding, or anything else that you can influence.

  7. If you made a mistake with the source file encoding when compiling, you can't fix it at the ".class" file level. Your only option is to go back and recompile the classes, telling the Java compiler the correct source file encoding.

  8. "What if, let's say, I have 10 Java source files, half of them saved as UTF-8 and the rest saved as UTF-16."
    Just don't do it!

    • Don't save your source files in a mixture of encodings. You will drive yourself nuts.
    • I can't think of a good reason to store source files in UTF-16 at all ...

So, I am confused: when people say "platform dependent", is it related to the source file?

Platform dependent means that it potentially depends on the operating system, the JVM vendor and version, the hardware, and so on.

It is not necessarily related to the source file. (The encoding of any given source file could be different to the default character encoding.)

If it is not, how do I explain the phenomenon above? Anyway, the confusion above extends my question into: "So, what happens after I compile the source file into a class file? Since the class file might not contain the encoding information, is the result now really dependent on the 'platform' and not on the source file anymore?"

The platform specific mechanism (e.g. the environment variables) determine what the java compiler sees as the default character set. Unless you override this (e.g. by providing options to the java compiler on the command line), that is what the Java compiler will use as the source file character set. However, this may not be the correct character encoding for the source files; e.g. if you created them on a different machine with a different default character set. And if the java compiler uses the wrong character set to decode your source files, it is liable to put incorrect character codes into the ".class" files.
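Here is a sketch of how a wrong decode becomes unrecoverable (class name illustrative), standing in for a compiler that read the source with the wrong charset and baked the result into a ".class" file:

```java
import java.nio.charset.StandardCharsets;

public class BakedInMistake {
    public static void main(String[] args) {
        // The bytes on disk: a euro sign saved as UTF-8
        byte[] onDisk = "\u20AC".getBytes(StandardCharsets.UTF_8);
        // The "compiler" decodes them with the wrong charset and stores the result
        String stored = new String(onDisk, StandardCharsets.ISO_8859_1);
        // No later step can recover the euro sign from the stored characters
        System.out.println(stored.equals("\u20AC")); // false
    }
}
```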

The ".class" files are not platform dependent. But if they were created incorrectly because you didn't tell the Java compiler the correct encoding for the source files, the ".class" files will contain the wrong characters.


What do you mean by "the concept of a 'default character set' for an individual file is meaningless"?

I say it because it is true!

The default character set MEANS the character set that is used when you don't specify one.

But we can control how we want a text file to be stored, right? Even in Notepad, there is an option to choose the encoding.

That is correct. And that is you TELLING Notepad what character set to use for the file. If you don't TELL it, Notepad will use the default character set to write the file.

There is a little bit of black magic in Notepad to GUESS what the character encoding is when it reads a text file. Basically, it looks at the first few bytes of the file to see if it starts with a byte-order mark. If it sees one, it can heuristically distinguish between UTF-16, UTF-8 (as generated by a Microsoft product), and "other". But it cannot distinguish between the different "other" character encodings, and it doesn't recognize as UTF-8 a file that doesn't start with a BOM. (The BOM on a UTF-8 file is a Microsoft-specific convention ... and causes problems if a Java application reads the file and doesn't know to skip the BOM character.)
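A small sketch (names illustrative) of skipping a UTF-8 BOM, which decodes as U+FEFF, before handing the text to the rest of a Java application:

```java
import java.nio.charset.StandardCharsets;

public class BomSkip {
    // A UTF-8 BOM is the byte sequence EF BB BF, the UTF-8 encoding of U+FEFF
    static String decodeSkippingBom(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decodeSkippingBom(withBom)); // hi
    }
}
```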

Anyway, the problems are not in writing the source file. They happen when the Java compiler reads the source file with the incorrect character encoding.

answered Sep 18 '22 by Stephen C


You are working with a bad hypothesis. The getBytes() method does not use the UTF-16 encoding. It uses the platform default encoding.

You can query it with the java.nio.charset.Charset.defaultCharset() method. In my case, it's UTF-8 and should be the same for you too.
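A quick sketch (class name illustrative) confirming that the no-argument getBytes() is just shorthand for passing the default charset explicitly:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class GetBytesDefault {
    public static void main(String[] args) {
        String a = "\u20AC";
        // The no-argument overload encodes with Charset.defaultCharset()
        boolean same = Arrays.equals(a.getBytes(), a.getBytes(Charset.defaultCharset()));
        System.out.println(same); // true
    }
}
```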

answered Sep 22 '22 by gawi