
How does Java fit a 3 byte Unicode character into a char type?

So a 'char' in Java is 2 bytes (this can be verified in the Java Language Specification).

I have this sample code:

public class FooBar {
    public static void main(String[] args) {
        String foo = "€";
        System.out.println(foo.getBytes().length);
        final char[] chars = foo.toCharArray();
        System.out.println(chars[0]);
    }
}

And the output is as follows:

3
€

My question is, how did Java fit a 3-byte character into a char data type? BTW, I am running the application with the JVM argument -Dfile.encoding=UTF-8.

Also if I edit the code a little further and add the following statements:

// requires: import java.io.DataOutputStream; import java.io.File; import java.io.FileOutputStream;
// (main must also declare "throws IOException" or this code must be wrapped in try/catch)
File baz = new File("baz.txt");
final DataOutputStream dataOutputStream = new DataOutputStream(new FileOutputStream(baz));
dataOutputStream.writeChar(chars[0]);
dataOutputStream.flush();
dataOutputStream.close();

the final file "baz.txt" will only be 2 bytes, and it will not show the correct character even if I treat it as a UTF-8 file.

Edit 2: If I open the file "baz.txt" with encoding UTF-16 BE, I will see the € character just fine in my text editor, which makes sense I guess.
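For illustration (not part of the original question), here is a minimal sketch that prints the bytes writeChar actually emits; the class name WriteCharDemo is made up:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteCharDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeChar('€');   // writes the UTF-16 code unit U+20AC, high byte first
        out.flush();
        for (byte b : buffer.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);   // prints: 20 AC  (UTF-16 big-endian)
        }
        System.out.println();
    }
}

That matches the 2-byte file content and explains why it reads correctly as UTF-16 BE.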

asked Jan 21 '16 by Koray Tugay


People also ask

Why does Java use 2 bytes for char?

Every char is made up of 2 bytes because Java internally uses UTF-16. For instance, if a String contains a word in the English language, the leading 8 bits of every char will be 0, because an ASCII character can be represented using a single byte.
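A quick way to see this (an illustrative snippet you could paste into a main method or jshell; the variable names are made up):

char a = 'A';
System.out.printf("%04X%n", (int) a);    // 0041 -- high byte is zero for an ASCII letter
char euro = '€';
System.out.printf("%04X%n", (int) euro); // 20AC -- both bytes of the code unit are used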

How does Unicode work in Java?

Unicode escapes can be used anywhere in Java source code, and identifiers may contain Unicode letters and digits. You may use Unicode in comments, identifiers, character and string literals, and other program text. Note, however, that Unicode escapes are interpreted by the compiler very early, before the rest of the source is processed.
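As a small illustration (a hypothetical snippet, assuming the source file itself is saved as UTF-8):

String café = "\u20AC";    // Unicode letter in an identifier, Unicode escape in a string literal
System.out.println(café);  // prints €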

How many bytes does Unicode use per character?

It depends on the encoding form. In UTF-8 a character takes one to four bytes; in UTF-16 it takes two bytes, or four for a character outside the Basic Multilingual Plane (stored as a surrogate pair); in UTF-32 every character takes four bytes. The notation U+hhhh (for example U+20AC for €) denotes a character's hexadecimal code point, independent of any particular encoding.
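For example, the UTF-8 byte count varies per character (an illustrative snippet; the emoji is just one example of a character outside the Basic Multilingual Plane):

java.nio.charset.Charset utf8 = java.nio.charset.StandardCharsets.UTF_8;
System.out.println("A".getBytes(utf8).length);    // 1
System.out.println("é".getBytes(utf8).length);    // 2
System.out.println("€".getBytes(utf8).length);    // 3
System.out.println("😀".getBytes(utf8).length);   // 4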


1 Answer

String.getBytes() returns the bytes using the platform's default character encoding, which does not necessarily match the internal representation.

Java uses 2 bytes in RAM for each char, but when chars are "serialized" using UTF-8 they may produce one, two or three bytes in the resulting byte array (four for a character outside the BMP, which occupies two chars); that is how the UTF-8 encoding works.

Your code example is using UTF-8 for the serialized bytes, but Java strings are encoded in memory using UTF-16. Unicode code points that do not fit in a single 16-bit char are encoded using a 2-char pair known as a surrogate pair.
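To see the surrogate-pair case concretely (an illustrative snippet; 😀 is simply one example of a code point outside the BMP):

String emoji = "😀";                                          // U+1F600
System.out.println(emoji.length());                           // 2 -- two chars (a surrogate pair)
System.out.println(emoji.codePointCount(0, emoji.length()));  // 1 -- one Unicode code point
System.out.printf("%04X %04X%n", (int) emoji.charAt(0), (int) emoji.charAt(1)); // D83D DE00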

If you do not pass a parameter to String.getBytes(), it returns a byte array containing the String's contents encoded using the underlying OS's default charset. If you want to guarantee a UTF-8 encoded array, use getBytes(StandardCharsets.UTF_8) (or getBytes("UTF-8")) instead.
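For instance (a sketch; the first two lengths are fixed by the encodings, the last one depends on the platform default):

String foo = "€";
System.out.println(foo.getBytes(java.nio.charset.StandardCharsets.UTF_8).length);    // 3
System.out.println(foo.getBytes(java.nio.charset.StandardCharsets.UTF_16BE).length); // 2
System.out.println(foo.getBytes().length);   // platform dependent (3 with -Dfile.encoding=UTF-8)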

Calling String.charAt() just returns the UTF-16 code unit at that index from the String's in-memory representation.
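Which is why, in the question's example, chars[0] is the complete character (an illustrative snippet):

char c = "€".charAt(0);
System.out.printf("U+%04X%n", (int) c);   // U+20AC -- the single UTF-16 code unit backing the char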

Check this link: java utf8 encoding - char, string types

answered Sep 18 '22 by Shiladittya Chakraborty