I have the following program to test how Java handles Chinese characters:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s3 = "世界您好";
char[] chs = s3.toCharArray();
byte[] bs = s3.getBytes(StandardCharsets.UTF_8);
byte[] bs2 = new String(chs).getBytes(StandardCharsets.UTF_8);
System.out.println("encoding=" + Charset.defaultCharset().name() + ", " + s3 + " char[].length=" + chs.length
        + ", byte[].length=" + bs.length + ", byte[]2.length=" + bs2.length);
The printout is this:
encoding=UTF-8, 世界您好 char[].length=4, byte[].length=12, byte[]2.length=12
The results are these:
one Chinese character takes one char, which is 2 bytes in Java, if a char[] is used to hold the Chinese characters;
one Chinese character takes 3 bytes if a byte[] is used to hold the Chinese characters.
My questions are: if 2 bytes are enough, why do we use 3 bytes? And if 2 bytes are not enough, why do we use 2 bytes?
EDIT:
My JVM's default encoding is set to UTF-8.
In CCCII, each Chinese character is represented by a 3-byte code in which each byte is 7-bit, between 0x21 and 0x7E inclusive. Thus, the maximum number of characters representable in CCCII is 94 × 94 × 94 = 830584.
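As a quick sanity check on that arithmetic, here is a minimal Java sketch (the class name is just illustrative) that derives the 94 usable values per byte from the 0x21..0x7E range and multiplies them out:

public class CcciiCodeSpace {
    public static void main(String[] args) {
        // CCCII restricts each byte to the printable range 0x21..0x7E inclusive.
        int usableValuesPerByte = 0x7E - 0x21 + 1;    // 94
        long codeSpace = (long) usableValuesPerByte
                       * usableValuesPerByte
                       * usableValuesPerByte;         // 94 * 94 * 94
        System.out.println(usableValuesPerByte);      // 94
        System.out.println(codeSpace);                // 830584
    }
}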
Java's char is Unicode-based, while C's char is essentially an ASCII-sized byte. A C char only needs to distinguish 256 values, which fits in 1 byte; a Java char is a UTF-16 code unit that must be able to hold any of 65536 values, which cannot fit in 1 byte. That is why Java's char size is 2 bytes while C's char size is 1 byte.
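To make the Java side concrete, here is a small sketch (illustrative class name; Character.BYTES needs Java 8+) showing that a char is a 16-bit, 2-byte UTF-16 code unit and that a common Chinese character fits in a single char:

public class CharSizeDemo {
    public static void main(String[] args) {
        // A Java char is a 16-bit UTF-16 code unit.
        System.out.println(Character.SIZE);     // 16 (bits)
        System.out.println(Character.BYTES);    // 2  (bytes)

        char c = '世';                           // U+4E16 fits in a single char (it is in the BMP)
        System.out.printf("U+%04X%n", (int) c); // U+4E16
    }
}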
Even if you think of a "character" as a multi-byte thingy, C's char is not. sizeof(char) is always exactly 1. No exceptions, ever.
In C, the char type takes 1 byte of memory (8 bits) and can therefore express 2^8 = 256 values. Whether plain char is signed or unsigned is implementation-defined: a signed char ranges from -128 to 127, and an unsigned char ranges from 0 to 255.
A Java char stores 16 bits of data in a two-byte value, using every bit for the data. UTF-8 doesn't do this. For a Chinese character, UTF-8 uses a 3-byte sequence in which the leading byte carries 4 data bits and each of the two continuation bytes carries 6; the remaining bits are control information that marks where a character starts and how many bytes it occupies. (The split varies by character; for ASCII characters, UTF-8 uses 7 data bits in a single byte.) It's a more complicated encoding mechanism, but it lets UTF-8 represent every Unicode code point up to U+10FFFF using at most 4 bytes. This has the advantage of taking only one byte per character for 7-bit (ASCII) characters, making it backward compatible with ASCII. But it needs 3 bytes to store 16 bits of data. You can learn how it works by looking it up on Wikipedia.
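Here is a short sketch (illustrative class name) that prints the three UTF-8 bytes of one Chinese character in binary, so you can see the 1110xxxx leading byte and the 10xxxxxx continuation bytes described above:

import java.nio.charset.StandardCharsets;

public class Utf8BitsDemo {
    public static void main(String[] args) {
        String s = "世";                                // U+4E16
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            // 1110xxxx marks the leading byte of a 3-byte sequence,
            // 10xxxxxx marks each continuation byte.
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF))
                                .replace(' ', '0');
            System.out.printf("0x%02X  %s%n", b & 0xFF, bits);
        }
        // Prints: 0xE4 11100100 / 0xB8 10111000 / 0x96 10010110
    }
}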