 

Why does a Chinese character take one char (2 bytes) in Java, but 3 bytes in UTF-8?


I have the following program to test how Java handles Chinese characters:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s3 = "世界您好";
char[] chs = s3.toCharArray();                                   // one char per character
byte[] bs = s3.getBytes(StandardCharsets.UTF_8);                 // UTF-8 bytes of the string
byte[] bs2 = new String(chs).getBytes(StandardCharsets.UTF_8);   // same, round-tripped through char[]

System.out.println("encoding=" + Charset.defaultCharset().name() + ", " + s3 + " char[].length=" + chs.length
                + ", byte[].length=" + bs.length + ", byte[]2.length=" + bs2.length);

The output is:

encoding=UTF-8, 世界您好 char[].length=4, byte[].length=12, byte[]2.length=12

The results are:

  1. One Chinese character takes one char, which is 2 bytes in Java, when a char[] is used to hold the characters.

  2. One Chinese character takes 3 bytes when a byte[] (UTF-8 encoded) is used to hold the characters.

My question is: if 2 bytes are enough, why does UTF-8 use 3 bytes? And if 2 bytes are not enough, why does Java use only 2 bytes for a char?
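For reference, here is a minimal sketch (not part of the original test; it assumes the same string and the StandardCharsets import shown above) that prints the UTF-8 bytes of each character in hex, making the 3-bytes-per-character pattern visible:

// Dump the UTF-8 bytes of each character in the test string.
String s = "世界您好";
for (int i = 0; i < s.length(); i++) {
    byte[] utf8 = String.valueOf(s.charAt(i)).getBytes(StandardCharsets.UTF_8);
    StringBuilder hex = new StringBuilder();
    for (byte b : utf8) {
        hex.append(String.format("%02X ", b & 0xFF));
    }
    System.out.println(s.charAt(i) + " -> " + utf8.length + " bytes: " + hex.toString().trim());
}
// Each line shows 3 bytes, e.g. 世 -> 3 bytes: E4 B8 96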

EDIT:

My JVM's default encoding is set to UTF-8.

asked Mar 10 '17 by peterboston



1 Answer

A Java char stores 16 bits of data in a two-byte value, and every bit carries payload. UTF-8 doesn't work that way. It is a variable-length encoding in which the high bits of each byte are control bits that mark where a character starts and how many bytes it spans. An ASCII character fits in a single byte with 7 payload bits; a continuation byte carries only 6 payload bits; and the lead byte of a three-byte sequence carries just 4. A three-byte sequence therefore holds 4 + 6 + 6 = 16 payload bits, which is exactly what a BMP code point such as a common Chinese character needs. The scheme can encode any code point up to U+10FFFF using at most 4 bytes.

This design has the advantage of taking only one byte per character for 7-bit (ASCII) text, which makes UTF-8 backward compatible with ASCII, but it needs 3 bytes to carry 16 bits of data. You can read the details of the encoding on Wikipedia.
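To make the byte layout concrete, here is a small sketch (my own illustration, assuming a BMP code point in the three-byte UTF-8 range U+0800..U+FFFF) that builds the three-byte form 1110xxxx 10xxxxxx 10xxxxxx by hand and compares it with what the JDK produces:

import java.nio.charset.StandardCharsets;

public class Utf8ThreeByteDemo {
    public static void main(String[] args) {
        int cp = '世';  // U+4E16, encoded with 3 bytes in UTF-8

        // Manual 3-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 = 16 payload bits)
        byte b0 = (byte) (0b1110_0000 | (cp >> 12));          // top 4 bits of the code point
        byte b1 = (byte) (0b1000_0000 | ((cp >> 6) & 0x3F));  // middle 6 bits
        byte b2 = (byte) (0b1000_0000 | (cp & 0x3F));         // low 6 bits

        byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);

        System.out.printf("manual: %02X %02X %02X%n", b0 & 0xFF, b1 & 0xFF, b2 & 0xFF);
        System.out.printf("jdk:    %02X %02X %02X%n", jdk[0] & 0xFF, jdk[1] & 0xFF, jdk[2] & 0xFF);
        // Both lines print: E4 B8 96
    }
}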

answered Sep 25 '22 by MiguelMunoz