
Are all Kanji characters in UTF-8 3 bytes long?

Can someone please confirm that all Kanji characters in Chinese are 3 bytes long in UTF-8?

TopCoder asked Sep 09 '10 16:09



People also ask

How many bytes does UTF-8 use to encode each character?

UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.

How many bytes is UTF-8?

UTF-8 is a byte-oriented encoding of Unicode. Each Unicode character is identified by a code point, and UTF-8 represents each code point with 1, 2, 3 or 4 bytes depending on its value.
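The 1-to-4-byte rule can be sketched as a lookup on the code point value. This is a minimal illustration, not a library API: `utf8_length` is a hypothetical helper name, and the range boundaries are the standard UTF-8 ones (U+007F, U+07FF, U+FFFF).

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one Unicode code point.

    Hypothetical helper for illustration; the boundaries are the
    standard UTF-8 range limits.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2  # e.g. Latin letters with accents, Greek, Cyrillic
    if code_point <= 0xFFFF:
        return 3  # rest of the Basic Multilingual Plane, incl. common Kanji
    return 4      # supplementary planes


# Cross-check the table against Python's own encoder.
for ch in ("A", "é", "漢", "😀"):
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```

In practice `len(s.encode("utf-8"))` gives the same answer directly; the function only makes the range boundaries visible.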

How many bytes is a Chinese character?

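It depends on the encoding. In UTF-8, the commonly used Chinese characters take 3 bytes each and a number of rare ones take 4. The often-quoted description of a 3-byte code in which each byte is 7-bit, between 0x21 and 0x7E inclusive, applies to the legacy CCCII encoding, not to UTF-8.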

Does UTF-8 support Japanese?

The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.


2 Answers

Yes. Kanji run from U+4E00 to U+9FAF, and UTF-8 encodes every code point from U+0800 to U+FFFF in 3 bytes.
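The range claim is easy to check empirically. The sketch below encodes every code point in the quoted range with Python's built-in encoder and collects the distinct byte lengths; the constant names are only illustrative.

```python
# Range quoted in the answer above (illustrative constant names).
KANJI_FIRST, KANJI_LAST = 0x4E00, 0x9FAF

# Encode every code point in the range and collect the distinct lengths.
lengths = {
    len(chr(cp).encode("utf-8"))
    for cp in range(KANJI_FIRST, KANJI_LAST + 1)
}
print(lengths)  # {3}

# The 3-byte form starts at U+0800 and ends at U+FFFF.
assert len(chr(0x07FF).encode("utf-8")) == 2
assert len(chr(0x0800).encode("utf-8")) == 3
assert len(chr(0xFFFF).encode("utf-8")) == 3
assert len(chr(0x10000).encode("utf-8")) == 4
```

This only covers the range given here; ideographs outside it need a separate check.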

gawi answered Sep 19 '22 07:09



The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block between U+4E00 and U+9FFF, and take 3 bytes in UTF-8. (The Japanese Hiragana and Katakana characters also take 3 bytes.)

However, there are also some very rarely-used characters in the "CJK Unified Ideographs Extension B" and "CJK Compatibility Ideographs Supplement" blocks, which take 4 bytes in UTF-8.
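To see the difference concretely, here is a small sketch that encodes the first code point of each block mentioned above. The block start values (U+4E00, U+20000, U+2F800) are taken from the Unicode block list; `utf8_bytes` is a hypothetical helper name.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 (illustrative helper)."""
    return chr(code_point).encode("utf-8")


# First code point of each block named in the answer.
samples = {
    "CJK Unified Ideographs": 0x4E00,
    "CJK Unified Ideographs Extension B": 0x20000,
    "CJK Compatibility Ideographs Supplement": 0x2F800,
}

for block, cp in samples.items():
    encoded = utf8_bytes(cp)
    print(f"{block}: U+{cp:04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
```

Code that assumes 3 bytes per ideograph, for example when sizing a fixed-width buffer or database column, will truncate or reject the 4-byte characters.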

Also be aware that Chinese text often contains ASCII characters like the digits 0-9.
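A short example of why byte length cannot be derived as "3 times the character count" for real text. The sample string is made up for illustration.

```python
text = "第3章"  # two ideographs around an ASCII digit (made-up sample)

char_count = len(text)
byte_count = len(text.encode("utf-8"))

print(char_count, byte_count)  # 3 7

# Per-character breakdown: 3 bytes, 1 byte, 3 bytes.
per_char = [len(ch.encode("utf-8")) for ch in text]
print(per_char)  # [3, 1, 3]
```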

dan04 answered Sep 21 '22 07:09
