Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How many characters can be mapped with Unicode?

Tags:

unicode

utf-8

utf

I am asking for the count of all the possible valid combinations in Unicode with explanation. I know a char can be encoded as 1,2,3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be.

like image 764
Ufuk Hacıoğulları Avatar asked May 07 '11 21:05

Ufuk Hacıoğulları


People also ask

How many characters can Unicode have?

Unicode is a universal character set. It is aimed to include all the characters needed for any writing system or language. The first code point positions in Unicode use 16 bits to represent the most commonly used characters in a number of languages. This Basic Multilingual Plane allows for 65,536 characters.

How many Unicode codes are assigned?

The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 216) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.


1 Answers

I am asking for the count of all the possible valid combinations in Unicode with explanation.

1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters

Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.

137,929 code points are actually assigned in Unicode 12.1.

I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be.

The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.

For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.

In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.

like image 93
dan04 Avatar answered Sep 22 '22 05:09

dan04