
Total number of UTF16 Characters

Can you calculate, using permutations/combinations, that the UTF-16 encoding represents 1,112,064 numbers?

user4344 asked Feb 13 '11 12:02



2 Answers

Section 3.9 of the Unicode Standard says:

Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.

Hence the number of code points ('characters') that can be represented by UTF-16 is

0xD7FF + 1 + (0x10FFFF - 0xE000) + 1 = 1 112 064
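The arithmetic above can be checked with a short Python sketch. The names `LOW_RANGE` and `HIGH_RANGE` are illustrative only; the two ranges are the ones quoted from the standard.

```python
# Count the code points in the two ranges quoted from the standard:
# U+0000..U+D7FF and U+E000..U+10FFFF (the surrogate gap is left out).
LOW_RANGE = range(0x0000, 0xD7FF + 1)
HIGH_RANGE = range(0xE000, 0x10FFFF + 1)

total = len(LOW_RANGE) + len(HIGH_RANGE)
print(total)  # 1112064
```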

Unicode code points range from U+0000 to U+10FFFF, so a code point needs up to 21 bits. Specific encodings, however, use smaller code units (8 bits in UTF-8, 16 bits in UTF-16) so that the most common characters stay compact, and this imposes limits on the code points they can legally represent. To reach code points that do not fit in a single 16-bit unit, UTF-16 combines two special surrogate code points (from U+D800..U+DFFF) into a pair, which is why those 2,048 values are excluded from the count above. UTF-8 does not use surrogates; it uses multi-byte sequences instead.
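As a rough illustration of how surrogates work, and of a counting argument closer to what the question asks for, here is a minimal Python sketch. The function name `utf16_code_units` is hypothetical, and the counting at the end assumes 1,024 high surrogates (U+D800..U+DBFF) and 1,024 low surrogates (U+DC00..U+DFFF).

```python
def utf16_code_units(code_point: int) -> list:
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return [code_point]              # fits in a single 16-bit unit
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]

# Counting by code units: every 16-bit value except the 2,048 surrogates
# stands alone, and each (high, low) surrogate pair gives one more code point.
single_units = 2**16 - 2048
surrogate_pairs = 1024 * 1024
print(single_units + surrogate_pairs)  # 1112064
```

For example, `utf16_code_units(0x1F600)` returns `[0xD83D, 0xDE00]`, while `utf16_code_units(0x41)` returns `[0x41]`.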

Also, being able to represent a given code point in a given encoding doesn't mean the code point is a valid character: it has to be allocated and described by the Unicode Standard first. So 1 112 064 is only an upper bound on what UTF-16 can encode. No mathematical formula yields the number of assigned characters, and the figure doesn't mean there are 1.1M valid characters.
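The difference between "encodable" and "allocated" can be seen with Python's standard `unicodedata` module. This is only a sketch: U+50000 is picked as an example on the assumption that it is unassigned in the Unicode version bundled with the interpreter.

```python
import unicodedata

# U+50000 is assumed to be unassigned in the bundled Unicode database.
code_point = 0x50000

# UTF-16 can still encode it (as a surrogate pair, big-endian here).
encoded = chr(code_point).encode("utf-16-be")
print(encoded.hex())

# General category "Cn" means the code point is not assigned to a character.
category = unicodedata.category(chr(code_point))
print(category)
```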

For a detailed discussion, see Section 3.9 of the Unicode Standard.

Ondrej Tucny answered Sep 17 '22 23:09



No. The number of characters represented by UTF-16 is only knowable by specification, not by mathematics. UTF-16 is a specific set of encoding rules laid out by people, not an intrinsic property of some formula.

Dan Grossman answered Sep 20 '22 23:09
