 

Does the Unicode Consortium intend to make UTF-16 run out of characters? [closed]

The current version of UTF-16 can encode only 1,112,064 distinct code points: the range 0x0-0x10FFFF, minus the 2,048 values reserved for surrogates.
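For reference, that figure falls out of how surrogate pairs work. Here is a minimal sketch of the arithmetic; the function name `encode_utf16` and the constants are illustrative only, not taken from any library:

```python
# Sketch: why UTF-16 tops out at U+10FFFF.
HIGH_START = 0xD800      # first high (lead) surrogate, 1,024 values
LOW_START = 0xDC00       # first low (trail) surrogate, 1,024 values
SURROGATE_COUNT = 2048   # 0xD800..0xDFFF are reserved, not characters


def encode_utf16(cp):
    """Return the list of 16-bit code units for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not encodable in UTF-16: %#x" % cp)
    if cp < 0x10000:
        return [cp]                      # one code unit for the BMP
    offset = cp - 0x10000                # 20 bits, 10 per surrogate
    return [HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)]


# A pair carries 1,024 * 1,024 combinations on top of the 16-bit range.
highest = 0x10000 + 1024 * 1024 - 1
encodable = highest + 1 - SURROGATE_COUNT

print(hex(highest), encodable)  # 0x10ffff 1112064
```

So the limit is a property of the surrogate mechanism itself: going past 0x10FFFF would need a new encoding scheme, not just new assignments.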

Does the Unicode Consortium intend to make UTF-16 run out of characters?

That is, will it ever assign a code point above 0x10FFFF?

If not, why would anyone write a UTF-8 parser that accepts 5- or 6-byte sequences, given that doing so adds unnecessary instructions to the function?

Isn't 1,112,064 enough? Do we actually need more characters, and how quickly are we running out?

GlassGhost asked Feb 21 '12 19:02





2 Answers

As of 2011 we have assigned 109,449 characters and set aside another 137,468 code points (6,400 + 131,068) for private, application-specific use.

That leaves room for over 860,000 unused code points: plenty for CJK Extension E (~10,000 characters) and 85 more sets just like it, so in the event of contact with the Ferengi culture we should be ready.
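The arithmetic behind those figures, as a quick sketch. The variable names are made up for illustration, and the Extension E size is only the rough estimate quoted above:

```python
# Code point budget, using the 2011 figures quoted in this answer.
TOTAL = 0x110000 - 2048           # 1,112,064 encodable code points
ASSIGNED_2011 = 109_449           # characters assigned as of 2011
PRIVATE_USE = 6_400 + 131_068     # BMP private use + planes 15 and 16

remaining = TOTAL - ASSIGNED_2011 - PRIVATE_USE
EXT_E_SIZE = 10_000               # rough size of CJK Extension E

print(remaining)                  # 865147
print(remaining // EXT_E_SIZE)    # 86 sets of that size would fit
```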

In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding. A UTF-8 parser should therefore not accept 5- or 6-byte sequences, all of which would overflow the UTF-16 range, nor 4-byte sequences that decode to a value greater than 0x10FFFF.
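A rough sketch of what that restriction looks like in a decoder. `decode_utf8_char` is a hypothetical helper written for this answer, not a library function, and it decodes only one code point at a time:

```python
def decode_utf8_char(data, i=0):
    """Decode one code point from bytes `data` at index i.

    Returns (code_point, next_index); raises ValueError on input
    that the RFC 3629 rules do not allow.
    """
    b0 = data[i]
    if b0 < 0x80:
        return b0, i + 1                      # ASCII, single byte
    if 0xC2 <= b0 <= 0xDF:
        length, cp, min_cp = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        length, cp, min_cp = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        length, cp, min_cp = 4, b0 & 0x07, 0x10000
    else:
        # 0x80..0xBF: stray continuation byte; 0xC0/0xC1: always overlong;
        # 0xF5..0xFF: would exceed U+10FFFF or start a 5/6-byte sequence.
        raise ValueError("invalid lead byte %#x" % b0)

    tail = data[i + 1:i + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    for b in tail:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte %#x" % b)
        cp = (cp << 6) | (b & 0x3F)

    # Reject overlong forms, surrogates, and anything past U+10FFFF
    # (a lead byte of 0xF4 can still encode up to 0x13FFFF).
    if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("invalid code point %#x" % cp)
    return cp, i + length
```

For example, `decode_utf8_char(b"\xf4\x8f\xbf\xbf")` returns `(0x10FFFF, 4)`, while `b"\xf4\x90\x80\x80"` (which would be U+110000) and any sequence starting with `0xF8` raise `ValueError`. Note that dropping the 5- and 6-byte cases removes branches rather than adding them.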

If you know of a character set that threatens the Unicode code point limit, please edit it into the list below, provided it is more than 1/3 the size of CJK Extension E (~10,000 characters):

  • CJK Extension E (~10,000 characters)
  • Ferengi culture characters (~5,000 characters)
GlassGhost answered Oct 12 '22 15:10



At present, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine coding your app to reject characters above that point.
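Many runtimes already enforce this for you, so it is worth checking before writing your own test. As one illustration, here is how Python 3's built-in decoder and `chr()` behave; the helper name `rejected` is made up for this example:

```python
import sys


def rejected(raw):
    """Return True if the built-in strict UTF-8 decoder refuses `raw`."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(hex(sys.maxunicode))                  # 0x10ffff
print(rejected(b"\xf4\x8f\xbf\xbf"))        # False: this is U+10FFFF
print(rejected(b"\xf4\x90\x80\x80"))        # True: would be U+110000
print(rejected(b"\xf8\x88\x80\x80\x80"))    # True: 5-byte form

try:
    chr(0x110000)
    chr_ok = True
except ValueError:
    chr_ok = False
print(chr_ok)                               # False
```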

Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission-critical characters. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix that when it actually becomes an issue.

StilesCrisis answered Oct 12 '22 14:10
