
Unicode code point limit

As explained here, all Unicode encodings end at the largest code point, U+10FFFF. But I've heard that they can go up to 6 bytes. Is that true?

user4344 asked Feb 13 '11 08:02


People also ask

What is Unicode code point value?

A code point is a number assigned to represent an abstract character in a system for representing text (such as Unicode). In Unicode, a code point is expressed in the form "U+1234" where "1234" is the assigned number. For example, the character "A" is assigned a code point of U+0041.
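As a small illustration of the U+hhhh notation, here is a minimal Python sketch. The helper name `code_point_label` is made up for this example; `ord` and `chr` are Python built-ins that convert between a character and its code point.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+hhhh label for a single character."""
    # ord() gives the integer code point; at least four hex digits are shown.
    return f"U+{ord(ch):04X}"

print(code_point_label("A"))  # U+0041
print(code_point_label("€"))  # U+20AC
print(chr(0x41))              # A
```

Code points above U+FFFF simply get more hex digits, for example U+1F600.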

How many bits can Unicode use?

Unicode defines three encoding forms, UTF-8, UTF-16, and UTF-32, built on 8-bit, 16-bit, and 32-bit code units respectively. A code point itself needs at most 21 bits, because the largest one is U+10FFFF. Code points are usually written as U+hhhh, where hhhh is the hexadecimal value of the code point (four to six digits).

How many characters can Unicode hold?

Unicode is a universal character set that aims to include all the characters needed for any writing system or language. The first 65,536 code points (U+0000 through U+FFFF) fit in 16 bits and hold the most commonly used characters of many languages. This range is the Basic Multilingual Plane.

How many characters can 32 bit Unicode represent?

Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters.
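The arithmetic behind that total can be checked with a few lines of Python. This is only a sketch; the constant names are chosen for this example. It also subtracts the surrogate range U+D800 through U+DFFF, which is reserved for UTF-16 and never stands for a character on its own.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

total_code_points = PLANES * CODE_POINTS_PER_PLANE
max_code_point = total_code_points - 1

# Surrogate code points are reserved for UTF-16 pairs.
surrogates = 0xDFFF - 0xD800 + 1
scalar_values = total_code_points - surrogates

print(total_code_points)    # 1114112
print(hex(max_code_point))  # 0x10ffff
print(scalar_values)        # 1112064
```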


1 Answer

UTF-8 underwent some changes during its life, and several specifications (most of them outdated now) have standardized it. Most of the changes were introduced to improve compatibility with UTF-16 and to allow for the ever-growing number of code points.

To make a long story short, UTF-8 was originally specified to allow code points of up to 31 bits, encoded in up to 6 bytes. RFC 3629 reduced this to a maximum of 4 bytes, so that UTF-8 covers exactly the range U+0000 through U+10FFFF that UTF-16 can address.
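To make the difference concrete, here is a rough Python sketch that computes how many bytes a code point needs under each scheme. The function `utf8_length` and its `legacy` flag are invented for this illustration and are not part of any library. The thresholds follow from the payload bits of each sequence length (7, 11, 16, 21, 26, and 31 bits). For simplicity the sketch does not reject surrogates, which RFC 3629 also forbids.

```python
def utf8_length(cp: int, legacy: bool = False) -> int:
    """Return the number of bytes UTF-8 needs for code point cp.

    legacy=False: RFC 3629 rules (at most U+10FFFF, at most 4 bytes).
    legacy=True:  the original 31-bit design (up to 6 bytes).
    """
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if not legacy and cp > 0x10FFFF:
        raise ValueError("RFC 3629 stops at U+10FFFF")
    # Largest value that fits in a sequence of 1, 2, ..., 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n, limit in enumerate(limits, start=1):
        if cp <= limit:
            return n
    raise ValueError("value does not fit in 31 bits")

print(utf8_length(0x41))                       # 1
print(utf8_length(0x10FFFF))                   # 4
print(utf8_length(0x7FFFFFFF, legacy=True))    # 6
print(len(chr(0x10FFFF).encode("utf-8")))      # 4
```

Python's own `chr` follows the modern limit: `chr(0x110000)` raises `ValueError`.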

Wikipedia has some more information. The specification of the Universal Character Set is closely linked to the history of Unicode and its transformation format (UTF).

Holger Just answered Sep 18 '22 15:09