As explained here, All unicode encodings end at largest code point 10FFFF
But I've heard differently that
they can go upto 6 bytes, is it true?
A code point is a number assigned to represent an abstract character in a system for representing text (such as Unicode). In Unicode, a code point is expressed in the form "U+1234" where "1234" is the assigned number. For example, the character "A" is assigned a code point of U+0041.
Unicode uses two encoding forms: 8-bit and 16-bit, based on the data type of the data that is being that is being encoded. The default encoding form is 16-bit, where each character is 16 bits (2 bytes) wide. Sixteen-bit encoding form is usually shown as U+hhhh, where hhhh is the hexadecimal code point of the character.
Unicode is a universal character set. It is aimed to include all the characters needed for any writing system or language. The first code point positions in Unicode use 16 bits to represent the most commonly used characters in a number of languages. This Basic Multilingual Plane allows for 65,536 characters.
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters.
UTF-8 underwent some changes during its life, and there are many specifications (most of which are outdated now) which standardized UTF-8. Most of the changes were introduced to help compatibility with UTF-16 and to allow for the ever-growing amount of codepoints.
To make the long story short, UTF-8 was originally specified to allow codepoints with up to 31 bits (or 6 bytes). But with RFC3629, this was reduced to 4 bytes max. to be more compatible to UTF-16.
Wikipedia has some more information. The specification of the Universal Character Set is closely linked to the history of Unicode and its transformation format (UTF).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With