Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why Unicode code points are always written with at least 2 bytes?

Why does Unicode code points are always written with 2 bytes (4 digits) even when that's not necessary ?

From the Wikipedia page about UTF-8 :

$ -> U+0024
¢ -> U+00A2 
like image 442
Radioreve Avatar asked Jan 28 '23 12:01

Radioreve


1 Answers

TL;DR This is all by convention of the Unicode Consortium.

Here is the formal definition, found in Appendix A: Notational Conventions of the Unicode standard (I've referenced the latest at this time, version 11):

In running text, an individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.


They are hexadecimal digits, that represent Unicode Scalar Values. Initially only the first plane called the Basic Multilingual Plane were made available, which supported a range of U+0000 to U+FFFF to be defined. Initially the U+ encoding therefore always had 4 hexadecimal characters.

However, that only allows 64 Ki (65536) code points to use for characters (excluding some reserved values). So the single plane was later extended to 17 planes. Leading zero's are suppressed for values of U+10000 or higher, so the next character is written as U+10000, not U+010000. Currently there are 17 planes of 64Ki code points (some of which may be reserved), starting at U+0000, U+10000 ... U+90000 and finally U100000.

The U+xxxx notation does not follow UTF-8 encoding. Nor does it follow UTF-16, UTF-32 or the deprecated UCS encodings, either in big or little endian. The encoding of characters within the Basic Multilingual Plane is however identical to UTF-16(BE) in hexadecimal. Note that UTF-16 may contain surrogate code units that are used as escape to encode the characters in the other planes. The ranges of those code units are not mapped to characters and will therefore not be present in the textual code point representation.

See for instance, the PLUS-MINUS SIGN, ±:

Unicode code point: U+00B1 (as a textual string)
UTF-8             : 0xC2 0xB1 (as two bytes)
UTF-16            : 0x00B1
UTF-16BE          : 0x00B1 as 0x00 0xB1 (as two bytes)
UTF-16LE          : 0x00B1 as 0xB1 0x00 (as two bytes)

https://www.fileformat.info/info/unicode/char/00b1/index.htm


Much of this information can be found at sil.org.

like image 180
Maarten Bodewes Avatar answered Feb 02 '23 09:02

Maarten Bodewes