Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why cannot UTF-8 encoding of unicode code points fit in 3 bytes

Tags:

unicode

Wikipedia

Unicode comprises 1,114,112 code points in the range 0hex to 10FFFFhex

I am little puzzled that unicode encoding can take up-to 4 bytes. Could not 1,114,112 code points comfortably fit in 3 bytes? May be I am missing some special situations where it needs 4 bytes; please some concrete example if any?

like image 476
Saturday Sherpa Avatar asked Mar 19 '26 19:03

Saturday Sherpa


1 Answers

The Wikipedia article on the history of UTF-8 says that an earlier version of UTF-8 allowed more than 21 bits to be encoded. These encodings took 5 or even 6 bytes.

After it became clear that 2^21 code points will probably be enough for the remaining time of humankind (same thinking as with 5 bits, 6 bits, 7 bits, 8 bits and 16 bits), the encodings for 5 and for 6 bytes were simply forbidden. All other encoding rules were kept, for backwards compatibility.

As a consequence, the number space for the Unicode code points is now 0..10FFFF, which is even a bit less than 21 bits. Therefore it might be worth checking whether these 21 bits fit into the 24 bits of 3 bytes, instead of the current 4 bytes.

One important property of UTF-8 is that each byte that is part of a multibyte encoding has its highest bit set. To distinguish the leading byte from the trailing bytes, the leading byte has the second-highest bit set, while the trailing bytes have the second-highest bit cleared. This property ensures a consistent ordering. Therefore the characters could be encoded like this:

0xxx_xxxx                        7 bits freely chooseable
110x_xxxx 10xx_xxxx             11 bits freely chooseable
1110_xxxx 10xx_xxxx 10xx_xxxx   16 bits freely chooseable

Now 7 + 11 + 16 bits = 16.04 bits, which is much shorter than the 21 bits needed. Therefore encoding all Unicode code points using up to 3 bytes per the current UTF-8 encoding rules is impossible.

You can define another encoding where the highest bit of each byte is the continuation bit:

0xxx_xxxx                        7 bits freely chooseable
1xxx_xxxx 0xxx_xxxx             14 bits freely chooseable
1xxx_xxxx 1xxx_xxxx 0xxx_xxxx   21 bits freely chooseable

Now you have enough space to encode all 21-bit code points. But that's an entirely new encoding, so you would have to establish this world-wide. Given the experience from Unicode, it will take about 20 years. Good luck.

like image 106
Roland Illig Avatar answered Mar 24 '26 22:03

Roland Illig



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!