 

Why does the UTF-8 encoding not use bytes of the form 11111xxx as the first byte?

Tags: utf-8, utf

According to https://en.wikipedia.org/wiki/UTF-8, the first byte of a character's encoding never starts with either of the bit patterns 10xxxxxx or 11111xxx. The reason for the first is obvious: auto-synchronization. But what about the second? Is it reserved for something like a potential extension to 5-byte encodings?
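
As an aside, the auto-synchronization property mentioned above is easy to sketch: because only continuation bytes start with 10, a decoder dropped into the middle of a stream can always skip forward to the next character boundary. A minimal Python illustration (the resync helper is hypothetical, just for demonstration):

    def resync(data: bytes, pos: int) -> int:
        """Advance pos to the start of the next UTF-8 character.

        Continuation bytes always have the form 10xxxxxx, so any byte
        whose top two bits are not '10' begins a new character.
        """
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1
        return pos

    data = "a€b".encode("utf-8")   # b'a\xe2\x82\xacb'
    print(resync(data, 2))         # 4: skips the continuation bytes of U+20AC, lands on 'b'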

Junekey Jeon asked Sep 12 '25


1 Answer

Older versions of UTF-8 allowed encodings of up to 6 bytes. The format was later restricted to 4-byte sequences (RFC 3629 caps code points at U+10FFFF to match UTF-16), but there is no reason to make the format inconsistent in order to enforce that restriction. The number of leading 1s in the first byte indicates the length of the sequence, so 11111xxx would still mean "at least 5 bytes"; there simply are no legal sequences of that length.
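
To make the "leading 1s encode the length" rule concrete, here is a small Python sketch; the sequence_length helper is illustrative, not part of any standard library:

    def sequence_length(lead: int) -> int:
        """Return the sequence length implied by a UTF-8 lead byte,
        or raise for byte values that can never start a character."""
        if lead & 0x80 == 0x00:    # 0xxxxxxx -> 1 byte (ASCII)
            return 1
        if lead & 0xE0 == 0xC0:    # 110xxxxx -> 2-byte sequence
            return 2
        if lead & 0xF0 == 0xE0:    # 1110xxxx -> 3-byte sequence
            return 3
        if lead & 0xF8 == 0xF0:    # 11110xxx -> 4-byte sequence
            return 4
        # Remaining patterns: 10xxxxxx (continuation byte) and 11111xxx
        # (would signal 5 or more bytes, which UTF-8 no longer permits).
        raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")

    print(sequence_length(0x41))   # 1, 'A'
    print(sequence_length(0xE2))   # 3, lead byte of U+20AC
    # sequence_length(0xF8)        # would raise ValueError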

Having illegal byte values is very useful for detecting corruption (or, more commonly, attempts to decode data that is not actually UTF-8). So making the format inconsistent just to reclaim one bit of storage (which couldn't actually be used for anything) would hurt other goals.
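
A strict decoder shows this directly: any byte in the range 0xF8–0xFF is rejected on sight, which is exactly the early error detection described above. For example, with Python's built-in decoder:

    # A would-be 5-byte sequence: the lead byte 0xF8 is of the form 11111xxx
    # and can never appear in well-formed UTF-8.
    bogus = bytes([0xF8, 0x80, 0x80, 0x80, 0x80])

    try:
        bogus.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(exc)  # 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte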

Rob Napier answered Sep 14 '25