
Why is it necessary to mark continuation bytes in UTF-8?

I've recently been reading up on the UTF-8 variable-width encoding, and I found it strange that UTF-8 specifies the first two bits of every continuation byte to be 10.

 Range           |  Encoding
-----------------+-----------------
     0 - 7f      |  0xxxxxxx
    80 - 7ff     |  110xxxxx 10xxxxxx
   800 - ffff    |  1110xxxx 10xxxxxx 10xxxxxx
 10000 - 10ffff  |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

I was playing around with other possible variable-width encodings, and found that with the following scheme, at most 3 bytes are needed to store all of Unicode. If the first bit of a byte is 1, then the character continues in at least one more byte (read until a byte whose first bit is 0); a rough decoding sketch follows the table.

 Range           |  Encoding
-----------------+-----------------
     0 - 7f      |  0xxxxxxx
    80 - 407f    |  1xxxxxxx 0xxxxxxx
  4080 - 20407f  |  1xxxxxxx 1xxxxxxx 0xxxxxxx
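
For concreteness, here's a rough C sketch of the decoder I have in mind (the helper name is just for illustration; I'm assuming multi-byte values are offset by the start of their range, which is what makes the ranges in the table line up):

    #include <stddef.h>

    /* Decode one code point under the proposed scheme: keep reading
       bytes while the top bit is 1; the low 7 bits of each byte are
       payload. Multi-byte values are offset so each length starts
       where the previous one ends (0x80, then 0x4080). */
    unsigned long proposed_decode(const unsigned char *s, size_t *len)
    {
        unsigned long v = 0;
        size_t i = 0;
        while (s[i] & 0x80)            /* top bit set: more follows */
            v = (v << 7) | (s[i++] & 0x7F);
        v = (v << 7) | s[i++];         /* final byte: top bit clear */
        if (i == 2) v += 0x80;         /* skip past the 1-byte range */
        if (i == 3) v += 0x4080;       /* skip past the 2-byte range */
        *len = i;
        return v;
    }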

Are the continuation bits in UTF-8 really that important? The second encoding seems much more efficient.

asked Dec 02 '22 by crb233


2 Answers

UTF-8 is self-validating, fast to step forward through, and easy to step backward through.

Self-validating: Since the first byte of a sequence specifies its length, the next X bytes must match 10xxxxxx or the sequence is invalid, and a 10xxxxxx byte seen on its own is immediately recognizable as invalid.
Your suggested encoding has no built-in validation.
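
A rough sketch of that check in C (the helper names are mine, and a full validator would additionally have to reject overlong forms, surrogates, and code points above U+10FFFF; see the Wikipedia link below):

    #include <stddef.h>

    /* Expected sequence length from the lead byte, or 0 if the byte
       can't start a sequence (e.g. a lone 10xxxxxx continuation). */
    int utf8_seq_len(unsigned char b)
    {
        if (b < 0x80)           return 1;  /* 0xxxxxxx */
        if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
        if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
        if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
        return 0;                          /* 10xxxxxx: invalid lead */
    }

    /* Check that every byte after the lead matches 10xxxxxx;
       returns the sequence length, or 0 if invalid. */
    int utf8_seq_valid(const unsigned char *s, size_t n)
    {
        int len = utf8_seq_len(s[0]);
        if (len == 0 || (size_t)len > n) return 0;
        for (int i = 1; i < len; i++)
            if ((s[i] & 0xC0) != 0x80) return 0;
        return len;
    }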

Fast to step forward: If you have to skip a character, you can immediately jump X bytes ahead, as determined by the first byte, without examining any of the intermediate bytes.

Easier to step backward: If you have to read the bytes backwards, you can immediately recognize a continuation byte by its 10xxxxxx pattern, so you scan backwards past the 10xxxxxx bytes until you hit the lead byte, with no risk of overshooting it.
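
Both operations come out as a few lines of C (again just a sketch; it assumes the input has already been validated):

    /* Step forward: the lead byte alone determines the jump. */
    const unsigned char *utf8_next(const unsigned char *p)
    {
        unsigned char b = *p;
        if (b < 0x80)           return p + 1;
        if ((b & 0xE0) == 0xC0) return p + 2;
        if ((b & 0xF0) == 0xE0) return p + 3;
        return p + 4;                      /* 11110xxx */
    }

    /* Step backward: back up past 10xxxxxx continuation bytes;
       the first byte that doesn't match 10xxxxxx is the lead. */
    const unsigned char *utf8_prev(const unsigned char *p)
    {
        do {
            p--;
        } while ((*p & 0xC0) == 0x80);
        return p;
    }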

See the section "Invalid sequences and error handling" of the UTF-8 article on Wikipedia.

answered Dec 09 '22 by Andreas


Apart from ease of iteration as already mentioned: UTF-8 aims to be safe for ASCII-based (and other UTF-8-unaware) tools to process through such common manipulations as searching, concatenation, replacing, and escaping.

The advantages of ASCII-compatibility for interop and security outweigh the cost of using an extra byte for characters U+0800 through U+407F.

80 - 407f | 1xxxxxxx 0xxxxxxx

So there were a few East Asian multibyte encodings that did it like that, with some unfortunate results that UTF-8 was specifically designed to avoid.

In this proposed scheme, the trailing bytes of a multi-byte character overlap with ASCII, and many ASCII characters have special meanings to various languages and tools. So if you want to say ¢ (U+00A2), that's the byte pair 0x80,0x22, and the second byte of that looks like a " to any tool that manipulates byte strings without support for (or knowledge that the data is using) the proposed encoding.
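
As a quick sketch of the collision (the encoder name is hypothetical, and this uses the same offset reading of the scheme as the question's table):

    #include <stdio.h>

    /* Encode a code point in the 0x80-0x407F range under the proposed
       scheme: subtract the 0x80 offset, then emit 7 bits per byte with
       the top bit marking "more bytes follow". */
    void proposed_encode2(unsigned long cp, unsigned char out[2])
    {
        unsigned long v = cp - 0x80;
        out[0] = 0x80 | (unsigned char)(v >> 7);   /* 1xxxxxxx */
        out[1] = (unsigned char)(v & 0x7F);        /* 0xxxxxxx */
    }

    int main(void)
    {
        unsigned char b[2];
        proposed_encode2(0xA2, b);                 /* U+00A2 CENT SIGN */
        printf("0x%02X 0x%02X\n", b[0], b[1]);     /* prints 0x80 0x22 */
        /* ...and 0x22 is ASCII '"', so a naive byte-oriented tool
           sees a stray double quote in the middle of the text. */
        return 0;
    }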

Cue security holes in everything that combines user input into control flow. SQL injection in queries, HTML injection on web pages, command injection in shell scripts and so on.

(The East Asian multibyte encodings weren't quite as bad as this encoding here, as they didn't reuse the ASCII control codes as continuation bytes. As proposed, text using this encoding can't be stored in a C null-terminated string, for example. Still, Shift-JIS and friends caused a whole bunch of security holes and we are all very glad to be rid of them.)

answered Dec 09 '22 by bobince