UTF-8 encoding why prefix 10?

2 Answers

Historically there were several proposals for UTF-8's encoding. One of them used no prefix in the following bytes, and another, named FSS-UTF, used the prefix 1.

In the end, however, a new encoding using the prefix 10 for the following bytes was chosen:

Number    First       Last         Byte
of bytes  code point  code point   sequence
1         U+0000      U+007F       0xxxxxxx
2         U+0080      U+07FF       110xxxxx 10xxxxxx
3         U+0800      U+FFFF       1110xxxx 10xxxxxx 10xxxxxx
4         U+10000     U+1FFFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5         U+200000    U+3FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6         U+4000000   U+7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(This is the original design. Modern UTF-8, as defined by RFC 3629, stops at 4 bytes and at U+10FFFF.)
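To make the bit layout in the table concrete, here is a minimal sketch of an encoder for the 1- to 4-byte rows. The function name `encode_code_point` is made up for illustration, and the sketch deliberately stops at U+10FFFF (the modern limit) instead of covering the 5- and 6-byte rows. It does not reject surrogates, which a real encoder would.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point following the 1- to 4-byte rows of the table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range for this sketch")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


# Compare against Python's built-in encoder for a few sample characters.
for ch in "A", "é", "€", "🎃":
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
```

Every byte after the first carries the 10 prefix plus 6 payload bits, which is why each extra byte adds only 6 bits of range.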

A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it somewhat less bit-efficient than the previous proposal but crucially allowed it to be self-synchronizing, letting a reader start anywhere and immediately detect byte sequence boundaries.

https://en.wikipedia.org/wiki/UTF-8#History

The most obvious advantage of the new encoding is self-synchronization, as others mentioned. It lets the reader find character boundaries easily, so a reader can quickly skip past any dropped byte, and the current, previous, and next characters can be found immediately from any byte index in the string. If the indexed byte starts with 10, it is a middle byte: just go back or forward to find the start of the surrounding characters. Otherwise, if it starts with 0 or 11, it is the start of a byte sequence.
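As a rough sketch of that boundary check (the helper names `is_continuation`, `char_start` and `next_char_start` are made up for illustration, and the input is assumed to be valid UTF-8):

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx, i.e. it is a middle byte."""
    return (byte & 0b11000000) == 0b10000000


def char_start(data: bytes, i: int) -> int:
    """Return the index where the character containing byte i starts."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


def next_char_start(data: bytes, i: int) -> int:
    """Return the index where the character after the one at byte i starts."""
    i += 1
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


data = "aé€".encode("utf-8")     # 1-byte, 2-byte and 3-byte characters
start = char_start(data, 4)      # byte 4 is in the middle of "€"
end = next_char_start(data, 4)
print(data[start:end].decode("utf-8"))
```

Neither helper ever looks further than a few bytes away from the index it is given, no matter how long the string is.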

That property is very important because in a badly designed encoding without self-synchronization, like Shift-JIS, the reader has to maintain a table of character offsets, or else reparse the array from the beginning to edit a string. In DOS/V for Japanese (which uses Shift-JIS), the table wasn't used, probably due to the limited amount of memory, so every time you press Backspace the OS has to rescan from the start to know which character was deleted. There's no way to get the length of the previous character directly, as there is in UTF-8.
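A sketch of the difference for a Backspace-style edit follows. The function names are hypothetical, and the Shift-JIS lead-byte ranges used here (0x81-0x9F and 0xE0-0xFC) are a simplifying assumption about the encoding, not a description of how DOS/V actually did it.

```python
def utf8_backspace(buf: bytes) -> bytes:
    """Drop the last character by inspecting only the tail of the buffer."""
    i = len(buf) - 1
    # Walk back over 10xxxxxx bytes until the start byte is reached.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return buf[:max(i, 0)]


def sjis_is_lead(byte: int) -> bool:
    """Assumed lead-byte ranges of double-byte Shift-JIS characters."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def sjis_backspace(buf: bytes) -> bytes:
    """Drop the last character by rescanning the buffer from the start."""
    i = 0
    last_start = 0
    while i < len(buf):
        last_start = i
        # A lead byte consumes the following byte too, whatever its value.
        i += 2 if sjis_is_lead(buf[i]) else 1
    return buf[:last_start]


text = "日本語ソ"
assert utf8_backspace(text.encode("utf-8")) == text[:-1].encode("utf-8")
assert sjis_backspace(text.encode("shift_jis")) == text[:-1].encode("shift_jis")
```

The UTF-8 version touches at most a handful of bytes at the end, while the Shift-JIS version in this sketch walks the whole buffer, because a trailing byte on its own does not say whether it is a single-byte character or the second half of a double-byte one.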

The prefixed nature of UTF-8 also allows the old C string search functions to work without any modification, because a valid UTF-8 search string can only match at a character boundary, never in the middle of another character's byte sequence. In Shift-JIS or other non-self-synchronizing encodings you need a specialized search function, because a byte that can start a character can also be a middle byte of another character.
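The search problem can be illustrated with a plain byte-level search, which is roughly what `strstr` does in C. This is only a sketch: `byte_find` is a made-up helper, and it relies on Python's `shift_jis` codec encoding the katakana "ソ" as the two bytes 0x83 0x5C, where 0x5C is also the ASCII backslash.

```python
def byte_find(haystack: str, needle: str, encoding: str) -> int:
    """Search at the byte level, ignoring character boundaries entirely."""
    return haystack.encode(encoding).find(needle.encode(encoding))


# UTF-8: the backslash byte never occurs inside a multi-byte character.
print(byte_find("ソ", "\\", "utf-8"))

# Shift-JIS: the second byte of "ソ" is 0x5C, so a naive search reports
# a backslash that is not really in the text.
print(byte_find("ソ", "\\", "shift_jis"))
```

In UTF-8 the naive search reports no match, while in Shift-JIS it reports a hit at byte offset 1, inside the character.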

Some of the above advantages are also shared by UTF-16:

Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units (i.e. the type of code unit can be determined by the ranges of values in which it falls). UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string (UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte).

https://en.wikipedia.org/wiki/UTF-16#Description
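The range check described in that quote can be sketched like this (the names `classify_code_unit` and `utf16_units` are invented for the example):

```python
import struct


def classify_code_unit(unit: int) -> str:
    """Classify a 16-bit code unit purely by the range it falls in."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP"


def utf16_units(text: str) -> list:
    """Split text into big-endian UTF-16 code units."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


for unit in utf16_units("a🎃"):
    print(f"{unit:04X}", classify_code_unit(unit))
```

As with UTF-8, no earlier code unit has to be examined to tell whether the current one starts a character, as long as the reader is aligned on 16-bit words.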

160

answered Sep 27 '22 21:09

phuclv

All follow-up bytes of multi-byte characters start with binary 10 to indicate that they are follow-up bytes.

This allows re-synchronization if parts of a transmission are broken and/or missing. For example if the first byte of a multi-byte sequence is missing, you can still figure out where the next character starts.

If the follow-up bytes could take any value, there would be no way to distinguish them from single-byte encoded characters.
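A small sketch of that recovery, assuming the only damage is a lost first byte (the helper name `skip_to_next_char` is hypothetical):

```python
def skip_to_next_char(data: bytes) -> bytes:
    """Discard orphaned 10xxxxxx bytes at the front of a damaged stream."""
    i = 0
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return data[i:]


original = "€uro".encode("utf-8")
damaged = original[1:]            # the first byte of "€" was lost in transit
print(skip_to_next_char(damaged).decode("utf-8"))
```

Only the character whose first byte was lost is dropped; everything after it decodes normally.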

answered Sep 27 '22 20:09

Joachim Sauer
