
How does UTF-8 "variable-width encoding" work?

People also ask

How does variable-width encoding work?

A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set for representation, usually in a computer. The most common variable-width encodings are multibyte encodings, which use varying numbers of bytes to encode different characters.

What does UTF-8 encoding do?

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
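In Python terms, that round trip looks roughly like this (the sample string is just an arbitrary illustration):

text = "Ünïcødé ✓"
encoded = text.encode("utf-8")            # code points -> bytes
assert encoded.decode("utf-8") == text    # bytes -> the same code points
print(encoded)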

What is the difference between UTF-8 and UTF-16?

Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 a character's length starts at 16 bits. The main UTF-8 advantage is that basic ASCII characters (digits, Latin letters with no accents, etc.) occupy only one byte each, identical to their US-ASCII representation.

What does it mean that both UTF-8 and UTF-16 use variable-width encoding?

In short, UTF-8 is a variable-length encoding that takes 1 to 4 bytes, depending on the code point. UTF-16 is also a variable-length encoding, but each code point takes either 2 or 4 bytes.
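
A quick way to see those widths in practice, using Python's built-in codecs (the sample characters are arbitrary picks, and UTF-16 is encoded big-endian without a BOM so the lengths are easy to read):

samples = ["A", "é", "€", "😀"]   # U+0041, U+00E9, U+20AC, U+1F600

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # no BOM, so len() shows only the character itself
    print(f"U+{ord(ch):04X}  UTF-8: {len(utf8)} byte(s)  UTF-16: {len(utf16)} byte(s)")

This prints 1/2, 2/2, 3/2, and 4/4 bytes respectively, matching the ranges described above.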


Each byte starts with a few bits that tell you whether it's a single-byte code point, the start of a multi-byte code point, or a continuation of a multi-byte code point. Like this:

0xxx xxxx    A single-byte US-ASCII code (one of the first 128 characters)

The multi-byte code-points each start with a few bits that essentially say "hey, you need to also read the next byte (or two, or three) to figure out what I am." They are:

110x xxxx    One more byte follows
1110 xxxx    Two more bytes follow
1111 0xxx    Three more bytes follow

Finally, the bytes that follow those start codes all look like this:

10xx xxxx    A continuation of one of the multi-byte characters

Since you can tell what kind of byte you're looking at from its first few bits, even if something gets mangled somewhere you don't lose the whole sequence.
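
To make that byte classification concrete, here is a minimal Python decoder sketch that follows exactly those leading-bit rules. It assumes well-formed input and skips the validation a real decoder needs (overlong forms, surrogates, truncated sequences); in practice you would simply call bytes.decode("utf-8"):

def decode_utf8(data: bytes) -> str:
    """Minimal UTF-8 decoder sketch: classify each byte by its leading bits."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0b1000_0000:                    # 0xxxxxxx: single-byte US-ASCII code point
            cp, extra = b, 0
        elif b >= 0b1111_0000:                 # 11110xxx: three more bytes follow
            cp, extra = b & 0b0000_0111, 3
        elif b >= 0b1110_0000:                 # 1110xxxx: two more bytes follow
            cp, extra = b & 0b0000_1111, 2
        elif b >= 0b1100_0000:                 # 110xxxxx: one more byte follows
            cp, extra = b & 0b0001_1111, 1
        else:                                  # 10xxxxxx: a stray continuation byte; skip and resync
            i += 1
            continue
        for b2 in data[i + 1 : i + 1 + extra]: # each continuation byte is 10xxxxxx
            cp = (cp << 6) | (b2 & 0b0011_1111)  # shift in its 6 payload bits
        out.append(chr(cp))
        i += 1 + extra
    return "".join(out)

print(decode_utf8("héllo €".encode("utf-8")))  # héllo €

The "stray continuation byte" branch is the self-synchronization property in action: a mangled or missing byte only costs you that one character, and decoding picks up again at the next lead byte.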


RFC 3629, "UTF-8, a transformation format of ISO 10646", is the final authority here and has all the explanations.

In short, several bits in each byte of the 1-to-4-byte UTF-8 sequence representing a single character indicate whether the byte is a continuation (trailing) byte or a leading byte, and, for a leading byte, how many bytes follow. The remaining bits contain the payload.
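
As a worked example of that split between marker bits and payload bits, here is the Euro sign U+20AC printed byte by byte (the character choice is arbitrary):

# U+20AC has the payload bits 0010 0000 1010 1100 and needs three bytes:
#   leading byte  1110 0010   "1110" says two continuation bytes follow; "0010" is the top payload
#   continuation  1000 0010   "10" marks a trailing byte; "000010" is the next 6 payload bits
#   continuation  1010 1100   "10" again; "101100" is the last 6 payload bits
for b in "€".encode("utf-8"):
    print(f"{b:08b}")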


UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Excerpt from The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
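
A quick way to check those boundaries in Python (note that RFC 3629 later restricted UTF-8 to at most 4 bytes, covering code points up to U+10FFFF, so the 5- and 6-byte forms from the original design are no longer produced):

# Code points below 128 come out as their plain ASCII byte; everything above grows.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex()}")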