I'm trying to figure out what "continuation bytes" are (for curiosity's sake) in the UTF-8 encoding. Wikipedia introduces this term in the UTF-8 article without defining it at all, and a Google search returns no useful information either. I'm about to jump into the official specification, but would prefer to read a high-level summary first.

UTF-8 Continuation bytes

2 Answers

A continuation byte in UTF-8 is any byte where the top two bits are 10.

They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
 U+000000-U+00007f   0xxxxxxx  0xxxxxxx

 U+000080-U+0007ff   110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

 U+000800-U+00ffff   1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

 U+010000-U+10ffff   11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx

Here you can see how the Unicode code points map to UTF-8 multi-byte byte sequences, and their equivalent binary values.
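To make the mapping concrete, here is a minimal sketch in Python that packs a code point into bytes using the same bit layouts. The function name encode_code_point is made up for illustration; in real code you would simply call str.encode("utf-8").

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand (illustrative helper, not a library API)."""
    # Surrogates (U+D800-U+DFFF) and values above U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:
        # 0xxxxxxx: a single byte, no continuation bytes
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Every byte after the first is 0x80 | (six payload bits), i.e. a continuation byte.
print(encode_code_point(0x20AC).hex(" "))  # the euro sign, U+20AC
```

Each continuation byte carries six bits of the code point, which is why its top two bits are always 10.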

The basic rules are these:

1. If a byte starts with a 0 bit, it's a single byte value less than 128.
2. If it starts with 11, it's the first byte of a multi-byte sequence and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
3. If it starts with 10, it's a continuation byte.
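These rules can be sketched as a handful of bit masks. The function name classify and the labels it returns are made up for illustration. The sketch looks only at the leading bit pattern, so it does not reject lead bytes that a strict decoder would refuse (such as 0xC0 or 0xC1).

```python
def classify(byte: int) -> str:
    """Classify a single byte (0-255) by its leading bits (illustrative helper)."""
    if (byte & 0x80) == 0x00:   # 0xxxxxxx
        return "single"
    if (byte & 0xC0) == 0x80:   # 10xxxxxx
        return "continuation"
    if (byte & 0xE0) == 0xC0:   # 110xxxxx
        return "lead-2"
    if (byte & 0xF0) == 0xE0:   # 1110xxxx
        return "lead-3"
    if (byte & 0xF8) == 0xF0:   # 11110xxx
        return "lead-4"
    return "invalid"            # 11111xxx never appears in UTF-8


for b in "a€".encode("utf-8"):
    print(f"{b:08b} {classify(b)}")
```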

This distinction allows quite handy processing such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find one not beginning with the 10 bits.
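The backward search can be sketched as follows, assuming the input is well-formed UTF-8. The name code_point_start is hypothetical.

```python
def code_point_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the code point that contains data[i].

    Assumes data is well-formed UTF-8 (illustrative helper, not a library API).
    """
    # Step back over continuation bytes (10xxxxxx) until a lead or single byte.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "a€b".encode("utf-8")          # 61 | e2 82 ac | 62
print(code_point_start(data, 3))      # index 3 is the last byte of the euro sign
```

Because a code point occupies at most four bytes, the loop steps back at most three times on valid input.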

Similarly, it can be used for a UTF-8 strlen that counts code points rather than bytes: count only the bytes that are not of the form 10xxxxxx.
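A minimal sketch of that counting idea, again assuming well-formed input. The name utf8_len is made up for illustration.

```python
def utf8_len(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


data = "aé€😀".encode("utf-8")
print(len(data), utf8_len(data))  # byte count versus code point count
```

Note that this counts code points, not user-perceived characters: a base letter followed by a combining accent still counts as two.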

answered Sep 19 '22 16:09

paxdiablo

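In short, continuation bytes are all the bytes of a multi-byte sequence other than the first one; single-byte characters have none. In UTF-8, continuation bytes start with the bits 10 (binary, not hex 0x10), so they fall in the range 0x80 to 0xBF.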

answered Sep 19 '22 16:09

rogerz


Tags:

unicode

utf-8
