Are 6 octet UTF-8 sequences valid?

Tags: unicode, utf-8

Can UTF-8 use 5- or 6-byte sequences, so that every Unicode character can be encoded? I'm finding conflicting statements in the standards. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.

(All quotes are from RFC 3629)

Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
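To make the quoted bit layout concrete, here is a minimal sketch of an encoder that follows it step by step. The function name `encode_utf8` is hypothetical and exists only for illustration; in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the RFC 3629 bit layout (illustrative only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # Sequence of one octet: high-order bit 0, seven payload bits.
        return bytes([cp])
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    else:
        n = 4
    # Initial octet: n high-order bits set to 1, then a 0 bit, then payload bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    out = [prefix | (cp >> (6 * (n - 1)))]
    # Following octets: bits 10, then six payload bits each.
    for i in range(n - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


print(encode_utf8(0x20AC).hex())   # euro sign, 3 octets
print(encode_utf8(0x1F383).hex())  # a plane 1 character, 4 octets
```

The 4-octet branch is what covers the planes beyond the BMP, up to U+10FFFF.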

So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from planes other than the BMP?

Section 1:

The octet values C0, C1, F5 to FF never appear.
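One way to check this claim empirically is to encode every Unicode scalar value and collect the octet values that actually occur. This is only a sketch using Python's built-in encoder; the names `NEVER` and `used` are made up for the example.

```python
# Octet values that RFC 3629 says never appear in UTF-8.
NEVER = {0xC0, 0xC1} | set(range(0xF5, 0x100))

# Every Unicode scalar value: U+0000..U+10FFFF minus the surrogate range.
text = "".join(chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)

# Set of distinct octet values seen in the UTF-8 encoding of all of them.
used = set(text.encode("utf-8"))

print(sorted(NEVER & used))  # → []
print(used == set(range(256)) - NEVER)  # → True
```

Since F5 to FF are the only possible lead octets for 4-octet sequences above U+10FFFF and for any 5- or 6-octet sequence, their absence is what rules those sequences out.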

Does this mean we cannot encode values that would need 5 or 6 octets in UTF-8 (or even some 4-octet values that fall outside the above range)?

Section 12:

Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range).

Looking at the previous RFC (RFC 2279) confirms this: they reduced the range of characters.

Section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.
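The buffer-sizing concern can be illustrated with a small length calculator. This is a sketch, not code from either standard: `utf8_length` and its `max_cp` parameter are hypothetical names, and the thresholds are the usual ones for the 1- to 6-octet layout.

```python
def utf8_length(cp: int, max_cp: int = 0x10FFFF) -> int:
    """Octets needed to encode cp.

    The default limit matches RFC 3629. Passing max_cp=0x7FFFFFFF models the
    older, wider definition that allowed 5- and 6-octet sequences.
    """
    if not 0 <= cp <= max_cp:
        raise ValueError(f"code point out of range: {cp:#x}")
    # Largest value representable with 1, 2, ... 6 octets.
    limits = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    for n, limit in enumerate(limits, start=1):
        if cp <= limit:
            return n
    raise ValueError(f"code point too large for UTF-8: {cp:#x}")


print(utf8_length(0x10FFFF))                        # → 4
print(utf8_length(0x200000, max_cp=0x7FFFFFFF))     # → 5
print(utf8_length(0x7FFFFFFF, max_cp=0x7FFFFFFF))   # → 6
```

With the range limited to U+10FFFF, a buffer of four octets per code point is always enough; code that accepts the wider range has to allow for six.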

So these sequences are allowed per the ISO/IEC 10646 definition, but not the RFC 3629 definition? Which one should I follow?

Thanks in advance.

256

asked Aug 24 '10 17:08

Patrick Niedzielski

2 Answers

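There are no Unicode characters beyond U+10FFFF, and the BMP covers only U+0000 through U+FFFF.

UTF-8 is well-defined for U+0000 through U+10FFFF, so characters outside the BMP can be encoded; they simply take four octets.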
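As a quick illustration in Python 3 (chosen here only as an example language), a character outside the BMP encodes to four octets, and the language itself refuses to create a code point past U+10FFFF:

```python
pumpkin = "\U0001F383"  # U+1F383, a character in plane 1 (outside the BMP)
encoded = pumpkin.encode("utf-8")
print(len(encoded), encoded.hex())  # → 4 f09f8e83

last = chr(0x10FFFF)  # the highest code point Unicode defines
print(last.encode("utf-8").hex())   # → f48fbfbf

try:
    chr(0x110000)  # one past the end of the Unicode code space
except ValueError as exc:
    print("rejected:", exc)
```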

169

answered Sep 20 '22 07:09

devio

Both UTF-8 and UTF-16 allow all Unicode characters to be encoded. What UTF-8 is not allowed to do is encode the high and low surrogate halves (which UTF-16 uses) or values above U+10FFFF, neither of which is a legal Unicode character.

Note that the BMP ends at U+FFFF.
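A strict decoder shows these rules in action. The sketch below relies on Python 3's built-in UTF-8 codec in its default strict mode; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "U+10FFFF, highest legal value": b"\xf4\x8f\xbf\xbf",
    "U+D800, a surrogate half": b"\xed\xa0\x80",
    "U+110000, just past the limit": b"\xf4\x90\x80\x80",
    "old-style 5-octet sequence": b"\xf8\x88\x80\x80\x80",
}
for label, raw in samples.items():
    print(f"{label}: {is_valid_utf8(raw)}")
```

Only the first sample is accepted; the surrogate, the out-of-range value, and the 5-octet form are all rejected.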

answered Sep 18 '22 07:09

chryss
