I am converting UTF-8 to its actual value in hex. However, there are some invalid byte sequences that I need to catch. Is there a quick way to check whether a character doesn't belong in UTF-8 in C++?
Open the file in Notepad and click 'Save As...'. In the 'Encoding:' combo box you will see the current file format. Yes, I opened the file in Notepad, selected the UTF-8 format, and saved it.
You can use isutf8 from the moreutils collection. In a shell script, use the --quiet switch and check the exit status, which is zero for files that are valid UTF-8.
This error occurs when the uploaded file is not in UTF-8 format. UTF-8 is the dominant character encoding on the World Wide Web; the error appears because the software you are using saved the file in a different encoding, such as ISO-8859, instead of UTF-8.
If our byte is positive (8th bit set to 0), this means it's an ASCII character: if (myByte >= 0) return myByte; Codes greater than 127 are encoded into several bytes. On the other hand, if our byte is negative, it is probably part of a UTF-8 encoded character whose code is greater than 127.
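A minimal sketch of that sign test (the helper name is mine; note that whether plain char is signed is implementation-defined, so the cast to unsigned char keeps the check portable):

#include <cstdio>

// ASCII bytes are 0x00..0x7F, so with a signed `char` they are
// non-negative; any byte with the top bit set belongs to a
// multi-byte UTF-8 sequence.
bool is_ascii(char byte) {
    return static_cast<unsigned char>(byte) < 0x80;
}

int main() {
    const char text[] = "A\xC3\xA9";  // 'A' followed by the UTF-8 bytes for U+00E9
    for (const char* p = text; *p != '\0'; ++p) {
        std::printf("0x%02X: %s\n", static_cast<unsigned char>(*p),
                    is_ascii(*p) ? "ASCII" : "part of a multi-byte sequence");
    }
    return 0;
}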
Follow the tables in the Unicode standard, chapter 3. (I used the Unicode 5.1.0 version of the chapter (p103); it was Table 3-7 on p94 of the Unicode 6.0.0 version, and was on p95 in the Unicode 6.3 version — and it is on p125 of the Unicode 8.0.0 version.)
Bytes 0xC0, 0xC1, and 0xF5..0xFF cannot appear in valid UTF-8. The valid sequences are documented; all others are invalid.
Code Points          First Byte   Second Byte   Third Byte   Fourth Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF       80..BF
U+0800..U+0FFF       E0           A0..BF        80..BF
U+1000..U+CFFF       E1..EC       80..BF        80..BF
U+D000..U+D7FF       ED           80..9F        80..BF
U+E000..U+FFFF       EE..EF       80..BF        80..BF
U+10000..U+3FFFF     F0           90..BF        80..BF       80..BF
U+40000..U+FFFFF     F1..F3       80..BF        80..BF       80..BF
U+100000..U+10FFFF   F4           80..8F        80..BF       80..BF
Note that the irregularities are in the second byte for certain ranges of values of the first byte. The third and fourth bytes, when needed, are consistent. Note that not every code point within the ranges identified as valid has been allocated (and some are explicitly 'non-characters'), so there is more validation needed still.
The code points U+D800..U+DBFF are UTF-16 high surrogates and U+DC00..U+DFFF are UTF-16 low surrogates; those cannot appear in valid UTF-8, because UTF-8 encodes code points outside the BMP (Basic Multilingual Plane) directly rather than via surrogate pairs. That is why the U+D800..U+DFFF range is excluded from the table.
Other excluded ranges (initial byte C0 or C1, or initial byte E0 followed by 80..9F, or initial byte F0 followed by 80..8F) are non-minimal encodings. For example, C0 80 would encode U+0000, but that's encoded by 00, and UTF-8 defines that the non-minimal encoding C0 80 is invalid. And the maximum Unicode code point is U+10FFFF; UTF-8 encodings starting from F4 90 upwards generate values that are out of range.
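As a sketch, the table above translates almost line for line into a C++ validator (the function name and signature are mine, not from any standard API):

#include <cstddef>
#include <cstdint>

// Returns true if the whole buffer is well-formed UTF-8,
// following Table 3-7 of the Unicode standard directly.
bool is_valid_utf8(const std::uint8_t* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        std::uint8_t b = p[i];
        std::size_t len;
        std::uint8_t lo = 0x80, hi = 0xBF;  // bounds for the second byte

        if (b <= 0x7F)                   { i += 1; continue; }    // U+0000..U+007F
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; }             // U+0080..U+07FF
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }  // U+0800..U+0FFF
        else if (b >= 0xE1 && b <= 0xEC) { len = 3; }             // U+1000..U+CFFF
        else if (b == 0xED)              { len = 3; hi = 0x9F; }  // U+D000..U+D7FF
        else if (b >= 0xEE && b <= 0xEF) { len = 3; }             // U+E000..U+FFFF
        else if (b == 0xF0)              { len = 4; lo = 0x90; }  // U+10000..U+3FFFF
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }             // U+40000..U+FFFFF
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }  // U+100000..U+10FFFF
        else return false;  // C0, C1, F5..FF, or a stray continuation byte

        if (i + len > n) return false;                     // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return false;  // irregular second byte
        for (std::size_t j = 2; j < len; ++j)              // remaining bytes: 80..BF
            if (p[i + j] < 0x80 || p[i + j] > 0xBF) return false;
        i += len;
    }
    return true;
}

Tightening the second-byte bounds per lead byte is the whole trick: it rejects non-minimal encodings, surrogates, and code points above U+10FFFF in a single pass, because those are exactly the table's irregularities.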
Good answer already, I'm just chipping in another take on this for fun.
UTF-8 uses a general scheme by Prosser and Thompson to encode large numbers as sequences of single bytes. This scheme can actually represent 2^36 values, but for Unicode we only need 2^21. Here's how it works. Let N be the number you want to encode (e.g. a Unicode codepoint):
If N < 128, it is encoded as a single byte 0nnnnnnn; the highest bit is zero.

Otherwise, N is encoded as a sequence of several bytes. The first byte starts with as many 1 bits as there are bytes in the sequence, followed by a zero bit and then the data bits; each following byte starts with 10 followed by six data bits. Examples:

1110xxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

A k-byte sequence fits 5k + 1 bits (when k > 1), so you can determine how many bytes you need given N. For decoding, read one byte; if its top bit is zero, store its value as is; otherwise use the first byte to figure out how many bytes are in the sequence and process all of those.
For Unicode as of today we only need at most k = 4 bytes.
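A minimal sketch of the general scheme in C++ (the function name is mine; deliberately not a conforming UTF-8 encoder, since it neither rejects surrogates nor stops at U+10FFFF, but covers the full 36-bit range):

#include <cstddef>
#include <cstdint>
#include <vector>

// Encode n using the Prosser/Thompson scheme described above.
// Returns an empty vector if n does not fit in 36 bits.
std::vector<std::uint8_t> encode_sequence(std::uint64_t n) {
    std::vector<std::uint8_t> out;
    if (n >= (std::uint64_t{1} << 36)) return out;  // beyond 2^36: not representable
    if (n < 0x80) {                                 // single byte: 0nnnnnnn
        out.push_back(static_cast<std::uint8_t>(n));
        return out;
    }
    // Find the smallest k (2..7) such that n fits in 5k + 1 bits.
    std::size_t k = 2;
    while (n >= (std::uint64_t{1} << (5 * k + 1))) ++k;

    std::uint8_t bytes[7];
    for (std::size_t j = k - 1; j > 0; --j) {  // continuation bytes: 10xxxxxx
        bytes[j] = static_cast<std::uint8_t>(0x80 | (n & 0x3F));
        n >>= 6;
    }
    // Lead byte: k one-bits, a zero bit, then the remaining data bits.
    bytes[0] = static_cast<std::uint8_t>((0xFFu << (8 - k)) | n);
    out.assign(bytes, bytes + k);
    return out;
}

For example, encode_sequence(0xE9) yields C3 A9, the UTF-8 encoding of U+00E9; restricting the input to valid Unicode scalar values (so k <= 4) gives real UTF-8.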