 

Check for invalid UTF8

Tags: c++, utf-8

I am converting UTF-8 data to its actual code point values in hex. However, there are some invalid byte sequences that I need to catch. Is there a quick way in C++ to check whether a sequence of bytes is not valid UTF-8?

asked Jul 02 '11 by Mark


People also ask

How do I check my UTF-8 encoding?

Open the file in Notepad and click 'Save As...'. The 'Encoding:' combo box shows the file's current encoding; you can select UTF-8 there and save to convert the file.

How do I know if my data is UTF-8?

You can use isutf8 from the moreutils collection. In a shell script, use the --quiet switch and check the exit status, which is zero for files that are valid UTF-8.

What is an invalid UTF-8 string?

This error occurs when an uploaded file is not in UTF-8 format. UTF-8 is the dominant character encoding on the World Wide Web; the error usually means the software you are using saved the file in a different encoding, such as ISO-8859, instead of UTF-8.

How do I identify a UTF-8 character?

If the byte is non-negative (its 8th bit is 0), it is an ASCII character: if ( myByte >= 0 ) return myByte; Code points greater than 127 are encoded as several bytes. If the byte is negative (assuming a signed char type), it is probably part of a UTF-8 encoded character whose code point is greater than 127.
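
Expanding on that test, here is a hypothetical C++ sketch of classifying a single byte this way. It assumes plain char is signed on your platform (common, but not guaranteed by the standard), and the names are invented for illustration:

    // Classify one byte of a UTF-8 stream using the sign-bit trick above.
    // Assumes plain 'char' is signed; with unsigned bytes, test (b & 0x80).
    enum class ByteKind { Ascii, Continuation, Lead };

    ByteKind classify(char myByte)
    {
        if (myByte >= 0)
            return ByteKind::Ascii;         // 0xxxxxxx: code point is the byte
        if ((myByte & 0xC0) == 0x80)
            return ByteKind::Continuation;  // 10xxxxxx: inside a sequence
        return ByteKind::Lead;              // 11xxxxxx: starts a sequence
    }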


2 Answers

Follow the tables in the Unicode standard, chapter 3. (I used the Unicode 5.1.0 version of the chapter (p103); it was Table 3-7 on p94 of the Unicode 6.0.0 version, and was on p95 in the Unicode 6.3 version — and it is on p125 of the Unicode 8.0.0 version.)

Bytes 0xC0, 0xC1, and 0xF5..0xFF cannot appear in valid UTF-8. The valid sequences are documented; all others are invalid.

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points        First Byte Second Byte Third Byte Fourth Byte
U+0000..U+007F     00..7F
U+0080..U+07FF     C2..DF     80..BF
U+0800..U+0FFF     E0         A0..BF      80..BF
U+1000..U+CFFF     E1..EC     80..BF      80..BF
U+D000..U+D7FF     ED         80..9F      80..BF
U+E000..U+FFFF     EE..EF     80..BF      80..BF
U+10000..U+3FFFF   F0         90..BF      80..BF     80..BF
U+40000..U+FFFFF   F1..F3     80..BF      80..BF     80..BF
U+100000..U+10FFFF F4         80..8F      80..BF     80..BF

Note that the irregularities are in the second byte for certain ranges of values of the first byte. The third and fourth bytes, when needed, are consistent. Note that not every code point within the ranges identified as valid has been allocated (and some are explicitly 'non-characters'), so there is more validation needed still.

The code points U+D800..U+DBFF are for UTF-16 high surrogates and U+DC00..U+DFFF are for UTF-16 low surrogates; those cannot appear in valid UTF-8 (you encode the values outside the BMP — Basic Multilingual Plane — directly in UTF-8), which is why that range is marked invalid.

Other excluded ranges (initial byte C0 or C1, or initial byte E0 followed by 80..9F, or initial byte F0 followed by 80..8F) are non-minimal encodings. For example, C0 80 would encode U+0000, but that's encoded by 00, and UTF-8 defines that the non-minimal encoding C0 80 is invalid. And the maximum Unicode code point is U+10FFFF; UTF-8 encodings starting from F4 90 upwards generate values that are out of range.
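
Putting Table 3-7 into code: the following is a minimal C++ sketch, not from the standard itself. The name is_valid_utf8 is made up for illustration; it checks a raw byte buffer against exactly the rows of the table, so surrogates, non-minimal encodings, and out-of-range F4 sequences are all rejected.

    #include <cstddef>

    // Returns true if the bytes [s, s + len) are well-formed UTF-8
    // according to Table 3-7 of the Unicode standard.
    bool is_valid_utf8(const unsigned char *s, std::size_t len)
    {
        std::size_t i = 0;
        while (i < len) {
            unsigned char b0 = s[i];
            std::size_t need;          // continuation bytes required
            unsigned char lo = 0x80;   // valid range for the second byte
            unsigned char hi = 0xBF;

            if (b0 <= 0x7F) { i += 1; continue; }           // U+0000..U+007F
            else if (b0 >= 0xC2 && b0 <= 0xDF) need = 1;    // U+0080..U+07FF
            else if (b0 == 0xE0) { need = 2; lo = 0xA0; }   // U+0800..U+0FFF
            else if (b0 >= 0xE1 && b0 <= 0xEC) need = 2;    // U+1000..U+CFFF
            else if (b0 == 0xED) { need = 2; hi = 0x9F; }   // U+D000..U+D7FF
            else if (b0 >= 0xEE && b0 <= 0xEF) need = 2;    // U+E000..U+FFFF
            else if (b0 == 0xF0) { need = 3; lo = 0x90; }   // U+10000..U+3FFFF
            else if (b0 >= 0xF1 && b0 <= 0xF3) need = 3;    // U+40000..U+FFFFF
            else if (b0 == 0xF4) { need = 3; hi = 0x8F; }   // U+100000..U+10FFFF
            else return false;        // C0, C1, F5..FF, or a stray 80..BF

            if (len - i - 1 < need) return false;           // truncated
            // The second byte carries the irregular range; the rest are 80..BF.
            if (s[i + 1] < lo || s[i + 1] > hi) return false;
            for (std::size_t j = 2; j <= need; ++j)
                if (s[i + j] < 0x80 || s[i + j] > 0xBF) return false;
            i += need + 1;
        }
        return true;
    }

For the questioner's use case, a variant that returns the offset of the first bad byte instead of a bool would let you report exactly which sequence failed before converting to hex.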

answered Oct 19 '22 by Jonathan Leffler


Good answer already; I'm just chipping in another take on this for fun.

UTF-8 uses a general scheme by Prosser and Thompson to encode large numbers as sequences of single bytes. This scheme can actually represent 2^36 values, but for Unicode we only need 2^21. Here's how it works. Let N be the number you want to encode (e.g. a Unicode code point):

  • If N < 128, just one byte 0nnnnnnn. The highest bit is zero.
  • Otherwise, several bytes. The first byte starts with as many ones as there are bytes in the sequence, followed by a zero, and then the data bits; successive bytes start with 10 followed by six data bits. Examples:
  • 3 byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
  • 5 byte sequence: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.
  • 7 byte sequence: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.

A k-byte sequence fits 5k + 1 bits (when k > 1), so you can determine how many bytes you need for a given N. For decoding, read one byte: if its top bit is zero, store its value as-is; otherwise use the first byte to figure out how many bytes are in the sequence and process them all.

For Unicode as of today we only need at most k = 4 bytes.
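
To make the scheme concrete, here is a rough C++ sketch of both directions, limited to the k <= 4 sequences Unicode actually uses. The names encode_utf8 and decode_utf8 are invented for illustration, and the decoder checks only the bit layout, not the full Table 3-7 ranges from the first answer:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Encode one code point n (assumed n <= 0x10FFFF and not a surrogate)
    // per the scheme above: a lead byte whose run of leading ones gives the
    // length, then continuation bytes of the form 10xxxxxx, six bits each.
    std::vector<unsigned char> encode_utf8(std::uint32_t n)
    {
        std::vector<unsigned char> out;
        if (n < 0x80) {                 // 0nnnnnnn
            out.push_back(static_cast<unsigned char>(n));
        } else if (n < 0x800) {         // 110xxxxx 10xxxxxx
            out.push_back(static_cast<unsigned char>(0xC0 | (n >> 6)));
            out.push_back(static_cast<unsigned char>(0x80 | (n & 0x3F)));
        } else if (n < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<unsigned char>(0xE0 | (n >> 12)));
            out.push_back(static_cast<unsigned char>(0x80 | ((n >> 6) & 0x3F)));
            out.push_back(static_cast<unsigned char>(0x80 | (n & 0x3F)));
        } else {                        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<unsigned char>(0xF0 | (n >> 18)));
            out.push_back(static_cast<unsigned char>(0x80 | ((n >> 12) & 0x3F)));
            out.push_back(static_cast<unsigned char>(0x80 | ((n >> 6) & 0x3F)));
            out.push_back(static_cast<unsigned char>(0x80 | (n & 0x3F)));
        }
        return out;
    }

    // Decode the sequence starting at s[0]; stores the code point in 'out'
    // and returns the number of bytes consumed, or 0 on a malformed or
    // truncated sequence.
    std::size_t decode_utf8(const unsigned char *s, std::size_t len,
                            std::uint32_t &out)
    {
        if (len == 0) return 0;
        unsigned char b = s[0];
        if (b < 0x80) { out = b; return 1; }   // top bit zero: value as-is
        std::size_t k;                          // total bytes in the sequence
        if      ((b & 0xE0) == 0xC0) { k = 2; out = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { k = 3; out = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { k = 4; out = b & 0x07; }
        else return 0;                          // stray continuation or k > 4
        if (len < k) return 0;                  // truncated sequence
        for (std::size_t i = 1; i < k; ++i) {
            if ((s[i] & 0xC0) != 0x80) return 0;   // must be 10xxxxxx
            out = (out << 6) | (s[i] & 0x3F);
        }
        return k;
    }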

answered Oct 19 '22 by Kerrek SB