
How to detect UTF-8 in plain C?

Tags: c, utf-8

I am looking for a code snippet in plain old C that detects that the given string is in UTF-8 encoding. I know the solution with regex, but for various reasons it would be better to avoid using anything but plain C in this particular case.

The regex solution looks like this (warning: various checks omitted):

    #define UTF8_DETECT_REGEXP  "^([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"

    /* str and len (the subject string and its length) come from the caller */
    pcre       *utf8_re;
    pcre_extra *utf8_pe;
    const char *error;
    int         error_off;
    int         rc;
    int         vect[100];

    utf8_re = pcre_compile(UTF8_DETECT_REGEXP, PCRE_CASELESS, &error, &error_off, NULL);
    utf8_pe = pcre_study(utf8_re, 0, &error);

    rc = pcre_exec(utf8_re, utf8_pe, str, len, 0, 0, vect, sizeof(vect)/sizeof(vect[0]));

    if (rc > 0) {
        printf("string is in UTF8\n");
    } else {
        printf("string is not in UTF8\n");
    }

Konstantin asked Jun 23 '09 09:06



1 Answer

Here's a (hopefully bug-free) implementation of this expression in plain C:

    _Bool is_utf8(const char * string)
    {
        if(!string)
            return 0;

        const unsigned char * bytes = (const unsigned char *)string;
        while(*bytes)
        {
            if( (// ASCII
                 // use bytes[0] <= 0x7F to allow ASCII control characters
                    bytes[0] == 0x09 ||
                    bytes[0] == 0x0A ||
                    bytes[0] == 0x0D ||
                    (0x20 <= bytes[0] && bytes[0] <= 0x7E)
                )
            ) {
                bytes += 1;
                continue;
            }

            if( (// non-overlong 2-byte
                    (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                    (0x80 <= bytes[1] && bytes[1] <= 0xBF)
                )
            ) {
                bytes += 2;
                continue;
            }

            if( (// excluding overlongs
                    bytes[0] == 0xE0 &&
                    (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF)
                ) ||
                (// straight 3-byte
                    ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                        bytes[0] == 0xEE ||
                        bytes[0] == 0xEF) &&
                    (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF)
                ) ||
                (// excluding surrogates
                    bytes[0] == 0xED &&
                    (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF)
                )
            ) {
                bytes += 3;
                continue;
            }

            if( (// planes 1-3
                    bytes[0] == 0xF0 &&
                    (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                    (0x80 <= bytes[3] && bytes[3] <= 0xBF)
                ) ||
                (// planes 4-15
                    (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                    (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                    (0x80 <= bytes[3] && bytes[3] <= 0xBF)
                ) ||
                (// plane 16
                    bytes[0] == 0xF4 &&
                    (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                    (0x80 <= bytes[3] && bytes[3] <= 0xBF)
                )
            ) {
                bytes += 4;
                continue;
            }

            return 0;
        }

        return 1;
    }

Please note that this is a faithful translation of the regular expression recommended by W3C for form validation, which does indeed reject some valid UTF-8 sequences (in particular those containing ASCII control characters).

Even after fixing this with the change mentioned in the code comment (testing bytes[0] <= 0x7F), the function still assumes a zero-terminated string, so it cannot handle embedded NUL characters, although U+0000 is technically legal in UTF-8.
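Both shortcomings (rejecting ASCII control characters and relying on zero-termination) could be addressed along the following lines. This is only a minimal sketch, not the code from the answer: the name is_utf8_n, the in_range helper and the explicit length parameter are assumptions introduced for illustration. The multi-byte ranges are the same ones the regular expression uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if b lies in the closed range [lo, hi]. */
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return lo <= b && b <= hi;
}

/* Hypothetical name. Validates exactly len bytes, so embedded NUL bytes
 * and other ASCII control characters (0x00-0x7F) are accepted. */
bool is_utf8_n(const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t i = 0;

    if (!p)
        return false;

    while (i < len) {
        unsigned char b = p[i];
        size_t need;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF; /* bounds for the first of them */

        if (b <= 0x7F) {                    /* any ASCII byte, NUL included */
            i++;
            continue;
        } else if (in_range(b, 0xC2, 0xDF)) {
            need = 1;                       /* non-overlong 2-byte */
        } else if (b == 0xE0) {
            need = 2; lo = 0xA0;            /* excluding overlongs */
        } else if (b == 0xED) {
            need = 2; hi = 0x9F;            /* excluding surrogates */
        } else if (in_range(b, 0xE1, 0xEF)) {
            need = 2;                       /* straight 3-byte */
        } else if (b == 0xF0) {
            need = 3; lo = 0x90;            /* planes 1-3 */
        } else if (b == 0xF4) {
            need = 3; hi = 0x8F;            /* plane 16 */
        } else if (in_range(b, 0xF1, 0xF3)) {
            need = 3;                       /* planes 4-15 */
        } else {
            return false;                   /* invalid lead byte */
        }

        if (len - i - 1 < need)
            return false;                   /* sequence is truncated */
        if (!in_range(p[i + 1], lo, hi))
            return false;
        for (size_t k = 2; k <= need; k++)
            if (!in_range(p[i + k], 0x80, 0xBF))
                return false;

        i += need + 1;
    }
    return true;
}
```

With this variant, a call such as is_utf8_n("a\0b", 3) succeeds because the length is passed explicitly rather than inferred from a terminator. Unlike the original, it cannot read past the end of the buffer when a sequence is cut off, since the remaining length is checked before any continuation byte is inspected.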

When I dabbled in creating my own string library, I went with modified UTF-8 (i.e. encoding NUL as an overlong two-byte sequence) - feel free to use this header as a template for providing a validation routine which doesn't suffer from the above shortcomings.

Christoph answered Sep 29 '22 00:09
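To illustrate the NUL encoding mentioned above, here is a rough sketch of how a length-counted UTF-8 buffer might be turned into a zero-terminated string in which every embedded NUL becomes the overlong pair 0xC0 0x80. The function name to_modified_utf8 is hypothetical and is not taken from the header referred to in the answer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, for illustration only.
 * Copies len bytes from src into a newly allocated, zero-terminated buffer,
 * writing each embedded 0x00 byte as the overlong pair 0xC0 0x80.
 * Returns NULL if allocation fails; the caller frees the result. */
char *to_modified_utf8(const char *src, size_t len)
{
    /* Each NUL grows by one byte, so count them first. */
    size_t nuls = 0;
    for (size_t i = 0; i < len; i++)
        if (src[i] == '\0')
            nuls++;

    char *dst = malloc(len + nuls + 1);
    if (!dst)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\0') {
            dst[j++] = (char)0xC0;
            dst[j++] = (char)0x80;
        } else {
            dst[j++] = src[i];
        }
    }
    dst[j] = '\0';   /* the only real zero byte is the terminator */
    return dst;
}
```

The result can be passed to the usual zero-terminated string routines. A validator for such strings would need to accept 0xC0 0x80 as a special case, which a strict UTF-8 check like the one above rejects as an overlong sequence.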