
UNICODE, UTF-8 and Windows mess

I'm trying to implement text support in Windows with the intention of also moving to a Linux platform later on. It would be ideal to support international languages in a uniform way but that doesn't seem to be easily accomplished when considering the two platforms in question. I have spent a considerable amount of time reading up on UNICODE, UTF-8 (and other encodings), widechars and such and here is what I have come to understand so far:

UNICODE, as the standard, defines the set of characters that can be represented and assigns each of them a code point. I refer to this as the "what": UNICODE specifies what will be available.

UTF-8 (and other encodings) specify the how: How each character will be represented in a binary format.

Now, on Windows, they opted for UCS-2 originally, but that failed to meet the requirements, so they moved to UTF-16, which uses multiple 16-bit code units (surrogate pairs) when necessary.

So here is the dilemma:

  1. Windows internally only does UTF-16, so if you want to support international characters you are forced to convert strings to their widechar versions before making the corresponding OS calls. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and having it come out looking proper. Is this correct?
  2. In C, there are some multi-byte-aware functions (_mbscat, _mbscpy, etc.); however, on Windows, the character type for those functions is defined as unsigned char*. Given that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example), you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
  3. In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which may well not be a complete code point. So how does one iterate over an std::string by code point? (C has the _mbsinc() function.)
asked Oct 26 '12 by Murrgon


2 Answers

Just do UTF-8

There are lots of support libraries for UTF-8 on every platform, and some are cross-platform too. The UTF-16 APIs in Win32 are limited and inconsistent, as you've already noted, so it's better to keep everything in UTF-8 and convert to UTF-16 at the last moment. There are also some handy UTF-8 wrappers for the Windows API.
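As a rough sketch of the "convert at the last moment" approach (the helper name Utf8ToUtf16 is just illustrative, not a library function), you keep all strings in UTF-8 and widen them with the Win32 MultiByteToWideChar call right at the API boundary:

    #include <stdexcept>
    #include <string>
    #include <windows.h>

    // Illustrative helper: widen a UTF-8 std::string to UTF-16 just before
    // handing it to a wide (W) Win32 API.
    std::wstring Utf8ToUtf16(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        // First call asks for the required length in wchar_t units.
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), (int)utf8.size(), nullptr, 0);
        if (len == 0) throw std::runtime_error("invalid UTF-8");
        std::wstring utf16(len, L'\0');
        // Second call performs the actual conversion.
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8.data(), (int)utf8.size(), &utf16[0], len);
        return utf16;
    }

    // Usage: the rest of the program never sees wchar_t, only the call site does.
    // HANDLE h = CreateFileW(Utf8ToUtf16(path).c_str(), GENERIC_READ, 0,
    //                        nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);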

Also, for application-level documents, UTF-8 is becoming more and more accepted as the standard. Every text-handling application either accepts UTF-8 or, at worst, shows it as "ASCII with some dingbats", while only a few applications support UTF-16 documents, and those that don't show them as "lots and lots of whitespace!"

answered by Javier


  1. Correct. You will convert UTF-8 to UTF-16 for your Windows API calls.

  2. Most of the time you will use regular string functions for UTF-8 -- strlen, strcpy (ick), snprintf, strtol. They work fine with UTF-8 strings, since UTF-8 never puts a zero byte inside a character. Either use char * for UTF-8 or you will have to cast everything.

    Note that the underscore versions like _mbstowcs are not standard, they are normally named without an underscore, like mbstowcs.

  3. It is difficult to come up with examples where you actually want to use operator[] on a Unicode string; my advice is to stay away from it. Likewise, iterating over a string has surprisingly few uses:

    • If you are parsing a string (e.g., the string is C or JavaScript code, maybe you want syntax highlighting) then you can do most of the work byte-by-byte and ignore the multibyte aspect.

    • If you are doing a search, you will also do this byte-by-byte (but remember to normalize first).

    • If you are looking for word breaks or grapheme cluster boundaries, you will want to use a library like ICU. The algorithm is not simple.

    • Finally, you can always convert a chunk of text to UTF-32 and work with it that way. I think this is the sanest option if you are implementing any of the Unicode algorithms like collation or breaking (a minimal decoder is sketched after this list).

    See: C++ iterate or split UTF-8 string into array of symbols?
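If you do take the "convert to UTF-32" route and don't want a library for it, decoding UTF-8 by hand is short. This is a minimal sketch that assumes the input is already valid UTF-8 and skips error handling for truncated or overlong sequences:

    #include <string>
    #include <vector>

    // Decode a UTF-8 string into a vector of code points (UTF-32).
    // Sketch only: no validation of malformed input.
    std::vector<char32_t> DecodeUtf8(const std::string& s)
    {
        std::vector<char32_t> out;
        for (size_t i = 0; i < s.size(); ) {
            unsigned char b = (unsigned char)s[i];
            char32_t cp;
            size_t len;
            if      (b < 0x80)           { cp = b;        len = 1; }  // ASCII byte
            else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }  // 2-byte sequence
            else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }  // 3-byte sequence
            else                         { cp = b & 0x07; len = 4; }  // 4-byte sequence
            for (size_t j = 1; j < len; ++j)                          // continuation bytes
                cp = (cp << 6) | ((unsigned char)s[i + j] & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }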

answered by Dietrich Epp