
Multi-platform Unicode handling based on char* in C without using 3rd party libraries?

The following are bare minimum examples (I know that e.g. UNICODE/_UNICODE should be defined) that I've found to work:

Linux:

#include <stdio.h>

int main() {
  char* str = "Rölf";
  printf("%s\n", str);
}

Windows:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
  setlocale(LC_ALL, "");
  wchar_t* str = L"Rölf";
  wprintf(L"%s\n", str);
}

Now, I've read that one way of going about it is to basically "just use UTF-8/char everywhere and worry about platform-specific conversion when you do API calls".

And that would be great: have users provide char* as input for my library and "simply" convert that. So I've tried the following snippet based on this example (I've also seen it in variations elsewhere). If this actually worked, it would be amazing. But it doesn't:

  char* str = u8"Rölf";
  int len = mbstowcs(NULL, str, 0) + 1;
  wchar_t wstr[len];
  mbstowcs(wstr, str, len);
  wprintf(L"%s\n", wstr);

I've also stumbled across discussions about console fonts and whatnot being the cause of faulty rendering, so to demonstrate that this is not a console issue - the following doesn't work either (well - the L"" literal does. The converted u8 literal doesn't):

  MessageBoxW(NULL, wstr, L"Rölf", MB_OK);


Am I misunderstanding the conversion process? Is there a way to make this work? (Without using e.g. ICU)

AndyO asked Aug 04 '18 16:08


1 Answer

The mbstowcs function converts from a string encoded in the current locale's encoding to wchar_t[], not from UTF-8 (unless that encoding happens to be UTF-8). On Windows 10 version 1803 (the April 2018 update) or later, you actually can make Windows use UTF-8 as the encoding for plain char[] strings, either as a global setting or, presumably, by calling _setmbcp(65001). Older versions of Windows explicitly forbid this, however, for dubious historical reasons.
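To see the locale dependence directly, here is a minimal sketch. The helper name wide_length_in_locale is made up for illustration, and the locale name "C.UTF-8" used in the note below is an assumption: it exists on many Linux systems but is not guaranteed anywhere.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: returns the number of wide characters that
   mbstowcs would produce for s under the named LC_CTYPE locale, or
   (size_t)-1 if the locale is unavailable or s is not valid in that
   locale's encoding. For simplicity it resets LC_CTYPE to "C"
   afterwards instead of restoring the previous locale. */
size_t wide_length_in_locale(const char *locale_name, const char *s)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;

    /* With a NULL destination, mbstowcs only counts (as in the question). */
    size_t n = mbstowcs(NULL, s, 0);

    setlocale(LC_CTYPE, "C");
    return n;
}
```

Calling wide_length_in_locale("C.UTF-8", "R\xC3\xB6lf") should yield 4 where that locale exists, because the two bytes of "ö" are decoded as one character. Under a locale with a different encoding the same bytes are either rejected or counted differently, which is the failure seen in the question.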

Anyway, the second version of your code, which you called "Windows", should work on arbitrary systems if not for a bug in MSVC's wprintf that you worked around: MSVC has the meanings of %ls and %s backwards for the wide stdio functions. In standard C, you need %ls to format a wchar_t[] string. But there's actually no reason to use wprintf there at all, and in fact wprintf is highly problematic because you can't mix it with byte-oriented stdio on the same stream (doing so invokes undefined behavior). So better would be:

#include <stdio.h>
#include <locale.h>

int main() {
  setlocale(LC_ALL, "");
  wchar_t* str = L"Rölf";
  printf("%ls\n", str);
}

and this version should work correctly on Windows and standards-conforming C implementations, since for the byte-oriented printf functions, MSVC doesn't have the meaning of %s and %ls reversed.
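The orientation rule mentioned above can be observed with fwide. This is only a sketch for a standards-conforming implementation (MSVC's fwide may not behave this way), and the helper name orientation_after_printf is hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: writes once with a byte-oriented function, then
   requests wide orientation on the same stream. Returns what fwide
   reports afterwards: negative = byte-oriented, positive = wide,
   0 = no orientation (also returned if tmpfile fails). */
int orientation_after_printf(void)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return 0;

    /* The first I/O call fixes the stream as byte-oriented, even though
       the argument being formatted is a wide string. */
    fprintf(f, "%ls\n", L"Rolf");

    /* Once set, the orientation cannot be changed by fwide. */
    int orientation = fwide(f, 1);

    fclose(f);
    return orientation;
}
```

On a conforming implementation the result is negative: the stream stays byte-oriented, which is why printf with %ls is the safer choice when the rest of the program uses byte-oriented stdio.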

If you really want to, you can also use a variant of your third version of the code, but you can't use mbstowcs to convert from UTF-8 to wchar_t. Instead you need to either:

  1. Assume wchar_t is Unicode-encoded, and convert from UTF-8 to Unicode code points with your own (or a third-party library's) UTF-8 decoder. But this is a bad assumption, because MSVC is also non-conforming in that it uses UTF-16 for wchar_t rather than Unicode code point values (equivalent to UTF-32). C explicitly forbids "multi-wchar_t characters" because the mb/wc APIs are inherently incompatible with them.

  2. Convert from UTF-8 to char32_t (UTF-32) with your own (or a third-party library's) UTF-8 decoder, then use c32rtomb to convert to the locale's multibyte encoding, and mbstowcs to get from there to wchar_t[].

  3. Use iconv (standard on POSIX systems; available as a third-party library on Windows) to convert directly from UTF-8 to wchar_t.
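As a rough illustration of the "your own UTF-8 decoder" route, here is a minimal sketch that decodes UTF-8 and re-encodes it as UTF-16 code units, which is what MSVC's wchar_t holds. The function names utf8_decode_one and utf8_to_utf16 are made up for this example, and uint16_t stands in for a 16-bit wchar_t. It is not a hardened implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s. Stores the code point in *cp and returns
   the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode_one(const unsigned char *s, uint32_t *cp)
{
    size_t len;
    uint32_t c, min;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;

    for (size_t i = 1; i < len; i++) {
        /* A NUL terminator fails this check, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Convert NUL-terminated UTF-8 to UTF-16. cap is the size of dst in units,
   including room for the terminating 0. Returns the number of units written
   (excluding the terminator), or (size_t)-1 on bad input or lack of space. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s != 0) {
        uint32_t cp;
        size_t used = utf8_decode_one(s, &cp);
        if (used == 0) return (size_t)-1;
        s += used;

        if (cp < 0x10000) {
            if (n + 1 >= cap) return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        } else {
            /* Code points above the BMP become a surrogate pair. */
            if (n + 2 >= cap) return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap) return (size_t)-1;
    dst[n] = 0;
    return n;
}
```

For the UTF-8 bytes of "Rölf" this produces the four units 0x52, 0xF6, 0x6C, 0x66. On Windows, where wchar_t is 16 bits wide, a buffer filled this way could be handed to a wide API such as MessageBoxW after a cast. On systems with a 32-bit wchar_t you would store the decoded code points directly instead.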


UTF8 option for Windows 10, version 1803+

R.. GitHub STOP HELPING ICE answered Oct 09 '22 05:10