The following are bare minimum examples (I know that e.g. UNICODE/_UNICODE should be defined) that I've found to work:
Linux:
#include <stdio.h>
int main() {
char* str = "Rölf";
printf("%s\n", str);
}
Windows:
#include <stdio.h>
#include <locale.h>
int main() {
setlocale(LC_ALL, "");
wchar_t* str = L"Rölf";
wprintf(L"%s\n", str);
}
Now, I've read that one way of going about it is to basically "just use UTF-8/char everywhere and worry about platform-specific conversion when you do API calls".
And that would be great: have users provide char* as input for my library and "simply" convert that. So I've tried the following snippet based on this example (I've also seen it in variations elsewhere). If this actually worked, it would be amazing. But it doesn't:
char* str = u8"Rölf";
int len = mbstowcs(NULL, str, 0) + 1;
wchar_t wstr[len];
mbstowcs(wstr, str, len);
wprintf(L"%s\n", wstr);
I've also stumbled across discussions about console fonts and whatnot being the cause of faulty rendering, so to demonstrate that this is not a console issue - the following doesn't work either (well - the L"" literal does. The converted u8 literal doesn't):
MessageBoxW(NULL, wstr, L"Rölf", MB_OK);
Am I misunderstanding the conversion process? Is there a way to make this work? (Without using e.g. ICU)
The mbstowcs function converts from a string encoded in the current locale's encoding to wchar_t[], not from UTF-8 (unless that encoding is UTF-8). On Windows 10 builds from the April 2018 beta onward, you actually can set Windows to use UTF-8 as the encoding for plain char[] strings, either as a global setting or presumably by calling _setmbcp(65001). Older versions of Windows explicitly forbid this, however, for dubious historical reasons.
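One practical consequence: when the bytes are not valid in the locale's encoding, mbstowcs returns (size_t)-1, and the snippet in the question then computes a length of 0 from it. A minimal sketch of a checked conversion follows; the helper name locale_to_wide is made up for illustration, and it relies on mbstowcs accepting a NULL destination to count characters, which POSIX and MSVC support and which the question's snippet already assumes.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): converts a string in the
   current locale's encoding to a newly allocated wide string.
   Returns NULL if the bytes are not valid in that encoding or if
   allocation fails. The caller frees the result. */
wchar_t *locale_to_wide(const char *s)
{
    /* With a NULL destination, mbstowcs only counts the wide characters. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;                 /* invalid sequence for this locale */

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* n + 1 leaves room for the terminating null wide character. */
    mbstowcs(w, s, n + 1);
    return w;
}
```

Called on a UTF-8 string, this only yields the expected text if setlocale has selected a locale whose encoding is UTF-8; otherwise it returns NULL or converts the bytes according to whatever encoding the locale does use.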
Anyway, your second version of the code, which you called "Windows", should work on arbitrary systems if not for a bug in MSVC's wprintf that you worked around: MSVC has the meanings of %ls and %s backwards for the wide stdio functions. In standard C, you need %ls to format a wchar_t[] string. But there's actually no reason to use wprintf there at all, and in fact wprintf is highly problematic because you can't mix it with byte-oriented stdio on the same stream (doing so invokes undefined behavior). So better would be:
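#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void) {
    setlocale(LC_ALL, "");  /* adopt the environment's locale so non-ASCII text can be encoded */
    const wchar_t *str = L"Rölf";
    printf("%ls\n", str);   /* %ls is the standard conversion for a wchar_t string */
    return 0;
}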
and this version should work correctly both on Windows and on standards-conforming C implementations, since for the byte-oriented printf functions, MSVC doesn't have the meanings of %s and %ls reversed.
If you really want to, you can also use a variant of your third version of the code, but you can't use mbstowcs to convert from UTF-8 to wchar_t. Instead you need to either:
Assume wchar_t is Unicode-encoded, and convert from UTF-8 to Unicode codepoints with your own (or a third-party library's) UTF-8 decoder. But this is a bad assumption, because MSVC is also non-conforming in that it uses UTF-16 for wchar_t, not Unicode codepoint values (equivalent to UTF-32). C explicitly forbids "multi-wchar_t characters" because the mb/wc APIs are inherently incompatible with them.
Convert from UTF-8 to char32_t (UTF-32) with your own (or a third-party library's) UTF-8 decoder, then use c32rtomb to convert to a multibyte char[] string in the locale's encoding, which mbstowcs can then convert to wchar_t[].
Use iconv (standard on POSIX systems; available as a third-party library on Windows) to convert directly from UTF-8 to wchar_t.
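The "your own UTF-8 decoder" step that the first two options rely on can be sketched roughly as below. This is an illustration rather than a vetted library: the function names utf8_decode and utf8_to_utf32 are made up, and uint32_t stands in for char32_t so that the sketch does not depend on <uchar.h>, which some platforms lack.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from the NUL-terminated
   UTF-8 string s into *out. Returns the number of bytes consumed,
   or 0 if the sequence is malformed. */
size_t utf8_decode(const unsigned char *s, uint32_t *out)
{
    uint32_t cp, min;
    size_t len;

    if (s[0] < 0x80) { *out = s[0]; return 1; }          /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; min = 0x10000; }
    else return 0;              /* stray continuation or invalid lead byte */

    for (size_t i = 1; i < len; i++) {
        /* A NUL terminator fails this check too, so a truncated
           sequence is rejected without reading past the string. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates, and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}

/* Hypothetical helper: decodes a whole string into dst (capacity cap).
   Returns the number of code points written, or (size_t)-1 if the input
   is malformed or dst is too small. No terminator is written. */
size_t utf8_to_utf32(const char *src, uint32_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s != 0) {
        size_t used;
        if (n == cap)
            return (size_t)-1;
        used = utf8_decode(s, &dst[n]);
        if (used == 0)
            return (size_t)-1;
        s += used;
        n++;
    }
    return n;
}
```

Where wchar_t holds Unicode code point values, each decoded value can be stored in a wchar_t directly. With MSVC's 16-bit wchar_t, code points above 0xFFFF would additionally have to be split into UTF-16 surrogate pairs, which this sketch does not do.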