I am experimenting with wctomb to convert a wchar_t to its UTF-8 equivalent, stored in a char[]. It works nicely, except for surrogate code points in the range U+D800 to U+DFFF.
int ret;
// null-terminated
// VS gives a warning on wctomb() for buffer overrunning on char mb[4]={0} for some reason ...
char mb[5] = { 0 };
setlocale(LC_ALL, "en-US.utf8");
// Gives 0xE2 0xAA 0x96 just fine, wctomb returns 3
ret = wctomb(mb, L'\x2A96');
// expected 0xED 0xBA 0xA0, but wctomb returns -1, i.e. invalid character
ret = wctomb(mb, L'\xDEA0');
Is there another way to get the UTF-8 form of the surrogate character alone?
I also tried wctomb_s, checking the returned errno_t and passing &ret for the byte count, but it yields the same outcome.
Is there another way to get the UTF-8 form of the surrogate character alone?
No.
Single UTF-16 surrogates have no proper UTF-8 equivalent. A byte sequence that would encode a lone surrogate, such as 0xED 0xBA 0xA0, must be treated as an invalid UTF-8 byte sequence.
If the source string lacks the proper pair (high surrogate, then low surrogate), then no proper UTF-8 equivalent exists.
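To illustrate why a proper pair is needed, here is a minimal sketch that combines a high and a low surrogate into a single code point and then encodes that code point as UTF-8. The names utf8_encode and combine_surrogates are hypothetical helpers written for this example, not part of the C standard library, and the sketch assumes the input is available as 16-bit UTF-16 code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine a high surrogate (U+D800..U+DBFF) and a
   low surrogate (U+DC00..U+DFFF) into one code point.
   Returns 0 if the two units do not form a proper pair. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Hypothetical helper: encode one code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF, since neither has a UTF-8 form. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; /* lone surrogate: refuse to encode */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With these helpers, U+2A96 encodes to the same three bytes the question reports, while 0xDEA0 on its own is rejected, just as wctomb rejects it.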
Rather than pass along the ill-formed data, consider detecting it and returning an error indication.
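One way to do that detection, as a rough sketch: scan the UTF-16 input before converting and report the position of the first unpaired surrogate. The function name find_lone_surrogate and the convention of returning -1 for well-formed input are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the index of the first unpaired surrogate
   in a buffer of n UTF-16 code units, or -1 if every surrogate is part
   of a proper pair (high surrogate immediately followed by low). */
long find_lone_surrogate(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* A high surrogate must be followed by a low surrogate. */
            if (i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                ++i; /* skip the low half of the pair */
                continue;
            }
            return (long)i;
        }
        if (u >= 0xDC00 && u <= 0xDFFF)
            return (long)i; /* low surrogate with no preceding high */
    }
    return -1;
}
```

A caller can treat any non-negative result as an error and stop, or replace the offending unit with U+FFFD if substituting is acceptable for the application.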