I know this question has been asked quite a few times here, and I did read some of the answers, but there are several suggested solutions and I'm trying to figure out which is best.
I'm writing a C99 app that basically receives XML text encoded in UTF-8.
Part of its job is to copy and manipulate that string (finding a substring, concatenating, etc.).
As I would rather not use an external non-standard library right now, I'm trying to implement it using wchar_t.
Currently, I'm using mbstowcs to convert it to wchar_t for easy manipulation, and for some input I tried in different languages it worked fine.
The thing is, I read that some people have had issues with UTF-8 and mbstowcs, so I would like to hear whether this use is acceptable.
Another option I came across was using iconv with the WCHAR_T parameter. The thing is, I'm working on a platform (not a PC) whose locale support is limited to the ANSI C locale. How about that?
I also encountered a very popular C++ library, but I'm limited to a C99 implementation.
Also, I will be compiling this code on another platform, where sizeof(wchar_t) is different (2 bytes versus 4 bytes on my machine). How can I overcome that? Using fixed-size char containers? But then, which manipulation functions should I use instead?
Happy to hear some thoughts. Thanks.
C does not define what encoding the char and wchar_t types are, and the standard library only mandates some functions that translate between the two without saying how. If the implementation-dependent encoding of char is not UTF-8, then mbstowcs will result in data corruption.
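To make the locale dependence concrete, here is a minimal sketch. The helper name convert_with_locale is made up for illustration; setlocale and mbstowcs are the standard calls. The point is that the same bytes can convert differently, or fail, depending on which locale the platform lets you select.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts src to wide characters under the named
 * locale. Returns the number of wide characters written, or (size_t)-1
 * if the locale is unavailable or src is not valid in that locale's
 * multibyte encoding. Note that it changes the program-wide LC_CTYPE
 * locale and does not restore it. */
size_t convert_with_locale(const char *locale_name, const char *src,
                           wchar_t *dst, size_t dst_len)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;              /* locale not supported here */
    return mbstowcs(dst, src, dst_len); /* (size_t)-1 on invalid input */
}
```

On a platform that only offers the "C" locale, only the basic character set is guaranteed to convert. A name such as "en_US.UTF-8" is an assumption about what a desktop system might provide, not something the C standard promises.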
As noted in the rationale for the C99 standard:
However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.
...
C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.
Sourced from here.
So, if you have UTF-8 data in your chars, there isn't a standard API way to convert that to wchar_ts.
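Since the standard library gives no guaranteed path, one option that stays within C99 is to decode the UTF-8 bytes yourself. The following is only a sketch under my own assumptions: the function name utf8_decode and its return convention are invented here, and it stores code points in uint32_t rather than wchar_t so the result does not depend on the platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from the first n bytes of s.
 * On success stores it in *cp and returns the number of bytes consumed
 * (1 to 4). Returns 0 for truncated, overlong, or otherwise invalid input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or bad lead byte */
    }

    if (n < len)
        return 0;                       /* sequence is cut off */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

Calling it in a loop and advancing by the returned length walks a whole buffer. A return of 0 means the input was not well-formed UTF-8 at that position.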
In my opinion wchar_t should usually be avoided unless necessary - you might need it if you're using WIN32 APIs, for example. I am not convinced it will simplify string manipulation. wchar_t is always UTF-16LE on Windows, so you may still need to have more than one wchar_t to represent a single Unicode code point anyway.
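To illustrate the "more than one unit per code point" issue, here is a small sketch of UTF-16 encoding. The name utf16_encode is hypothetical and I use uint16_t instead of wchar_t so it behaves the same everywhere. Any code point above U+FFFF becomes a surrogate pair, so indexing by 16-bit unit is not indexing by character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point cp as UTF-16 into out.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * a surrogate or lies beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For example, U+20AC fits in one unit, while U+1F600 comes out as the pair 0xD83D 0xDE00.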
I suggest you investigate the ICU project - at least from an educational standpoint.
Also, I will be compiling this code on another platform, where sizeof(wchar_t) is different (2 bytes versus 4 bytes on my machine). How can I overcome that? Using fixed-size char containers?
You could do that with conditional typedefs like this:
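#include <stddef.h>   /* wchar_t */
#include <stdint.h>   /* uint16_t, uint32_t */

/* 16-bit code unit type */
#if defined(__STDC_UTF_16__)
#include <uchar.h>    /* char16_t; assumes the platform ships this C11 header */
typedef char16_t CHAR16;
#elif defined(_WIN32)
typedef wchar_t CHAR16;
#else
typedef uint16_t CHAR16;
#endif

/* 32-bit code unit type */
#if defined(__STDC_UTF_32__)
#include <uchar.h>    /* char32_t */
typedef char32_t CHAR32;
#elif defined(__STDC_ISO_10646__)
typedef wchar_t CHAR32;
#else
typedef uint32_t CHAR32;
#endif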
This will define the typedefs CHAR16 and CHAR32 to use the new C11/C++11 character types if available, but otherwise fall back to using wchar_t when possible and fixed-width unsigned integers otherwise.
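As for which manipulation functions to use with such typedefs: the wcs* functions only apply when the type really is wchar_t, so with a fixed-width type you would most likely write small helpers yourself. A rough sketch, assuming the uint32_t fallback branch and zero-terminated arrays (char32_len and char32_find are names I made up):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t CHAR32;  /* assumption: the fixed-width fallback branch */

/* Hypothetical strlen analogue: counts units before the terminating 0. */
size_t char32_len(const CHAR32 *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical strstr analogue: returns a pointer to the first occurrence
 * of needle in hay, or NULL if there is none. */
const CHAR32 *char32_find(const CHAR32 *hay, const CHAR32 *needle)
{
    size_t i;

    if (needle[0] == 0)
        return hay;                     /* empty needle matches at start */
    for (; *hay != 0; hay++) {
        for (i = 0; needle[i] != 0 && hay[i] == needle[i]; i++)
            ;                           /* advance while units match */
        if (needle[i] == 0)
            return hay;                 /* whole needle matched */
    }
    return NULL;
}
```

With 32-bit units each element is a whole code point, so a substring search like this never splits a character. That would not hold for a 16-bit type because of surrogate pairs.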