
Converting a UTF-8 text to wchar_t

Tags: c, utf-8, wchar-t

I know this question has been asked quite a few times here, and I did read some of the answers, but several different solutions were suggested and I'm trying to figure out which is best.

I'm writing a C99 app that basically receives XML text encoded in UTF-8.

Part of its job is to copy and manipulate that string (finding a substring, concatenating, etc.).

As I would rather not use an outside, non-standard library right now, I'm trying to implement it using wchar_t.

Currently, I'm using mbstowcs to convert it to wchar_t for easy manipulation, and for the inputs I tried in several languages it worked fine.

The thing is, I have read that some people had issues with UTF-8 and mbstowcs, so I would like to hear whether this use is acceptable.

Another option I came across was using iconv with the WCHAR_T parameter. The thing is, I'm working on a platform (not a PC) whose locale support is limited to the ANSI C locale only. Would that work there?

I also came across a very popular C++ library, but I'm limited to a C99 implementation.

Also, I will be compiling this code on another platform where sizeof(wchar_t) is different (2 bytes versus 4 bytes on my machine). How can I overcome that? By using fixed-size character containers? But then, which manipulation functions should I use instead?

Happy to hear some thoughts. Thanks.

Yarel asked Jan 14 '14 18:01


2 Answers

C does not define the encoding of the char and wchar_t types; the standard library only mandates a few functions that translate between the two, without saying how. mbstowcs interprets its input according to the current locale (the LC_CTYPE category), so if that locale's multibyte encoding is not UTF-8, the conversion will either fail or corrupt the data.
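To make the locale dependence concrete, here is a minimal sketch of my own (not something the standard provides). The helper name convert_in_locale is hypothetical, and any locale name you pass to it, such as "en_US.UTF-8", is an assumption: it may simply not exist on a platform that only ships the "C" locale.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert src to wide characters under the named
 * locale, then restore the previous LC_CTYPE setting.
 * Returns the number of wide characters written (excluding the
 * terminator), or (size_t)-1 if the locale is unavailable or the bytes
 * are not valid in that locale's encoding. A return value equal to
 * dst_len means the output was truncated and is not terminated. */
size_t convert_in_locale(const char *locale_name, const char *src,
                         wchar_t *dst, size_t dst_len)
{
    char saved[128];
    const char *old = setlocale(LC_CTYPE, NULL);  /* query only */
    size_t n;

    if (old == NULL || strlen(old) >= sizeof saved)
        return (size_t)-1;
    strcpy(saved, old);  /* copy: later setlocale calls may overwrite it */

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;  /* e.g. no UTF-8 locale installed */

    n = mbstowcs(dst, src, dst_len);
    setlocale(LC_CTYPE, saved);
    return n;
}
```

Plain ASCII converts the same way in every locale, which is probably why quick tests tend to pass. What happens to bytes above 0x7F depends entirely on which locale is active, and that is the part you cannot rely on.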

As noted in the rationale for the C99 standard:

However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.

...

C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.

(Source: the C99 Rationale.)

So, if you have UTF-8 data in your chars, there is no portable standard-library way to convert it to wchar_t: the result depends on the locale and on the implementation.
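If you would rather not depend on the locale at all, decoding UTF-8 by hand is not much code. The following is a sketch under my own assumptions: the function names utf8_decode and utf8_to_utf32 are made up for illustration, and the output goes into uint32_t code points rather than wchar_t so that it behaves the same on every platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the UTF-8 bytes at s (at most len bytes).
 * Stores the code point in *out and returns the number of bytes consumed,
 * or 0 if the input is truncated or not well-formed UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t need, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { *out = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len < need)
        return 0;                   /* sequence cut off */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    if (cp < min_for_len[need] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = cp;
    return need;
}

/* Decode a whole buffer into dst (capacity dst_len code points).
 * Returns the number of code points written, or (size_t)-1 on malformed
 * input or insufficient space. */
size_t utf8_to_utf32(const char *src, size_t src_len,
                     uint32_t *dst, size_t dst_len)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    while (src_len > 0) {
        uint32_t cp;
        size_t used = utf8_decode(p, src_len, &cp);
        if (used == 0 || n == dst_len)
            return (size_t)-1;
        dst[n++] = cp;
        p += used;
        src_len -= used;
    }
    return n;
}
```

Since the input has at least as many bytes as code points, a destination buffer of src_len elements is always large enough.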

In my opinion wchar_t should usually be avoided unless necessary - you might need it if you're using WIN32 APIs, for example. I am not convinced it will simplify string manipulation. On Windows, wchar_t is a 16-bit type holding UTF-16 code units, so you may still need more than one wchar_t (a surrogate pair) to represent a single Unicode code point anyway.
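To illustrate the surrogate-pair point, here is a small sketch of how a single code point maps to UTF-16 code units. The function name utf16_encode is hypothetical, and uint16_t stands in for a 16-bit wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units in out[0..1].
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                     /* fits in one unit */
        return 1;
    }
    cp -= 0x10000;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));    /* low surrogate */
    return 2;
}
```

Anything outside the Basic Multilingual Plane, U+1F600 for instance, takes two units, so with a 16-bit wchar_t "one element per character" does not hold either.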

I suggest you investigate the ICU project - at least from an educational standpoint.

McDowell answered Sep 19 '22 07:09


Also, I will be compiling this code on another platform where sizeof(wchar_t) is different (2 bytes versus 4 bytes on my machine). How can I overcome that? By using fixed-size character containers?

You could do that with conditional typedefs like this:

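#include <stddef.h>   /* wchar_t */
#include <stdint.h>   /* uint16_t, uint32_t */

#if defined(__STDC_UTF_16__)
   #include <uchar.h>           /* C11: char16_t holds UTF-16 code units */
   typedef char16_t  CHAR16;
#elif defined(_WIN32)
   typedef wchar_t   CHAR16;    /* wchar_t is 16 bits wide on Windows */
#else
   typedef uint16_t  CHAR16;
#endif

#if defined(__STDC_UTF_32__)
   #include <uchar.h>           /* C11: char32_t holds UTF-32 code units */
   typedef char32_t  CHAR32;
#elif defined(__STDC_ISO_10646__)
   typedef wchar_t   CHAR32;    /* wchar_t holds ISO 10646 code points */
#else
   typedef uint32_t  CHAR32;
#endif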

This will define the typedefs CHAR16 and CHAR32 to use the Unicode character types introduced in C11 (and C++11) if available, but otherwise fall back to using wchar_t when possible and fixed-width unsigned integers otherwise.
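As for which manipulation functions to use with fixed-size containers: the wcs* functions only work on wchar_t, so you would need your own small equivalents. Here is a rough sketch, with uint32_t standing in for CHAR32 and the hypothetical names u32_len and u32_find playing the role of wcslen and wcsstr.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of a zero-terminated sequence of 32-bit code points. */
size_t u32_len(const uint32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* First occurrence of needle in haystack, or NULL if absent
 * (a fixed-width counterpart of strstr / wcsstr). */
const uint32_t *u32_find(const uint32_t *haystack, const uint32_t *needle)
{
    size_t hlen = u32_len(haystack), nlen = u32_len(needle), i, j;

    if (nlen > hlen)
        return NULL;
    for (i = 0; i + nlen <= hlen; i++) {
        /* Compare needle against the window starting at position i. */
        for (j = 0; j < nlen && haystack[i + j] == needle[j]; j++)
            ;
        if (j == nlen)
            return haystack + i;
    }
    return NULL;
}
```

Copying and concatenating can be done the same way, or with memcpy on the element count times sizeof(uint32_t).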

dan04 answered Sep 20 '22 07:09