
How does gcc decide the wide character set when calling `mbtowc()`?

According to the gcc manual, the option -fwide-exec-charset specifies the wide character set of wide string and character constants at compile time.

But what is the wide character set when converting a multi-byte character to a wide character by calling mbtowc() at run time? The POSIX standard says that the character set of multi-byte characters is determined by the LC_CTYPE category of the current locale, but says nothing about the wide character set. I don't have a C standard at hand now so I don't know what the C standard says about this.

Does the gcc option -fwide-exec-charset determine the wide character set used by mbtowc(), just as it does at compile time?

spockwang asked Mar 15 '13 06:03



1 Answer

Short answer: the character set used for wide strings gets determined by the characteristics of wchar_t known at compile time. As mbtowc is a library function, this happens when libc is being built.

mbtowc reads a single character from a string encoded in an external charset and writes it out as a wchar_t value, a type meant to be able to represent any character. Likewise, mbstowcs converts an externally encoded C string into a simple array of wchar_t. From the system's point of view, it doesn't make sense to specify the "charset" of the resulting wide character or string, because changing its output encoding in any way would break the use of the resulting wide string as an array of wchar_t.
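To make the mbtowc behavior described above concrete, here is a minimal sketch of a character-by-character decoding loop. The helper name decode_with_mbtowc and its capacity convention are made up for illustration. It assumes the caller has already selected a locale (for example with setlocale(LC_CTYPE, "")) if the input is not plain ASCII; without that call the program runs in the "C" locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the multibyte string `src` one character at a
 * time with mbtowc(). `cap` is the capacity of `dst` in wchar_t elements,
 * including room for the terminator. Returns the number of wide characters
 * stored (excluding the terminator), or (size_t)-1 on an invalid sequence. */
size_t decode_with_mbtowc(wchar_t *dst, size_t cap, const char *src)
{
    size_t n = 0;
    size_t left = strlen(src);

    (void)mbtowc(NULL, NULL, 0);          /* reset any internal shift state */

    while (left > 0 && n + 1 < cap) {
        wchar_t wc;
        int len = mbtowc(&wc, src, left); /* input encoding comes from LC_CTYPE */
        if (len < 0)
            return (size_t)-1;            /* invalid or incomplete sequence */
        if (len == 0)
            break;                        /* reached a null character */
        dst[n++] = wc;                    /* output is just a wchar_t value */
        src += len;
        left -= (size_t)len;
    }
    if (cap > 0)
        dst[n] = L'\0';
    return n;
}
```

Note that nothing in the call selects an output charset: the only locale-dependent part is how the input bytes are interpreted.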

You can describe mbstowcs as producing a Unicode encoding such as UCS-2/UTF-16 or UCS-4/UTF-32 if the wide chars correspond to ISO 10646 code points, with the exact form depending on the width of wchar_t. You can also describe it as little-endian or big-endian, depending on the endianness of the processor's representation of wchar_t. But those are properties of the platform, which you can't change at run time any more than you can change endianness, or switch from ASCII to EBCDIC.
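The platform properties mentioned above can be inspected from a program. The following is a rough sketch; the three function names are invented for this example. It relies on the standard macro __STDC_ISO_10646__, which an implementation defines when wchar_t values are ISO 10646 code points, and on a simple byte probe for endianness.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: 1 if the implementation promises that wchar_t values
 * are ISO 10646 (Unicode) code points, 0 otherwise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: width of wchar_t in bits on this platform. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: 1 if the first byte of a wchar_t in memory is its
 * least significant byte (little-endian), 0 otherwise. */
int wchar_is_little_endian(void)
{
    wchar_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

All three answers are fixed for a given compiler and libc combination, which is the point of the paragraph above: they are not something mbtowc or the locale can change at run time.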

-fwide-exec-charset serves to explicitly tell the compiler which charset corresponds to the internal representation of an array of wchar_t. This is useful when that representation differs from what the compiler would normally generate (because you are cross-compiling, or because the compiler was misconfigured). This is why the manual goes on to warn that "you will have problems with encodings that do not fit exactly in wchar_t."
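One way to see the relationship between the compile-time and run-time sides is to compare a wide string literal, which the compiler encodes, with the output of mbstowcs, which libc produces. This is only a sketch: runtime_matches_literal is a made-up helper, the 64-element buffer is an arbitrary choice, and the comparison is expected to succeed only when the compiler's wide execution charset matches the libc's wchar_t representation, as it does with default settings on typical systems.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert `narrow` at run time with mbstowcs() and
 * compare the result with a wide literal produced by the compiler.
 * Returns 1 if they match, 0 if they differ, -1 if the conversion fails.
 * Input longer than 63 wide characters is truncated before comparing. */
int runtime_matches_literal(const char *narrow, const wchar_t *wide_literal)
{
    wchar_t buf[64];
    size_t n = mbstowcs(buf, narrow, 63);   /* run-time side: done by libc */

    if (n == (size_t)-1)
        return -1;                          /* invalid multibyte sequence */
    buf[n] = L'\0';                         /* n <= 63, so this is in bounds */

    return wcscmp(buf, wide_literal) == 0;  /* compile-time side: the literal */
}
```

A call such as runtime_matches_literal("abc", L"abc") exercises both sides. If the program were built with a -fwide-exec-charset value that does not match the platform's wchar_t representation, only the literal would change, which is the mismatch the manual's warning is about.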

user4815162342 answered Oct 18 '22 03:10
