I'm primarily interested in Unix-like systems (e.g., portable POSIX), as it seems like Windows does strange things for wide characters.
Do the wide character read and write functions (like getwchar() and putwchar()) always "do the right thing", for example read from UTF-8 and write to UTF-8 when that is the set locale, or do I have to manually call wcrtomb() and print the string using e.g. fputs()? On my system (openSUSE 12.3), where $LANG is set to en_GB.UTF-8, they do seem to do the right thing (inspecting the output I see what looks like UTF-8 even though the strings were stored using wchar_t and written using the wide character functions).
However I am unsure if this is guaranteed. For example cprogramming.com states that:
[wide characters] should not be used for output, since spurious zero bytes and other low-ASCII characters with common meanings (such as '/' and '\n') will likely be sprinkled throughout the data.
This seems to indicate that outputting wide characters (presumably using the wide character output functions) can wreak havoc.
Since the C standard does not seem to mention encoding at all, I really have no idea who/when/how encoding is applied when using wchar_t. So my question is basically whether reading, writing and using wide characters exclusively is a proper thing to do when my application has no need to know about the encoding used. I only need string lengths and console widths (wcswidth()), so to me using wchar_t everywhere when dealing with text seems ideal.
The relevant text governing the behavior of the wide character stdio functions and their relationship to locale is from POSIX XSH 2.5.2 Stream Orientation and Encoding Rules:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_02
Basically, the wide character stdio functions always write in the encoding that's in effect (per the LC_CTYPE locale category) at the time the FILE stream becomes wide-oriented; that is, either the first time a wide stdio function is called on it, or when fwide is used to set the orientation to wide. So as long as a proper LC_CTYPE locale is in effect matching the desired "system" encoding (e.g. UTF-8) when you start working with the stream, everything should be fine.
However, one important consideration you should not overlook is that you must not mix byte-oriented and wide-oriented operations on the same FILE stream. Failure to observe this rule is not a reportable error; it simply results in undefined behavior. As a good deal of library code assumes stderr is byte oriented (and some even makes the same assumption about stdout), I would strongly discourage ever using wide-oriented functions on the standard streams. If you do, you need to be very careful about which library functions you use.
Really, I can't think of any reason at all to use wide-oriented functions. fprintf is perfectly capable of sending wide-character strings to byte-oriented FILE streams using the %ls specifier.