Any downsides using '?' instead of L'?' with wchar_t?

Are there any downsides to using '?'-style character literals to compare against, or assign to, values known to be of type wchar_t, instead of using L'?'-style literals?

user541686 asked Jul 17 '12 16:07



3 Answers

They have the wrong datatype and encoding, so that's a bad idea. The compiler will silently widen character literals (for strings you'd get a type mismatch compile error), using the standard integral conversions (such as sign-extension). But the value might not match.

For example, characters 0x80 through 0xff often map to different Unicode codepoints, and the exact mapping varies depending on the compiler's codepage.

Clearly, it's not possible for Unicode to map all the various codepages using an identity conversion. If merely widening were enough, there'd be no need for functions like mbstowcs.

WRT your specific question about '\xAB' vs L'\xAB', they probably are not equal. See http://ideone.com/b1E39
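To make the widening problem concrete, here is a minimal sketch of the '\xAB' comparison. The function names are made up for illustration, and the result depends on the platform: where plain char is signed, the narrow literal is typically negative and gets sign-extended, so the comparison fails; where char is unsigned, the two values happen to agree numerically (which still says nothing about the encoding being right).

```cpp
#include <iostream>
#include <type_traits>

// Hypothetical helper: widens the narrow literal '\xAB' through the usual
// integral conversion and compares it with the wide literal L'\xAB'.
bool narrow_equals_wide_0xAB() {
    wchar_t widened = '\xAB';   // implicit conversion from char to wchar_t
    return widened == L'\xAB';
}

// Hypothetical helper: prints the numeric values involved so the
// mismatch (if any) is visible on the current platform.
void report_0xAB() {
    std::cout << "plain char is "
              << (std::is_signed<char>::value ? "signed" : "unsigned") << '\n';
    std::cout << "'\\xAB'  as long: " << static_cast<long>('\xAB') << '\n';
    std::cout << "L'\\xAB' as long: " << static_cast<long>(L'\xAB') << '\n';
    std::cout << "equal after widening: " << std::boolalpha
              << narrow_equals_wide_0xAB() << '\n';
}
```

Calling report_0xAB() from main on a typical x86 Linux or Windows build should report that the two are not equal, because char is signed there.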

Ben Voigt answered Nov 13 '22 01:11



As I mentioned, the standard says

A char array (whether plain char, signed char, or unsigned char), char16_t array, char32_t array, or wchar_t array can be initialized by a narrow character literal...

However, in the section for the __STDC_MB_MIGHT_NEQ_WC__ preprocessor definition, it says

The integer constant 1, intended to indicate that, in the encoding for wchar_t, a member of the basic character set need not have a code value equal to its value when used as the lone character in an ordinary character literal.

And for __STDC_ISO_10646__:

An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character.

I am not exactly a professional at interpreting the standard, but I think that means the answer to your question is that they may have different representations, and you should always use the L prefix.
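If you want to see what your own implementation promises, the two macros quoted above can be tested with the preprocessor. This is only a sketch; the function names are invented for the example, and the absence of __STDC_ISO_10646__ does not prove that wchar_t is not Unicode, only that the implementation makes no such promise.

```cpp
#include <iostream>

// Hypothetical helper: true if the implementation defines
// __STDC_ISO_10646__, i.e. promises ISO 10646 values in wchar_t.
constexpr bool wchar_is_iso_10646() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: true if the implementation defines
// __STDC_MB_MIGHT_NEQ_WC__, i.e. warns that even basic characters
// may differ between '?' and L'?'.
constexpr bool basic_chars_might_differ() {
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    return true;
#else
    return false;
#endif
}

// Prints both answers for the current compiler and library.
void report_wchar_guarantees() {
    std::cout << std::boolalpha
              << "wchar_t promised to be ISO 10646: "
              << wchar_is_iso_10646() << '\n'
              << "basic characters might differ:    "
              << basic_chars_might_differ() << '\n';
}
```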

Seth Carnegie answered Nov 13 '22 00:11



The only downside is that your program might fail on stone-age systems using EBCDIC. On any real world system worth consideration, char and wchar_t values for the portable character set are all ASCII, and on increasingly many (but not all), wchar_t is a Unicode codepoint number.
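If you do decide to rely on this, one option is to state the assumption at compile time, so that a build on an unusual system fails loudly instead of misbehaving. A minimal sketch, with a helper name chosen just for this example and only a handful of characters checked:

```cpp
// Fails to compile on an implementation where these basic characters
// have different values as narrow and wide literals.
static_assert('?' == L'?' && 'A' == L'A' && '0' == L'0',
              "narrow and wide literals differ for basic characters");

// Hypothetical helper: with the assertion above in place, comparing a
// wchar_t against the narrow '?' gives the same answer as against L'?'.
constexpr bool is_question_mark(wchar_t c) {
    return c == '?';
}
```

This only covers the portable character set; it does nothing for characters such as '\xAB', where the L prefix is still needed.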

R.. GitHub STOP HELPING ICE answered Nov 12 '22 23:11
