
Conflict between the C++ standard's definition of a wchar_t string and the Windows implementation?

From C++03, section 2.13:

A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below

The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L'\0'.

From C++0x, section 2.14.5:

A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'.
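As a rough illustration of the two counting rules quoted above, the following sketch checks them at compile time for char16_t and char32_t literals. The helper name literal_size is made up for this example, and it assumes a C++11 compiler.

```cpp
#include <cstddef>

// Hypothetical helper: number of array elements in a string literal,
// including the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_size(const CharT (&)[N]) { return N; }

// U+E0005 lies outside the BMP.
// char32_t: one element for the code point, one for U'\0'.
static_assert(literal_size(U"\U000E0005") == 2, "char32_t literal");

// char16_t: two elements for the surrogate pair, one for u'\0'.
static_assert(literal_size(u"\U000E0005") == 3, "char16_t literal");

// A BMP character needs only one char16_t, plus u'\0'.
static_assert(literal_size(u"\u00E9") == 2, "BMP char16_t literal");
```

If the translation unit compiles, the sizes match the quoted wording. The open question in this post is which of these two behaviours a wide (L"...") literal follows on a given compiler.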

The C++03 wording is quite vague. In C++0x, however, the size of a wide string literal is counted the same way as for a char32_t literal and differently from a char16_t literal: a character that requires a surrogate pair does not add an extra element.

This post explains clearly how Windows implements wchar_t: https://stackoverflow.com/questions/402283?tab=votes%23tab-top

In short, wchar_t on Windows is 16 bits wide and encoded as UTF-16. The wording in the standard therefore appears to conflict with what happens on Windows.

For example:

wchar_t kk[] = L"\U000E0005";

This code point does not fit in 16 bits, so UTF-16 needs two 16-bit code units (a surrogate pair) to encode it.

However, according to the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for the \0).

But in its internal storage, Windows needs 3 16-bit wchar_t objects: 2 for the surrogate pair and 1 for the \0. So, by the definition of an array, kk is an array of 3 wchar_t.

The two apparently conflict with each other.
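To see which count a particular compiler actually picks, something like the sketch below could be used. The names kk_units, kk_expected_by_rule and kk_matches_rule are invented for this example. I would expect 2 elements where wchar_t is 32 bits (for example GCC on Linux) and 3 where it is 16 bits and UTF-16 encoded (for example VC++), but that is an assumption about those compilers, not something the standard text guarantees.

```cpp
#include <cstddef>

// The literal from the question; U+E0005 is outside the BMP.
const wchar_t kk[] = L"\U000E0005";

// Element count chosen by this compiler, terminator included.
constexpr std::size_t kk_units = sizeof(kk) / sizeof(kk[0]);

// What the quoted rule predicts: one universal-character-name plus L'\0'.
constexpr std::size_t kk_expected_by_rule = 2;

// True when this implementation's array matches the rule's count.
constexpr bool kk_matches_rule = (kk_units == kk_expected_by_rule);
```

Printing kk_units, or adding static_assert(kk_matches_rule, "..."), makes the discrepancy visible on an implementation that stores a surrogate pair.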

I think the simplest solution for Windows is to "ban" anything that requires a surrogate pair in wchar_t (that is, "ban" any Unicode character outside the BMP).

Is there anything wrong with my understanding?

Thanks.

user534498 asked Dec 08 '10 03:12




2 Answers

The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.

Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.

The answer you linked to is somewhat misleading as well:

On Linux, a wchar_t is 4-bytes, while on Windows, it's 2-bytes

The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.
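Since the width is a property of the compiler, it can be probed directly. This is only a sketch; wchar_bits and wchar_holds_any_code_point are names made up here, and the check assumes Unicode's largest code point, U+10FFFF, is the upper bound that matters.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Width of wchar_t in bits on this implementation.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// True if a single wchar_t can hold the largest Unicode code point, U+10FFFF.
constexpr bool wchar_holds_any_code_point =
    static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()) >= 0x10FFFFul;
```

With a 2-byte wchar_t the flag is false, which is exactly the case where a character outside the BMP cannot fit in one element.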

casablanca answered Sep 30 '22 08:09


Windows knows nothing about wchar_t, because wchar_t is a programming concept. Conversely, wchar_t is just storage, and it knows nothing about the semantic value of the data you store in it (that is, it knows nothing about Unicode or ASCII or whatever.)

If a compiler or SDK that targets Windows defines wchar_t to be 16 bits, then that compiler may be in conflict with the C++0x standard. (I don't know whether there are some get-out clauses that allow wchar_t to be 16 bits.) But in any case the compiler could define wchar_t to be 32 bits (to comply with the standard) and provide runtime functions to convert to/from UTF-16 for when you need to pass your wchar_t* to Windows APIs.
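A conversion function of that kind might look roughly like the sketch below. It uses char32_t and char16_t as stand-ins for a 32-bit wchar_t and the 16-bit units Windows APIs expect, so that it behaves the same on any compiler. The name to_utf16 is hypothetical, and the input is assumed to contain only valid Unicode scalar values (no validation is done).

```cpp
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-16 code units.
// Assumes every input value is a valid Unicode scalar value.
std::u16string to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp < 0x10000) {
            // BMP character: one 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: split the 20-bit offset into a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
        }
    }
    return out;
}
```

For U+E0005 this produces the pair 0xDB40, 0xDC05. The resulting buffer could then be handed to an API that expects UTF-16, given a suitable cast on a platform where the API's character type is a 16-bit wchar_t.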

Ciaran Keating answered Sep 30 '22 10:09