
Relationship between 'x' and L'x' and widen('x')

Let x be any member of the basic source character set. 'x' and L'x' are members of the basic execution character set and the basic execution wide-character set, respectively.

Is it true that the integral values of 'x' and L'x' must be equal? It looks like the standard does not require that, which makes sense: one could conceivably use, say, EBCDIC as the narrow character set and Unicode as the wide character set.

Is it true that std::use_facet<std::ctype<wchar_t>>(std::locale()).widen('x') should be equal to L'x' in some (or any) locale? Here it would make sense to require that, but I cannot find such a requirement in the standard either. Likewise, is std::use_facet<std::ctype<wchar_t>>(std::locale()).narrow(L'x') the same as 'x'?

If the above is not true, then which one of these

std::wcout << L'x';
std::wcout << ct.widen('x');

should output x? ct is an appropriate locale facet.

n. 1.8e9-where's-my-share m. asked Aug 12 '15 08:08



1 Answer

There is little that can be guaranteed in practice about wide character sets, because the C and C++ standards require that every wide character be representable as a single encoding value, whereas the de facto standard in Windows programming is UTF-16 encoded wide text. Originally Windows wide text was simply the original 16-bit Unicode, now called UCS-2, which is still used in Windows console windows and which conforms to the C and C++ requirements. UTF-16 is an extension of UCS-2 that uses two encoding values, called a surrogate pair, for each character outside the original Unicode range, now known as the Basic Multilingual Plane (BMP).


Re

Is it true that integral values of 'x' and L'x' must be equal? [When x is a member of the C++ basic source character set]

The basic source character set is a subset of ASCII, and nearly all extant general character encodings, including in particular the Unicode encodings, are extensions of ASCII. There is one exception, namely IBM's EBCDIC character encodings (there are multiple variants). However, if it's still used at all, then that's on IBM mainframes.

Thus in practice you have that guarantee, but formally you don't. More importantly, though, it's irrelevant. For example, the basic source character set lacks the $ sign, which you can hardly expect to do without, i.e. limiting oneself to the basic source character set is not a practical proposition.


Re

Is it true that std::use_facet<std::ctype<wchar_t>>(std::locale()).widen('x') should be equal to L'x' in some (or any) locale? [When x is a member of the C++ basic source character set]

For the same reason as for the literals: yes in practice, no formally (since encodings like EBCDIC are supported), and again this is irrelevant for the practitioner.

For in-practice purposes, a more relevant consideration is that Microsoft's Visual C++ has (undocumented) Windows ANSI as its execution character set, and UTF-16 as the wide character encoding. E.g. on my machine the execution character set is Windows 1252, a.k.a. Windows ANSI Western. And some characters have totally different character codes in Unicode: in particular €, which is 0x80 in Windows 1252 but U+20AC in Unicode. Worse, there might just be some narrow character set usable as execution character set in which some character has a UTF-16 encoding that needs a surrogate pair of encoding values. In that case widen can't even represent the result; there's no room for it in a single wchar_t.

Cheers and hth. - Alf answered Oct 11 '22 17:10
