I've recently tried to get the full picture of what it takes to create platform-independent C++ applications that support Unicode. One thing that confuses me is that most howtos conflate the character encoding (i.e. ANSI or Unicode) with the character type (char or wchar_t). As I've learned so far, these are different things: there may exist a character sequence encoded in Unicode (e.g. UTF-8) but represented by a std::string, as well as a character sequence encoded in ANSI but represented as a std::wstring, right?
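For example, here's a minimal sketch of the distinction I mean (the escaped bytes are just the UTF-8 encoding of 'é'):

    #include <iostream>
    #include <string>

    int main() {
        // std::string is just a sequence of char (bytes); here it happens to
        // hold the UTF-8 encoding of "héllo" ('é' is the two bytes 0xC3 0xA9).
        std::string utf8 = "h\xC3\xA9llo";

        // std::wstring is a sequence of wchar_t; here it holds only
        // ASCII-range characters, so no Unicode is involved at all.
        std::wstring wide = L"hello";

        std::cout << utf8.size() << " " << wide.size() << std::endl; // 6 and 5
        return 0;
    }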
So the question that comes to my mind is: does the C++ standard give any guarantee about the encoding of string literals prefixed with L, or does it just say they are of type wchar_t with an implementation-specific character encoding?
If there is no such guarantee, does that mean I need some sort of external resource system to provide non-ASCII string literals for my application in a platform-independent way? What is the preferred approach: a resource system, or proper encoding of source files plus proper compiler options?
A wide string literal is a null-terminated array of constant wchar_t that is prefixed by L and contains any graphic character except the double quotation mark ("), backslash (\), or newline character. A wide string literal may contain the escape sequences listed above and any universal character name.
L is a prefix used for wide strings. Each character uses several bytes (depending on the size of wchar_t). The encoding used is independent of this prefix.
L is the prefix for wide-character literals and wide-character string literals; it tells the compiler that the character or string is of the wide-character type. A w is prefixed in operations like scanning (wcin) or printing (wcout) when operating on wide-character types.
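For instance, a small sketch (whether non-ASCII characters actually display correctly depends on your locale and terminal):

    #include <iostream>

    int main() {
        std::wcout << L"Enter a character: ";  // wide-character output stream
        wchar_t c;
        std::wcin >> c;                        // wide-character input stream
        std::wcout << L"You typed: " << c << std::endl;
        return 0;
    }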
A character literal contains a sequence of characters or escape sequences enclosed in single quotation marks, for example 'c'. A character literal may be prefixed with the letter L, for example L'c'. A character literal without the L prefix is an ordinary character literal or a narrow character literal.
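To illustrate the difference in type (the second size is implementation-dependent):

    #include <iostream>

    int main() {
        char    narrow = 'c';   // ordinary (narrow) character literal, type char
        wchar_t wide   = L'c';  // wide character literal, type wchar_t

        std::cout << sizeof(narrow) << '\n'; // always 1
        std::cout << sizeof(wide)   << '\n'; // sizeof(wchar_t): 2 on Windows, often 4 elsewhere
        return 0;
    }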
The L symbol in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide-character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode (however, C++11 defines Unicode character types and string literals), so it's up to you to properly represent Unicode strings in C++03.
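As a sketch of the C++11 additions mentioned above (requires a C++11 compiler):

    #include <string>

    int main() {
        // C++11 Unicode string literals with guaranteed encodings:
        const char*     u8s = u8"caf\u00E9"; // UTF-8 encoded
        const char16_t* u16 = u"caf\u00E9";  // UTF-16 encoded
        const char32_t* u32 = U"caf\u00E9";  // UTF-32 encoded

        // C++03-style wide literal: type wchar_t, encoding implementation-defined.
        const wchar_t*  ws  = L"caf\u00E9";

        (void)u8s; (void)u16; (void)u32; (void)ws;
        return 0;
    }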
Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is basically equivalent to ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide characters.
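You can check those sizes on your own platform with a quick sketch like this (the second figure depends on sizeof(wchar_t), e.g. 2 on Windows and typically 4 on Linux):

    #include <iostream>

    int main() {
        std::cout << sizeof("abc") << '\n';  // 4: three chars plus the null
        std::cout << sizeof(L"abc") << '\n'; // 4 * sizeof(wchar_t), e.g. 8 or 16
        return 0;
    }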
The standard also mentions "universal-character-names", which let you refer to non-ASCII characters using the \uXXXX hexadecimal notation. These universal-character-names designate ISO/IEC 10646 (i.e. Unicode) code points, but how those code points are encoded in the execution character set is implementation-defined. Still, by using universal-character-names you can at least guarantee that your string unambiguously names specific Unicode characters, independent of how the source file itself is encoded. Whether this results in correct Unicode output also depends on the runtime environment supporting Unicode, having the appropriate fonts installed, etc.
As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use the \uXXXX notation in your string literals.
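For example (a sketch; the in-memory encoding of the wide literal is still implementation-defined, but the code point being named is unambiguous regardless of how the source file is encoded):

    #include <iostream>

    int main() {
        // U+00E9 (LATIN SMALL LETTER E WITH ACUTE) named via a
        // universal-character-name rather than a raw byte in the source file.
        const wchar_t* s = L"caf\u00E9";

        std::wcout << s << std::endl; // correct display depends on locale/terminal
        return 0;
    }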
C++03 does not mention Unicode (the upcoming C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform-specific code. As others have mentioned, the encoding of wchar_t (and even its size) is not specified. Consequently, string literal encoding is implementation-specific. However, you can specify Unicode code points in string literals by using the \x, \u, and \U escapes.
Typically, Unicode apps on Windows use wchar_t (with UTF-16 encoding) as the internal character format, because it makes using the Windows APIs easier, since Windows itself uses UTF-16. Unix/Linux Unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is the usual choice for the data transfer encoding.
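As a minimal Windows-side sketch of that convention, here is a conversion from an internal UTF-16 wchar_t string to UTF-8 for interchange, using the Win32 WideCharToMultiByte API (error handling omitted for brevity):

    #include <windows.h>
    #include <string>

    // Convert a UTF-16 wide string to a UTF-8 std::string (Windows only).
    std::string toUtf8(const std::wstring& wide) {
        // First call: ask how many bytes the UTF-8 result needs.
        int size = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                                       NULL, 0, NULL, NULL);
        std::string utf8(size, '\0');
        // Second call: perform the actual conversion.
        WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                            &utf8[0], size, NULL, NULL);
        utf8.resize(size - 1); // drop the terminating '\0' the API writes
        return utf8;
    }

On Unix/Linux, where char/UTF-8 is already the internal format, no such conversion is needed for interchange; other encodings are usually handled with iconv or a library such as ICU.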