Is a wide character string literal starting with L like L"Hello World" guaranteed to be encoded in Unicode?

Tags:

c++

unicode

I've recently tried to get the full picture of what it takes to create platform-independent C++ applications that support Unicode. One thing that confuses me is that most how-tos equate the character encoding (i.e. ANSI or Unicode) with the character type (char or wchar_t). As I've learned so far, these are different things: a character sequence can be encoded in Unicode but stored in a std::string, and a character sequence can be encoded in ANSI but stored in a std::wstring, right?

So the question that comes to my mind is whether the C++ standard gives any guarantee about the encoding of string literals starting with L, or does it just say the literal is of type wchar_t with an implementation-specific character encoding?

If there is no such guarantee, does that mean I need some sort of external resource system to provide non-ASCII string literals for my application in a platform-independent way? What is the preferred way to do this? A resource system, or properly encoded source files plus the right compiler options?

979

asked Nov 27 '09 19:11

Peter

2 Answers

The L prefix in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to hold text encoded in GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode (C++11, however, defines Unicode character types and string literals), so it's up to you to properly represent Unicode strings in C++03.
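To make the C++11 side of this concrete, here is a minimal sketch assuming a C++11 (or later) compiler. The function names `grin_utf16` and `grin_utf32` are made up for illustration. Unlike L"...", the u"..." and U"..." prefixes have a fixed encoding (UTF-16 and UTF-32), so the number of code units is the same on every platform.

```cpp
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
// surrogate pair (two code units) while UTF-32 needs a single code unit.
inline std::u16string grin_utf16() { return u"\U0001F600"; }
inline std::u32string grin_utf32() { return U"\U0001F600"; }

// u8"..." is UTF-8: U+00E9 takes two bytes, and sizeof also counts the
// terminating null. This holds whether the element type is char (C++11 to
// C++17) or char8_t (C++20).
static_assert(sizeof(u8"\u00E9") == 3, "two UTF-8 bytes plus the terminator");
```

Here `grin_utf16()` should hold the two code units 0xD83D 0xDE00 and `grin_utf32()` the single code unit 0x1F600.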

Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is essentially a subset of ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide characters.
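The size claim can be checked at compile time. This is a small sketch assuming C++11 or later (for `constexpr` and `static_assert`); the variable names are illustrative only. Note that `sizeof` on a literal includes the terminator, hence 4 rather than 3 elements.

```cpp
#include <cstddef>

// Size in bytes of the narrow literal, including the terminating '\0'.
constexpr std::size_t narrow_bytes = sizeof("abc");

// Size in bytes of the wide literal, including the terminating L'\0'.
constexpr std::size_t wide_bytes = sizeof(L"abc");

static_assert(narrow_bytes == 4, "three chars plus the terminator");
static_assert(wide_bytes == 4 * sizeof(wchar_t),
              "three wchar_t plus the terminator");
```

This only pins down the sizes. It says nothing about which numeric values the characters get, which is the part the standard leaves to the implementation.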

The standard also mentions "universal-character-names", which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. A universal-character-name identifies a character by its ISO/IEC 10646 (Unicode) code point, but the standard leaves it to the implementation how that character is encoded in the resulting narrow or wide string. In practice, common compilers store the expected UTF-16 or UTF-32 code units in a wide literal, and using universal-character-names at least makes the literal independent of the encoding of the source file. The text will then display as Unicode provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.

As for string literals in C++03 source files, again there is no guarantee. If you have a string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want the compiler to "do the right thing" regardless of how the source file is saved, you'd need to use the \uXXXX notation in your string literals.
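As a rough illustration of the \uXXXX approach, the sketch below dumps the code units of a wide literal as hex. `code_units_hex` and `e_acute_wide` are hypothetical helpers, not anything from the standard library. The values you get are implementation-defined; the comments describe what typical compilers (GCC, Clang, MSVC) are expected to produce, not a guarantee from the standard.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// U+00E9 (é) written as a universal-character-name, so the source file
// itself contains only ASCII.
inline std::wstring e_acute_wide() {
    return L"\u00E9";
}

// Formats each wchar_t code unit as uppercase hex, separated by spaces.
inline std::string code_units_hex(const std::wstring& s) {
    std::ostringstream out;
    out << std::hex << std::uppercase;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out << ' ';
        out << static_cast<unsigned long>(s[i]);
    }
    return out.str();
}
```

With a UTF-16 or UTF-32 wchar_t, `code_units_hex(e_acute_wide())` should give "E9". For code points above U+FFFF the result differs between platforms: one code unit where wchar_t is 32 bits wide, a surrogate pair where it is 16 bits wide.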

188

answered Oct 26 '22 06:10

Charles Salvia

The C++03 standard does not mention Unicode (the upcoming C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform-specific code. As others have mentioned, the encoding (or even the size) of wchar_t is not specified. Consequently, string literal encoding is implementation-specific. However, you can put Unicode code points into string literals by using the \x, \u and \U escapes.
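For the \x route, here is a small sketch that spells out UTF-8 bytes by hand in a plain std::string. The helper names `e_acute_utf8` and `utf8_code_points` are invented for this example. It also shows why the character type and the encoding are separate things: the string holds two chars but only one Unicode code point.

```cpp
#include <cstddef>
#include <string>

// "é" (U+00E9) written as its two UTF-8 bytes. The source stays pure
// ASCII, and \x escapes put the given byte values into the string as-is.
inline std::string e_acute_utf8() {
    return "\xC3\xA9";
}

// Counts code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx). Assumes the input is well-formed UTF-8.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

One thing to watch with \x: the escape consumes every following hex digit, so a literal like "\xC3\xA9" is fine, but putting a letter such as 'b' directly after an escape would change its value.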

Typically, Unicode apps on Windows use wchar_t (with UTF-16 encoding) as their internal character format, because this makes using the Windows APIs easier: Windows itself uses UTF-16. Unicode apps on Unix/Linux in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is the usual choice for the transfer encoding.
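If you go the "build your own" route for data exchange, the core of a UTF-8 writer is short. The following is only a sketch assuming C++11 for `char32_t`; `encode_utf8` is a hypothetical helper, and it does not reject surrogates or values above U+10FFFF, which a real implementation (or a library such as ICU or UTF-CPP) would handle.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 (1 to 4 bytes).
// No validation: the caller must pass a valid scalar value.
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(0x20AC)` (the euro sign) yields the three bytes E2 82 AC, which any platform can read back regardless of how it represents text internally.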

answered Oct 26 '22 06:10

eidolon
