When I have C++ code like this:
std::string narrow( "This is a narrow source string" );
std::string n2( "Win-1252 (that's the encoding we use for source files): ä,ö,ü,ß,€, ..." );
// What encoding should I pass to Win32's `MultiByteToWideChar` function
// to convert these strings to proper wchar_t (= UTF-16 on Windows)?
Can I always assume Win-1252 if that's the (implicit) encoding of our cpp files? How does the Visual-C++ compiler decide which character encoding the source files are in?
What would happen if, say, a developer uses a machine where "normal" text files default to another single/multibyte encoding?
I assume the encoding is only an issue on the machine used to compile the code? That is, once the executable is built, converting a static string from a fixed narrow encoding to Windows' UTF-16 wchar_t will always yield the same result regardless of the language/locale on the user's PC?
Note: Since the answer below was written, VC++ has added options for the source and execution character sets (/source-charset, /execution-charset, and /utf-8).
For wide literals VC++ always produces UTF-16. For narrow literals it always converts from the source encoding to the "encoding for non-Unicode programs" set on the host machine (the system you run the compiler on). So as long as VC++ correctly recognizes the source encoding, that's what you'll get: UTF-16 for wide literals and the host's encoding for non-Unicode programs for narrow ones.
To determine the source encoding VC++ looks for a byte order mark (BOM) at the start of the file. It will recognize UTF-16 and UTF-8. If there is no BOM then it assumes that the source is encoded using the system's encoding for non-Unicode programs.
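The detection rule just described can be sketched roughly as follows. This is only an illustration of the rule, not the compiler's actual code: the names SourceEncoding and detect_source_encoding are made up for this example, and treating both UTF-16 byte orders as recognized is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names, used only for this illustration.
enum class SourceEncoding { Utf8, Utf16LE, Utf16BE, SystemAnsiCodePage };

// Inspect the first bytes of a source file for a BOM.
// Without a BOM, fall back to the system's encoding for non-Unicode programs.
SourceEncoding detect_source_encoding(const std::string& bytes) {
    auto byte_at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 &&
        byte_at(0) == 0xEF && byte_at(1) == 0xBB && byte_at(2) == 0xBF) {
        return SourceEncoding::Utf8;      // UTF-8 BOM: EF BB BF
    }
    if (bytes.size() >= 2 && byte_at(0) == 0xFF && byte_at(1) == 0xFE) {
        return SourceEncoding::Utf16LE;   // UTF-16 little-endian BOM: FF FE
    }
    if (bytes.size() >= 2 && byte_at(0) == 0xFE && byte_at(1) == 0xFF) {
        return SourceEncoding::Utf16BE;   // UTF-16 big-endian BOM: FE FF
    }
    return SourceEncoding::SystemAnsiCodePage;  // no BOM found
}
```

The last branch is the one that bites: a BOM-less Win-1252 file is only read as Win-1252 if that happens to be the build machine's setting.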
If this results in the wrong encoding being used then any conversions performed by the compiler on character and string literals will result in the wrong values for any characters outside the ASCII range.
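To make that concrete, here is a small sketch of what happens to one and the same byte under two different single-byte code pages. The two helper functions are hypothetical and only cover the upper ranges of Windows-1252 and Windows-1251 where the mapping is a simple offset.

```cpp
#include <cassert>

// Windows-1252: bytes 0xA0-0xFF map to the same-valued code points U+00A0-U+00FF.
// (Hypothetical helper; only valid for that range.)
char16_t decode_as_cp1252(unsigned char b) {
    return static_cast<char16_t>(b);
}

// Windows-1251: bytes 0xC0-0xFF map to the Cyrillic letters U+0410-U+044F.
// (Hypothetical helper; only valid for that range.)
char16_t decode_as_cp1251(unsigned char b) {
    return static_cast<char16_t>(0x0410 + (b - 0xC0));
}
```

The byte 0xDF decodes to U+00DF (ß) with the first function and to U+042F (Я) with the second. So a BOM-less Win-1252 source file compiled on a machine whose encoding for non-Unicode programs is Win-1251 would have its ß literals treated as Я.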
Once the program is compiled then yes, the locale will stop mattering as far as these compile-time conversions go, as the data is static.
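For the run-time conversion, the point is to name the code page explicitly instead of relying on the user's default. With Win32 that would mean passing 1252 rather than CP_ACP to MultiByteToWideChar, assuming the literals really did end up as Win-1252 bytes in the executable. As a portable illustration of why the result is then fixed, here is a sketch of the same mapping done by hand. The function name cp1252_to_utf16 is made up, and mapping the five bytes Windows-1252 leaves undefined to the same-valued control codes is an assumption.

```cpp
#include <cassert>
#include <string>

// Windows-1252 differs from Latin-1 only in the range 0x80-0x9F.
// Undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through
// as the same-valued control code; that choice is an assumption.
static const char16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Convert Windows-1252 bytes to UTF-16 code units using a fixed table.
// No locale or system setting is consulted, so the output depends only
// on the input bytes.
std::u16string cp1252_to_utf16(const std::string& narrow) {
    std::u16string wide;
    wide.reserve(narrow.size());
    for (char c : narrow) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b >= 0x80 && b <= 0x9F) {
            wide.push_back(kCp1252High[b - 0x80]);
        } else {
            wide.push_back(static_cast<char16_t>(b));
        }
    }
    return wide;
}
```

Every Windows-1252 character lies in the Basic Multilingual Plane, so each input byte becomes exactly one UTF-16 code unit and no surrogate handling is needed.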
Encoding may matter for other things though, such as if you print one of these strings to the console. You'll either have to perform an appropriate conversion to whatever the console is using or ensure the console is set to accept the encoding you're using.
Note on #pragma setlocale
#pragma setlocale affects only the conversion to wide literals and it does so neither by setting the source encoding nor by changing the wide execution encoding. What it actually does is, frankly, horrifying. Just as an example the following assertion fails:
#pragma setlocale(".1251")
static_assert(L'Я' != L'ß', "wtf...");
It should definitely be avoided if you use any Unicode encoding for your source.
The language specification merely says that source characters are mapped in an implementation-defined way. You need to consult the documentation for the compiler you are using in order to see what that implementation's definition says. For example, Microsoft Visual C++ uses #pragma setlocale to specify the code page.