Does the multibyte-to-wide-string conversion function "mbstowcs", when passed a string literal, use the encoding of the source file?

ADDENDUM A tentative answer of my own appears at the bottom of the question.


I am converting an archaic VC6 C++/MFC project to VS2013 and Unicode, based on the recommendations at utf8everywhere.org.

Along the way, I have been studying Unicode, UTF-16, UCS-2, UTF-8, the standard library and STL support of Unicode & UTF-8 (or, rather, the standard library's lack of support), ICU, Boost.Locale, and of course the Windows SDK and MFC APIs that require UTF-16 wchar_t's.

As I have been studying the above issues, one question keeps recurring that I have not been able to answer to my satisfaction.

Consider the C library function mbstowcs. This function has the following signature:

size_t mbstowcs (wchar_t* dest, const char* src, size_t max);

The second parameter src is (according to the documentation) a

C-string with the multibyte characters to be interpreted. The multibyte sequence shall begin in the initial shift state.

My question is in regards to this multibyte string. It is my understanding that the encoding of a multibyte string can differ from string to string, and the encoding is not specified by the standard. Nor does a particular encoding seem to be specified by the MSVC documentation for this function.

My understanding at this point is that on Windows, this multibyte string is expected to be encoded with the ANSI code page of the active locale. But my clarity begins to fade at this point.

I have been wondering whether the encoding of the source code file itself makes a difference in the behavior of mbstowcs, at least on Windows. And, I'm also confused about what happens at compile time vs. what happens at run time for the code snippet above.

Suppose you have a string literal passed to mbstowcs, like this:

wchar_t dest[1024];
mbstowcs (dest, "Hello, world!", 1024);

Suppose this code is compiled on a Windows machine. Suppose that the code page of the source code file itself is different from the code page of the current locale on the machine on which the compiler runs. Will the compiler take into consideration the source code file's encoding? Will the resulting binary be affected by the fact that the code page of the source code file is different from the code page of the active locale on which the compiler runs?

On the other hand, maybe I have it wrong - maybe the active locale of the runtime machine determines the code page that is expected of the string literal. Therefore, does the code page with which the source code file is saved need to match the code page of the computer on which the program ultimately runs? That seems so whacked to me that I find it hard to believe this would be the case. But as you can see, my clarity is lacking here.
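To explore this, here is the kind of small experiment I have been running (my own sketch, not taken from any documentation): it dumps the raw bytes the compiler burned into the literal, and then lets mbstowcs decode those bytes under whatever locale is installed at run time.

// Sketch: dump the literal's bytes as stored in the binary, then decode them
// with mbstowcs under the C locale that is installed at run time.
#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    const char* src = "Hello, world!";   // try a non-ASCII literal here to see the effect

    // Compile-time artifact: the raw bytes baked into the executable.
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(src); *p; ++p)
        std::printf("%02X ", *p);
    std::printf("\n");

    // Run-time behavior: mbstowcs decodes those bytes according to the current
    // C locale (after this call, the machine's default ANSI code page), not
    // according to whatever encoding the source file was saved in.
    std::setlocale(LC_ALL, "");
    wchar_t dest[1024];
    std::size_t n = std::mbstowcs(dest, src, 1024);
    std::printf("converted %u wide characters\n", static_cast<unsigned>(n));
    return 0;
}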

On the other hand, if we change the call to mbstowcs to explicitly pass a UTF-8 string:

wchar_t dest[1024];
mbstowcs (dest, u8"Hello, world!", 1024);

... I assume that mbstowcs will always do the right thing - regardless of the code page of the source file, the current locale of the compiler, or the current locale of the computer on which the code runs. Am I correct about this?
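One thing I can at least verify is that the bytes of a u8 literal (at least one written with universal character names) are pinned to UTF-8 at compile time, independent of the source file's code page - at least on a compiler that accepts u8 literals at all (I believe MSVC only gained them after VS2013). A tiny check of my own:

// Sketch: a u8 literal's byte sequence does not depend on the source file's
// code page or on any locale; U+00E9 is always encoded as the bytes C3 A9.
#include <cassert>
#include <cstring>

int main()
{
    const char* e_acute = u8"\u00E9";                 // 'é' spelled as a universal character name
    assert(std::strcmp(e_acute, "\xC3\xA9") == 0);    // its UTF-8 encoding, written as explicit bytes
    return 0;
}

That only pins down the bytes, though - it says nothing about how mbstowcs will interpret them at run time, which is exactly what I'm unsure about.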

I would appreciate clarity on these matters, in particular in regards to the specific questions I have raised above. If any or all of my questions are ill-formed, I would appreciate knowing that, as well.


ADDENDUM From the lengthy comments beneath @TheUndeadFish's answer, and from the answer to a question on a very similar topic here, I believe I have a tentative answer to my own question that I'd like to propose.

Let's follow the raw bytes of the source code file to see how the actual bytes are transformed through the entire process of compilation to runtime behavior:

  • The C++ standard 'ostensibly' requires that all characters in any source code file be drawn from a (particular) 96-character subset of ASCII called the basic source character set. (But see the following bullet points.)

    In terms of the actual byte-level encoding of these 96 characters in the source code file, the standard does not specify any particular encoding, but all 96 characters are ASCII characters, so in practice there is rarely a question about what encoding the source file is in, because nearly every encoding in common use represents these 96 ASCII characters with the same raw bytes.

  • However, character literals and code comments might commonly contain characters outside these basic 96.

    This is typically supported by the compiler (even though this isn't required by the C++ standard). The source code's character set is called the source character set. But the compiler needs to have these same characters available in its internal character set (called the execution character set), or else those missing characters will be replaced by some other (dummy) character (such as a square or a question mark) prior to the compiler actually processing the source code - see the discussion that follows.

    How the compiler determines the encoding that is used to encode the characters of the source code file (when characters appear that are outside the basic source character set) is implementation-defined.

    Note that it is possible for the compiler to use a different character set (encoded however it likes) for its internal execution character set than the character set represented by the encoding of the source code file!

    This means that even if the compiler knows about the encoding of the source code file (which implies that the compiler also knows about all the characters in the source code's character set), the compiler might still be forced to convert some characters in the source code's character set to different characters in the execution character set (thereby losing information). The standard states that this is acceptable, but that the compiler must not convert any characters in the source character set to the NULL character in the execution character set.

    Nothing is said by the C++ standard about the encoding used for the execution character set, just as nothing is said about the characters that are required to be supported in the execution character set (other than the characters in the basic execution character set, which include all characters in the basic source character set plus a handful of additional ones such as the NULL character and the backspace character).

    None of this process seems to be documented very clearly anywhere, even by Microsoft, for MSVC - i.e., how the compiler figures out what the encoding and corresponding character set of the source code file is, what the choice of execution character set is, and what encoding will be used for the execution character set during compilation of the source code file.

    It seems that in the case of MSVC, the compiler makes a best-guess attempt to select an encoding (and corresponding character set) for any given source code file, falling back on the default code page of the current locale of the machine the compiler is running on. Alternatively, you can take special steps to save the source code files as Unicode using an editor that writes the proper byte-order mark (BOM) at the beginning of each file. This includes UTF-8, for which the BOM is usually optional or omitted - but for source files read by the MSVC compiler, you must include the UTF-8 BOM.

    And in terms of the execution character set and its encoding for MSVC, continue on with the next bullet point.

  • The compiler proceeds to read the source file and converts the raw bytes of the characters of the source code file from the encoding for the source character set into the (potentially different) encoding of the corresponding character in the execution character set (which will be the same character, if the given character is present in both character sets).

    Ignoring code comments and character literals, all such characters are typically in the basic execution character set noted above. This is a subset of the ASCII character set, so encoding issues are irrelevant (all of these characters are, in practice, encoded identically on all compilers).

    Regarding the code comments and character literals, though: the code comments are discarded, and if the character literals contain only characters in the basic source character set, then no problem - these characters will belong in the basic execution character set and still be ASCII.

    But if the character literals in the source code contain characters outside of the basic source character set, then these characters are, as noted above, converted to the execution character set (possibly with some loss). But as noted, neither the characters, nor the encoding for this character set is defined by the C++ standard. Again, the MSVC documentation seems to be very weak on what this encoding and character set will be. Perhaps it is the default ANSI encoding indicated by the active locale on the machine on which the compiler runs? Perhaps it is UTF-16?

  • In any case, the raw bytes that will be burned into the executable for the character string literal correspond exactly to the compiler's encoding of the characters in the execution character set.

  • At runtime, mbstowcs is called and it is passed the bytes from the previous bullet point, unchanged.

    It is now time for the C runtime library to interpret the bytes that are passed to mbstowcs.

    Because no locale is provided with the call to mbstowcs, the C runtime has no idea what encoding to use when it receives these bytes - this is arguably the weakest link in this chain.

    It is not documented by the C++ (or C) standard what encoding should be used to read the bytes passed to mbstowcs. I am not sure if the standard states that the input to mbstowcs is expected to be in the same execution character set as the characters in the execution character set of the compiler, OR if the encoding is expected to be the same for the compiler as for the C runtime implementation of mbstowcs.

    But my tentative guess is that in the MSVC C runtime, the locale of the currently running thread is used to determine both the runtime execution character set and the encoding representing that character set, and these are what get used to interpret the bytes passed to mbstowcs.

    This means that it will be very easy for these bytes to be mis-interpreted as different characters than were encoded in the source code file - very ugly, as far as I'm concerned.

    If I'm right about all this, then if you want to force the C runtime to use a particular encoding, you should call the Windows SDK's MultiByteToWideChar, as @HarryJohnston's comment indicates, because you can pass the desired encoding to that function explicitly (see the sketch after this list).

  • Due to the above mess, there really isn't an automatic way to deal with character literals in source code files.

    Therefore, as https://stackoverflow.com/a/1866668/368896 mentions, if there's a chance you'll have non-ASCII characters in your character literals, you should use resources (such as gettext's approach, which also works via Boost.Locale on Windows in conjunction with the xgettext.exe that ships with Poedit), and in your source code, simply write functions to load the resources as raw (unchanged) bytes.

    Make sure to save your resource files as UTF-8, and then make sure to call functions at runtime that explicitly support UTF-8 for their char *'s and std::string's, such as (from the recommendations at utf8everywhere.org) using Boost.Nowide (not really in Boost yet, I think) to convert from UTF-8 to wchar_t at the last possible moment prior to calling any Windows API functions that write text to dialog boxes, etc. (and using the W forms of these Windows API functions). For console output, you must call the SetConsoleOutputCP-type functions, such as is also described at https://stackoverflow.com/a/1866668/368896.
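To make that last step concrete, here is a sketch of the pattern (only a sketch: the helper name Utf8ToWide is mine, not from any library, and the hard-coded string stands in for text loaded from a UTF-8 resource file):

// Keep text as UTF-8 in char*/std::string throughout the program, and widen
// it to UTF-16 only at the last moment, right before calling a W-suffixed API.
#include <windows.h>
#include <cstdio>
#include <string>

std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()), NULL, 0);
    if (len <= 0) return std::wstring();   // invalid UTF-8 input
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}

int main()
{
    // "Grüß Gott" written as explicit UTF-8 bytes, so the source file's code
    // page cannot interfere with what ends up in the binary.
    std::string message = "Gr\xC3\xBC\xC3\x9F Gott";

    // Console path: tell the console to expect UTF-8 and print the bytes as-is.
    SetConsoleOutputCP(CP_UTF8);
    std::printf("%s\n", message.c_str());

    // GUI path: widen only at the API boundary and call the W form of the API.
    MessageBoxW(NULL, Utf8ToWide(message).c_str(), L"UTF-8 everywhere", MB_OK);
    return 0;
}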

Thanks to those who took the time to read the lengthy proposed answer here.

asked by Dan Nissenbaum
1 Answer

The encoding of the source code file doesn't affect the behavior of mbstowcs. After all, the internal implementation of the function is unaware of what source code might be calling it.

The MSDN documentation you linked says:

mbstowcs uses the current locale for any locale-dependent behavior; _mbstowcs_l is identical except that it uses the locale passed in instead. For more information, see Locale.

That linked page about locales then references setlocale which is how the behavior of mbstowcs can be affected.

Now, taking a look at your proposed way of passing UTF-8:

mbstowcs (dest, u8"Hello, world!", 1024);

Unfortunately, that isn't going to work properly, as far as I know, once you use interesting data. If it even compiles, it only does so because the compiler ends up treating the u8 literal the same as a plain char*. And as far as mbstowcs is concerned, it will believe the string is encoded under whatever the locale is set to.

Even more unfortunately, I don't believe there's any way (on the Windows / Visual Studio platform) to set a locale such that UTF-8 would be used.

So that would happen to work for ASCII characters (the first 128 characters) only because they happen to have the exact same binary values in various ANSI encodings as well as UTF-8. If you try with any characters beyond that (for instance anything with an accent or umlaut) then you'll see problems.


Personally, I think mbstowcs and friends are rather limited and clunky. I've found the Windows API function MultiByteToWideChar to be more effective in general. In particular, it can easily handle UTF-8 just by passing CP_UTF8 for the code page parameter.
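To make the failure mode concrete, here is a small sketch of the mismatch (it assumes the run-time locale resolves to a single-byte ANSI code page such as Windows-1252):

// UTF-8 bytes handed to mbstowcs while an ANSI locale is active get decoded
// byte-by-byte, so a single non-ASCII character comes out as two wrong ones.
#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    std::setlocale(LC_ALL, "");        // typically selects a Windows-125x code page

    const char* utf8 = "\xC3\xA9";     // the UTF-8 encoding of 'é' (what a u8 literal would hold)
    wchar_t dest[8];
    std::size_t n = std::mbstowcs(dest, utf8, 8);

    // Under Windows-1252 this prints 2, not 1: the two bytes were read as the
    // two ANSI characters 'Ã' and '©'. MultiByteToWideChar(CP_UTF8, 0, ...)
    // would instead decode them as the single character 'é'.
    std::printf("decoded %u wide characters\n", static_cast<unsigned>(n));
    return 0;
}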

answered by TheUndeadFish