Note: I'm asking about implementation-defined behavior in Microsoft Visual C++ 2008 (possibly the same in 2005 and later). OS: a Simplified Chinese installation of Windows 7.
I was surprised by what happens when I perform non-ASCII I/O with printf. For example:
#include <locale.h>
#include <stdio.h>
int main(void)
{
    // This won't be necessary as it's the system default code page.
    //system("chcp 936");
    // Passing NULL queries the current locale, which is "C" at startup
    printf("%s\n", setlocale(LC_ALL, NULL));
    printf("中\n");
    printf("%s\n", setlocale(LC_ALL, "English"));
    printf("中\n");
    return 0;
}
Output:
Active code page: 936
C
中
English_United States.1252
?D
The memory view in the debugger shows that "中" is encoded in two bytes, 0xD6 0xD0, which is the encoding of that character in code page 936 (Simplified Chinese). These bytes should not fall within the range of the "C" locale, which is most likely 0x00 ~ 0x7F.
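The byte-level observation can be reproduced without a debugger. The following is a minimal sketch, not the asker's code: `dump_bytes_hex` is a hypothetical helper, and the test string is spelled with explicit escapes (`"\xD6\xD0"`) so that the result does not depend on the encoding the source file is saved in.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s as space-separated hex into out.
   Returns the number of bytes dumped, or -1 if out is too small. */
int dump_bytes_hex(const char *s, char *out, size_t out_size)
{
    size_t len, i;
    char *p = out;

    assert(s != NULL && out != NULL);
    len = strlen(s);
    if (out_size < len * 3 + 1)   /* up to 3 chars per byte plus terminator */
        return -1;
    for (i = 0; i < len; ++i)
        p += sprintf(p, i ? " %02X" : "%02X", (unsigned char)s[i]);
    *p = '\0';
    return (int)len;
}
```

Calling it on the literal "中" in a source file saved as code page 936 should yield the same two bytes that the debugger showed.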
Question:
Why can it still display the character correctly in the "C" locale? My first guess was that the locale has no bearing on printf. But then why can't it display the character anymore after switching to the "English" locale, which is also different from 936?
Edit:
I redirected standard output to a file and ran some tests. Whatever locale is set, the correct character "中" is saved in the file. This suggests that setlocale() affects the way the console displays the character, which contradicts my understanding of how it works: printf puts the bytes into the console's input buffer, and the console interprets these bytes using its own code page (the one chcp reports).
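The file test can be approximated in code. This is only a sketch under assumptions: `write_under_locale` is a hypothetical helper, it writes to a named file opened in binary mode rather than to a redirected standard output, and the locale name "English" is specific to the Microsoft runtime (elsewhere setlocale is likely to reject it and leave the locale unchanged).

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: sets the given locale (if available), writes s to
   path with fprintf, then reads the file back as raw bytes into out.
   Returns the number of bytes read, or -1 on I/O error. */
long write_under_locale(const char *locale_name, const char *s,
                        const char *path, unsigned char *out, size_t out_size)
{
    FILE *f;
    size_t n;

    assert(s != NULL && path != NULL && out != NULL);
    /* An unknown name makes setlocale return NULL and change nothing. */
    setlocale(LC_ALL, locale_name);

    f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    fprintf(f, "%s", s);
    fclose(f);

    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    n = fread(out, 1, out_size, f);
    fclose(f);

    setlocale(LC_ALL, "C");   /* restore the startup locale */
    return (long)n;
}
```

Comparing the bytes read back under "C" and under "English" is one way to confirm that the difference appears only on the console path.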
Code page 936 is a rather tricky code page: it allows two-byte characters (similar to what UTF-8 does). Cyrillic (866), for example, does not allow two-byte characters, so it will behave the same as "English".
So when you use the default (936) code page, it knows how to process a two-byte character, while "English" handles single-byte characters only.
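To illustrate the difference between a double-byte and a single-byte reading of the same data, here is a small sketch. It is not what the C runtime does internally; the function names are hypothetical, and the lead-byte range 0x81..0xFE for code page 936 is stated as an assumption in the code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: in code page 936 (GBK) a byte in 0x81..0xFE starts a
   two-byte character. */
static int is_cp936_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Counts characters when the bytes are read as code page 936. */
size_t count_chars_cp936(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        /* A lead byte consumes the following byte too, if one is present. */
        if (is_cp936_lead_byte(s[i]) && i + 1 < len)
            i += 2;
        else
            i += 1;
        ++count;
    }
    return count;
}

/* Counts characters when the bytes are read with a single-byte code page. */
size_t count_chars_single_byte(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    return len;   /* one byte, one character */
}
```

For the bytes 0xD6 0xD0, the first function sees one character and the second sees two, which matches the two-symbol output "?D" in the question.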
Let me also answer why wprintf(L"中") fails. There is a big difference between a console application and a Windows GUI application: they use different code pages.
The following table shows how the console (DOS) code pages correspond to the Windows ones:
DOS | Windows
------+----------
850 | 1252
936 | 54936
866 | 1251
So if you would like to see the correct symbols in the console, call WideCharToMultiByte first. It performs the conversion needed for the console to work in 936.
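A minimal sketch of the suggested conversion follows. `wide_to_code_page` is a hypothetical wrapper; only `WideCharToMultiByte` itself is a real Win32 call, and the code is guarded so that it still compiles (and simply reports failure) on non-Windows systems.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: converts a wide string to the given code page.
   Returns the number of bytes written including the terminator, or 0 on
   failure (always 0 on non-Windows builds, where the API does not exist). */
int wide_to_code_page(const wchar_t *wide, unsigned code_page,
                      char *out, int out_size)
{
    assert(wide != NULL && out != NULL && out_size > 0);
#ifdef _WIN32
    /* -1 means "wide is NUL-terminated"; the two NULLs accept the default
       replacement character for input that cannot be mapped. */
    return WideCharToMultiByte(code_page, 0, wide, -1, out, out_size,
                               NULL, NULL);
#else
    (void)wide; (void)code_page; (void)out; (void)out_size;
    return 0;
#endif
}
```

On Windows, converting L"中" with code page 936 and passing the result to printf("%s") would be one way to try the approach this answer describes.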
The fact that the C locale prints out the string exactly as given is not surprising. That's what I would expect. What is surprising is that the English locale would do something different.
According to the locale documentation on MSDN, the only effect that the locale should have on printf is in determining the radix character for numeric values (i.e. the decimal point).
I suspect perhaps that it's a bug in Microsoft's Compiler. Or at the very least it's undocumented behaviour.
For what it's worth, on my compiler (Borland) the locale has no effect on the output of those strings. It does affect the radix, though.
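The radix effect mentioned in this answer can be checked with a short sketch. The helper names are hypothetical, locale names other than "C" are platform-specific (for example "German" on the Microsoft runtime), and snprintf is C99, so older Microsoft compilers may need _snprintf instead.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats value with one decimal under the named
   LC_NUMERIC locale. Returns 0 on success, -1 if the locale is unavailable. */
int format_under_locale(const char *locale_name, double value,
                        char *out, size_t out_size)
{
    assert(locale_name != NULL && out != NULL && out_size > 0);
    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;
    /* printf-family functions take the radix character from LC_NUMERIC. */
    snprintf(out, out_size, "%.1f", value);
    setlocale(LC_NUMERIC, "C");   /* restore the default numeric locale */
    return 0;
}

/* Returns nonzero if the current locale's radix character is ".". */
int radix_is_dot(void)
{
    return strcmp(localeconv()->decimal_point, ".") == 0;
}
```

Under a locale that uses a decimal comma, the same call would be expected to produce "1,5" while leaving the bytes of any plain string argument untouched.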