Why can printf display non-ASCII characters when the "C" locale is used?

Note: I'm asking about implementation-defined behavior in Microsoft Visual C++ 2008 (possibly the same in 2005 and later). OS: Simplified Chinese installation of Windows 7.

I was surprised by what happens when I do non-ASCII I/O with printf. For example:

   #include <locale.h>
   #include <stdio.h>

   int main(void)
   {
       // Not necessary here: 936 is already the system default code page.
       //system("chcp 936");

       // Passing NULL queries the current locale, which is "C" at startup.
       printf("%s\n", setlocale(LC_ALL, NULL));
       printf("中\n");
       printf("%s\n", setlocale(LC_ALL, "English"));
       printf("中\n");
       return 0;
   }

Output:

Active code page: 936
C
中
English_United States.1252
?D

The debugger's memory view shows that "中" is encoded in two bytes, 0xD6 0xD0, which is that character's encoding in code page 936 (Simplified Chinese). Those bytes should lie outside the character range of the "C" locale, which is most likely 0x00 to 0x7F.

Question:

Why is the character still displayed correctly in the "C" locale? My first guess was that the locale has no bearing on printf. But then why is it no longer displayed correctly after switching to the "English" locale, which also differs from code page 936?

Edit:

I redirected standard output to a file and ran some tests. Whatever locale is set, the correct character "中" is saved in the file. This suggests that setlocale() affects how the console displays the character, which contradicts my understanding of how it works: printf puts the bytes into the console's input buffer, and the console interprets those bytes using its own code page (the one chcp reports).

Eric Z asked May 05 '13 09:05


2 Answers

936 is a rather tricky code page: it allows characters that occupy two bytes (similar to how UTF-8 uses multiple bytes). By contrast, the Cyrillic code page (866) does not allow two-byte characters, so its behavior would be the same as with "English".

So when you use the default code page (936), it knows how to process a two-byte character, while "English" deals with 0x0 ~ 0x7F only.

Let me also answer why wprintf(L"中") fails. There is a big difference between a console application and a windowed Windows application: they use different code pages. The following table matches console (DOS) code pages with their Windows counterparts:

DOS   |   Windows
------+----------
850   |  1252
936   | 54936
866   |  1251

So if you would like to see the correct symbols in the console, call WideCharToMultiByte first; it performs the conversion the console needs in order to work in code page 936.

Dewfy answered Oct 07 '22 00:10


The fact that the C locale prints out the string exactly as given is not surprising. That's what I would expect. What is surprising is that the English locale would do something different.

According to the locale documentation on MSDN, the only effect that the locale should have on printf is in determining the radix character for numeric values (i.e. the decimal point).

I suspect it's a bug in Microsoft's compiler, or at the very least undocumented behaviour.

For what it's worth, on my compiler (Borland) the locale has no effect on the output of those strings. It does affect the radix, though.

James Holderness answered Oct 07 '22 01:10