
What is the difference between printf and std::ostream in the Windows console with UTF-8 output?

I have a program that prints UTF-8 string to the console:

#include <stdio.h>

int main()
{
    printf("Мир Peace Ειρήνη\n");
    return 0;   
}

I configure the console to use a TrueType font (Lucida Console) and set the UTF-8 code page (chcp 65001). When I compile this program with either MinGW GCC or Visual Studio 2010, it works perfectly and I see the output:

Мир Peace Ειρήνη

I do the same using std::cout

#include <iostream>

int main()
{
    std::cout << "Мир Peace Ειρήνη\n" ;
    return 0;   
}

This works perfectly, as above, with MinGW GCC, but with Visual Studio 2010 I get squares. Moreover, I get two squares for each non-ASCII letter.

If I run the program with redirection test >test.txt I get perfect UTF-8 output in the file.

Both tests were done on Windows 7.

Questions:

  1. What is the difference between printf and std::cout in the Visual Studio standard library in how they handle the output stream? Clearly one of them works and the other does not.
  2. How can this be fixed?

Real Answer:

In short: you are screwed - std::cout does not really work with MSVC + UTF-8 - or at least requires enormous effort to make it behave reasonably.

In long: read the two articles referenced in the answer.

Artyom asked Apr 29 '12 12:04


1 Answer

You have a number of flawed assumptions, lemme correct those first:

  • That things appear to work with g++ does not mean that g++ works correctly.

  • Visual Studio is not a compiler, it's an IDE that supports many languages and compilers.

  • The conclusion that the Visual C++'s standard library needs to be fixed is correct, but the reasoning leading to that conclusion is wrong. Also g++ standard library needs to be fixed. Not to mention the g++ compiler itself.

Now, Visual C++ has Windows ANSI, the encoding specified by the GetACP API function, as its undocumented C++ execution character set. Even if your source code is UTF-8 with BOM, narrow strings end up translated to Windows ANSI. If that, on your computer at the time of compilation, is a code page that includes all the non-ASCII characters, then OK, but otherwise the narrow strings will get garbled. The description of your test results is therefore seriously incomplete without mentioning the source code encoding and what your Windows ANSI codepage is.
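One way to check which encoding a narrow literal actually ended up in is to dump its bytes. The following is only a minimal sketch; `hex_dump` is a hypothetical helper name, not something from the runtime or the original post.

```cpp
#include <string>

// Hypothetical helper: returns the bytes of a narrow string as
// space-separated lowercase hex, so the encoding chosen by the
// compiler can be inspected byte by byte.
std::string hex_dump(const std::string& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) {
            out += ' ';              // separator between bytes
        }
        out += digits[b >> 4];       // high nibble
        out += digits[b & 0x0F];     // low nibble
    }
    return out;
}
```

Called on the literal "Мир", this should give d0 9c d0 b8 d1 80 if the bytes stayed UTF-8. If the compiler translated the literal to Windows ANSI and that code page happens to be 1251, the result would be cc e8 f0 instead.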

But anyway, "If I run the program with redirection test >test.txt I get perfect UTF-8 output in the file" indicates that what you're up against is a bit of C++ level help from the Visual C++ runtime, where it bypasses the stream output and uses direct console output in order to get correct characters displayed in the console window.

This help results in garbage when its assumptions, such as Windows ANSI encoded narrow string literals, don't hold.

It also means that the effect mysteriously disappears when you redirect the stream. The runtime library then detects that the stream goes to a file, and turns off the direct console output feature. You're not guaranteed to then get the raw original byte values, but evidently you did, which was bad luck because it masked the problem.
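A program can make a similar console-versus-file check itself. The sketch below is an assumption about how one might do that from user code, not a description of what the Visual C++ runtime does internally; `is_terminal` is a hypothetical helper name.

```cpp
#include <cstdio>
#ifdef _WIN32
#  include <io.h>      // _isatty
#else
#  include <unistd.h>  // isatty
#endif

// Hypothetical helper: reports whether a C stream is attached to a
// terminal/console-like device rather than a file or a pipe.
bool is_terminal(std::FILE* stream)
{
#ifdef _WIN32
    // Nonzero for character devices, which includes the console.
    return _isatty(_fileno(stream)) != 0;
#else
    return isatty(fileno(stream)) != 0;
#endif
}
```

With a check like this, `is_terminal(stdout)` would be expected to differ between running `test` directly and running `test >test.txt`.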

By the way, codepage 65001 in the console in Windows is not usable in practice. Many programs just crash. Including e.g. more.


One way to get correct output is to use the Windows API level directly, with direct console output.
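As a rough illustration of the direct console output approach, the sketch below converts UTF-8 to UTF-16 and hands it to WriteConsoleW when standard output is a real console, and otherwise writes the raw bytes. `write_utf8` is a hypothetical helper name, the fallback branch is an assumption added so that redirection still works, and error handling is kept minimal.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical helper: writes UTF-8 text to standard output.
// On Windows, when stdout is a console, the text is converted to UTF-16
// and written with WriteConsoleW, bypassing the C and C++ stream layers.
// In every other case (file, pipe, non-Windows) the bytes go out unchanged.
bool write_utf8(const std::string& text)
{
    if (text.empty()) {
        return true;
    }
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    // GetConsoleMode succeeds only for console handles.
    if (out != INVALID_HANDLE_VALUE && out != NULL && GetConsoleMode(out, &mode)) {
        const int byte_len = static_cast<int>(text.size());
        // First call asks for the required UTF-16 length.
        const int wide_len =
            MultiByteToWideChar(CP_UTF8, 0, text.data(), byte_len, NULL, 0);
        if (wide_len <= 0) {
            return false;  // invalid UTF-8 or conversion failure
        }
        std::wstring wide(static_cast<std::wstring::size_type>(wide_len), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, text.data(), byte_len, &wide[0], wide_len);

        std::fflush(stdout);  // keep ordering with earlier buffered stdio output
        DWORD written = 0;
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wide_len),
                             &written, NULL) != 0;
    }
#endif
    return std::fwrite(text.data(), 1, text.size(), stdout) == text.size();
}
```

This only works if the string really contains UTF-8 bytes at run time, which in turn depends on how the compiler encoded the literal.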

Getting correct output with the C++ streams is much more complicated.

It's so complicated that there's no room to describe it (correctly!) here, so I have to instead refer you to my 2-part blog article series about it: part 1 and part 2.

Cheers and hth. - Alf answered Nov 15 '22 00:11