 

Printing Unicode characters in C++

Tags: c++, unicode

I'm trying to write a simple command line app to teach myself Japanese, but can't seem to get Unicode characters to print. What am I missing?

#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
        wcout << L"こんにちは世界\n";
        wcout << L"Hello World\n";
        system("pause");
}

In this example only "Press any key to continue" is displayed. Tested on Visual C++ 2013.

Jeff Linahan asked Sep 19 '13 20:09


2 Answers

This is not easy on Windows. Even when you manage to get the text to the Windows console you still need to configure cmd.exe to be able to display Japanese characters.


#include <iostream>

int main() {
  std::cout << "こんにちは世界\n";
}

This works fine on any system where:

  • The compiler's source and execution encodings include the characters.
  • The output device (e.g., the console) expects text in the same encoding as the compiler's execution encoding.
  • A font with the appropriate characters is available (usually not a problem).

Most platforms these days use UTF-8 by default for all these encodings and so can support the entire Unicode range with code similar to the above. Unfortunately Windows is not one of these platforms.
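One rough way to check the first condition on a given compiler is to compare the bytes of a narrow literal against the UTF-8 bytes spelled out as escapes. This is only a sketch: the names utf8_code_points and literal_is_utf8 are made up for illustration and are not part of any library.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Count code points in a UTF-8 byte string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
std::size_t utf8_code_points(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// True if the compiler stored the literal "こ" (U+3053) as the UTF-8
// bytes E3 81 93, i.e. the execution encoding behaves like UTF-8 here.
bool literal_is_utf8() {
    return std::strcmp("こ", "\xE3\x81\x93") == 0;
}
```

With a UTF-8 execution encoding, "こんにちは世界" occupies 21 bytes (7 characters of 3 bytes each), and utf8_code_points reports 7 for it.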

wcout << L"こんにちは世界\n";

In this line the string literal data is (at compile time) converted from the source encoding to the execution wide encoding and then (at run time) wcout uses the locale it is imbued with to convert the wchar_t data to char data for output. Where things go wrong is that the default locale is only required to support characters from the basic source character set, which doesn't even include all ASCII characters, let alone non-ASCII characters.
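The run-time half of that conversion can be tried in isolation. The sketch below asks the codecvt facet of the classic "C" locale to convert a wide string and reports whether it succeeded. The function name classic_locale_can_convert is hypothetical, and the exact behaviour for non-ASCII input is implementation-defined, so treat it as a probe rather than a guarantee.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Try to convert `text` to narrow chars with the codecvt facet of the
// classic "C" locale, the facet type wide file streams rely on for
// wchar_t -> char conversion. Returns true only if the whole string
// converted without error.
bool classic_locale_can_convert(const std::wstring& text) {
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    std::mbstate_t state{};
    // Room for the worst case the facet reports, plus one spare byte.
    std::string out(text.size() * facet.max_length() + 1, '\0');
    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;

    std::codecvt_base::result r = facet.out(
        state,
        text.data(), text.data() + text.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    return r == std::codecvt_base::ok || r == std::codecvt_base::noconv;
}
```

On the implementations I know of, classic_locale_can_convert(L"Hello World\n") succeeds while a string containing こ does not.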

So the conversion results in an error, putting wcout into a bad state. The error has to be cleared before wcout will function again, which is why the second print statement does not output anything.
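The "bad state" behaviour is ordinary iostream error handling and can be reproduced without a console. In this sketch the failed conversion is simulated by setting failbit by hand on a std::wostringstream (an assumption made for the demo; with wcout the failed write itself sets the error state). The function name second_write_after_failure is made up.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Simulate a stream whose previous write failed, then attempt the
// second write from the question. If `clear_first` is true the error
// state is reset before writing. Returns whatever reached the stream.
std::wstring second_write_after_failure(bool clear_first) {
    std::wostringstream out;

    // Stand-in for the failed Japanese write: mark the stream as failed.
    out.setstate(std::ios_base::failbit);

    if (clear_first) {
        out.clear();  // reset the error flags so output works again
    }

    // While failbit is set, operator<< does nothing.
    out << L"Hello World\n";
    return out.str();
}
```

Applied to the question's program, calling wcout.clear() after the first statement should let "Hello World" through, although the Japanese text is still lost.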


You can work around this for a limited range of characters by imbuing wcout with a locale that will successfully convert the characters. Unfortunately the encoding that is needed to support the entire Unicode range this way is UTF-8. Although Microsoft's implementation of streams supports other multibyte encodings, it very specifically does not support UTF-8.

For example:

wcout.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>()));

SetConsoleOutputCP(CP_UTF8);

wcout << L"こんにちは世界\n";

Here wcout will correctly convert the string to UTF-8, and if the output were written to a file instead of the console then the file would contain the correct UTF-8 data. However the Windows console, even though configured here to accept UTF-8 data, simply will not accept UTF-8 data written in this way.
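The file case can be checked with a wide file stream instead of wcout. This is a sketch under a few assumptions: the helper name write_wide_as_utf8 and the file path are invented, and <codecvt> is deprecated since C++17, so some compilers will warn about it.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Write `text` to `path` through a wide file stream imbued with a
// UTF-8 codecvt facet, then read the file back as raw bytes.
std::string write_wide_as_utf8(const std::string& path,
                               const std::wstring& text) {
    {
        std::wofstream out;
        // Imbue before opening so the facet applies to every character.
        out.imbue(std::locale(std::locale::classic(),
                              new std::codecvt_utf8_utf16<wchar_t>()));
        out.open(path, std::ios::binary);
        out << text;
    }  // the stream is flushed and closed here

    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

For L"こん" the bytes read back should be E3 81 93 E3 82 93, which is the UTF-8 encoding of those two characters.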


There are a few options:

  • Avoid the standard library entirely:

    DWORD n;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"こんにちは世界\n", 8, &n, nullptr);
    
  • Use a non-standard magical incantation that will break standard code:

    #include <fcntl.h>
    #include <io.h>
    
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::wcout << L"こんにちは世界\n";
    

    After setting this mode std::cout << "Hello, World"; will crash.

  • Use a low level IO API along with manual conversion:

    #include <codecvt>
    #include <cstdio>
    #include <locale>
    
    SetConsoleOutputCP(CP_UTF8);
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
    std::puts(convert.to_bytes(L"こんにちは世界\n").c_str());
    

Using any of these methods, cmd.exe will display the correct text to the best of its ability, by which I mean it will display unreadable boxes. Seven little boxes, for the given string.

[image: Little Boxes]

You can copy the text out of cmd.exe and into notepad.exe or whatever to see the correct glyphs.
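For the "manual conversion" option, the UTF-16 to UTF-8 step can also be written by hand, which avoids <codecvt> (deprecated since C++17). The following is a minimal sketch, not a validated converter: utf16_to_utf8 and print_utf8 are hypothetical names, and unpaired surrogates are encoded as-is rather than rejected.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Encode UTF-16 text as UTF-8. Surrogate pairs are combined into one
// code point; unpaired surrogates are not diagnosed in this sketch.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed too
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Write the converted bytes with the C stdio layer. On Windows the
// caller would first call SetConsoleOutputCP(CP_UTF8), as above.
void print_utf8(const std::u16string& text) {
    std::fputs(utf16_to_utf8(text).c_str(), stdout);
}
```

A call such as print_utf8(u"こんにちは世界\n") then plays the role of the std::puts line in the third option.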

bames53 answered Oct 09 '22 15:10


There's a whole article about dealing with Unicode in the Windows console:

http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/

Basically, you can implement your own streambuf for std::cout (or std::wcout) in terms of WriteConsoleW and enjoy writing UTF-8 (or whatever Unicode encoding you want) to the Windows console without depending on locales or console code pages, and even without using wide characters.
It may not look very straightforward, but it's a convenient and reusable solution, and it also gives you portable, utf8-everywhere style user code. Please, don't beat me for my English :)
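A rough sketch of that idea follows. It is not the code from the linked articles: the class name Utf8ConsoleBuf, the Utf16Sink callback and the helper capture_as_utf16 are all invented here, and the sink is a plain callback so the example stays portable. On Windows the sink is where a WriteConsoleW call would go. The decoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <functional>
#include <ostream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical sink: receives decoded UTF-16 text. On Windows this is
// where a call to WriteConsoleW would go.
using Utf16Sink = std::function<void(const std::u16string&)>;

// Decode UTF-8 into UTF-16. Assumes well-formed input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        for (; extra > 0 && i < in.size(); --extra) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (cp >= 0x10000) {  // needs a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}

// Collects the UTF-8 bytes written to the stream and hands them to the
// sink as UTF-16 whenever the stream is flushed.
class Utf8ConsoleBuf : public std::streambuf {
public:
    explicit Utf8ConsoleBuf(Utf16Sink sink) : sink_(std::move(sink)) {}

protected:
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof())) {
            pending_ += traits_type::to_char_type(ch);
        }
        return traits_type::not_eof(ch);
    }

    int sync() override {
        if (!pending_.empty()) {
            sink_(utf8_to_utf16(pending_));
            pending_.clear();
        }
        return 0;
    }

private:
    Utf16Sink sink_;
    std::string pending_;
};

// Demo helper: push UTF-8 bytes through an ordinary std::ostream and
// return what the sink received.
std::u16string capture_as_utf16(const std::string& utf8) {
    std::u16string received;
    Utf8ConsoleBuf buf([&received](const std::u16string& s) { received += s; });
    std::ostream os(&buf);
    os << utf8 << std::flush;
    return received;
}
```

To use something like this with std::cout, the buffer would be installed with std::cout.rdbuf(&buf) and kept alive for as long as std::cout is used. One limitation of the sketch: a flush in the middle of a multi-byte sequence would split a character, so a real implementation would hold back incomplete sequences.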

user2665887 answered Oct 09 '22 16:10