How to print UTF-8 strings to std::cout on Windows?

I'm writing a cross-platform application in C++. All strings are UTF-8-encoded internally. Consider the following simplified code:

    #include <string>
    #include <iostream>

    int main() {
        std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
        std::cout << test;

        return 0;
    }

On Unix systems, std::cout expects 8-bit strings to be UTF-8-encoded, so this code works fine.

On Windows, however, std::cout expects 8-bit strings to be in Latin-1 or a similar non-Unicode format (depending on the codepage). This leads to the following output:

Greek: ╬▒╬▓╬│╬┤; German: ├£bergr├Â├ƒentr├ñger

What can I do to make std::cout interpret 8-bit strings as UTF-8 on Windows?

This is what I tried:

    #include <string>
    #include <iostream>
    #include <io.h>
    #include <fcntl.h>

    int main() {
        _setmode(_fileno(stdout), _O_U8TEXT);
        std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
        std::cout << test;

        return 0;
    }

I was hoping that _setmode would do the trick. However, this results in the following assertion error in the line that calls operator<<:

Microsoft Visual C++ Runtime Library

Debug Assertion Failed!

Program: d:\visual studio 2015\Projects\utf8test\Debug\utf8test.exe
File: minkernel\crts\ucrt\src\appcrt\stdio\fputc.cpp
Line: 47

Expression: ( (_Stream.is_string_backed()) || (fn = _fileno(_Stream.public_stream()), ((_textmode_safe(fn) == __crt_lowio_text_mode::ansi) && !_tm_unicode_safe(fn))))

For information on how your program can cause an assertion failure, see the Visual C++ documentation on asserts.

Daniel Wolf asked Aug 08 '17 18:08


People also ask

Does std::string support UTF-8?

UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

What is the code page for UTF-8?

Windows XP and later, including all supported Windows versions, provide code page 65001 as a synonym for UTF-8 (support for UTF-8 is better since Windows 7).

Is std::string Unicode?

Because std::string works with char, it can store UTF-8-encoded Unicode text as is. Note that std::string, like the C string API, will consider the UTF-8 string "olé" to have 4 characters (bytes), not 3. So be careful when truncating or otherwise manipulating Unicode text byte by byte, because cutting a multi-byte sequence apart yields invalid UTF-8.
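To make the difference between bytes and code points concrete, here is a minimal sketch. The helper name count_code_points is made up for illustration, and the function assumes its input is valid UTF-8. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // ASCII byte or lead byte of a multi-byte sequence
        }
    }
    return count;
}
```

For the string "ol\xC3\xA9" (the UTF-8 bytes of "olé"), size() reports 4 while this helper reports 3.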


2 Answers

At last, I've got it working. This answer combines input from Miles Budnek, Paul, and mkluwe with some research of my own. First, let me start with code that will work on Windows 10. After that, I'll walk you through the code and explain why it won't work out of the box on Windows 7.

    #include <string>
    #include <iostream>
    #include <Windows.h>
    #include <cstdio>

    int main() {
        // Set console code page to UTF-8 so console knows how to interpret string data
        SetConsoleOutputCP(CP_UTF8);

        // Enable buffering to prevent VS from chopping up UTF-8 byte sequences
        setvbuf(stdout, nullptr, _IOFBF, 1000);

        std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
        std::cout << test << std::endl;
    }

The code starts by setting the code page, as suggested by Miles Budnek. This will tell the console to interpret the byte stream it receives as UTF-8, not as some variation of ANSI.

Next, there is a problem in the STL code that comes with Visual Studio. std::cout prints its data to a stream buffer of type std::basic_filebuf. When that buffer receives a string (via std::basic_streambuf::sputn()), it won't pass it on to the underlying file as a whole. Instead, it will pass each byte separately. As explained by mkluwe, if the console receives a UTF-8 byte sequence as individual bytes, it won't interpret them as a single code point. Instead, it will treat them as multiple characters. Each byte of a multi-byte UTF-8 sequence is invalid on its own, so you'll see �'s instead. There is a related bug report for Visual Studio, but it was closed as By Design.

The workaround is to enable buffering for the stream. As an added bonus, that will give you better performance. However, you may now need to flush the stream regularly, as I do with std::endl, or your output may not show.
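To illustrate what "chopping up" a UTF-8 byte sequence means, here is a small portable sketch. It is not part of the fix above. The helper name incomplete_tail_length is hypothetical, and the function assumes the chunk is otherwise valid UTF-8. It reports how many bytes at the end of a chunk belong to a sequence whose remaining bytes have not arrived yet:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns how many bytes at the end of `chunk` belong to
// a UTF-8 sequence that is not yet complete. 0 means the chunk ends on a
// code point boundary. Assumes the chunk is otherwise valid UTF-8.
std::size_t incomplete_tail_length(const std::string& chunk) {
    const std::size_t n = chunk.size();
    // An incomplete sequence is at most 3 bytes long, so only the last
    // 3 bytes need to be inspected.
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char byte = static_cast<unsigned char>(chunk[n - back]);
        if ((byte & 0xC0) == 0x80) {
            continue;  // continuation byte, keep looking for the lead byte
        }
        std::size_t expected = 1;                        // 0xxxxxxx: ASCII
        if ((byte & 0xE0) == 0xC0) expected = 2;         // 110xxxxx
        else if ((byte & 0xF0) == 0xE0) expected = 3;    // 1110xxxx
        else if ((byte & 0xF8) == 0xF0) expected = 4;    // 11110xxx
        return back < expected ? back : 0;
    }
    return 0;
}
```

A chunk that ends in the single byte 0xCE (the first half of "α") yields 1, which is exactly the situation the console cannot handle when it is fed one byte at a time.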

Lastly, the Windows console supports both raster fonts and TrueType fonts. As pointed out by Paul, raster fonts will simply ignore the console's code page. So non-ASCII Unicode characters will only work if the console is set to a TrueType font. Up until Windows 7, the default is a raster font, so the user will have to change it manually. Luckily, Windows 10 changes the default font to Consolas, so this part of the problem should solve itself with time.
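Since the application in the question is cross-platform, the two Windows-specific calls could be wrapped in a helper that compiles to a no-op elsewhere. This is only a sketch: the name init_utf8_console is hypothetical, the buffer size of 1000 is simply the value used above, and on non-Windows platforms it assumes the terminal already expects UTF-8.

```cpp
#include <cstdio>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: call once at the start of main(), before any output.
// Returns false if one of the setup steps failed.
bool init_utf8_console() {
#ifdef _WIN32
    // Tell the console to interpret the bytes it receives as UTF-8.
    if (!SetConsoleOutputCP(CP_UTF8)) {
        return false;
    }
    // Fully buffer stdout so multi-byte sequences are not written byte by byte.
    if (std::setvbuf(stdout, nullptr, _IOFBF, 1000) != 0) {
        return false;
    }
#endif
    // Other platforms: nothing to do under the assumption stated above.
    return true;
}
```

With full buffering enabled, output still has to be flushed (for example with std::endl or std::flush) whenever it should become visible.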

Daniel Wolf answered Sep 21 '22 00:09


The problem is not std::cout but the Windows console. Using C stdio, you will get the ü with fputs( "\xc3\xbc", stdout ); after setting the UTF-8 code page (using either SetConsoleOutputCP or chcp) and selecting a Unicode-capable font in cmd's settings (Consolas should support over 2000 characters, and there are registry hacks to add more capable fonts to cmd).

If you output one byte after the other with putchar('\xc3'); putchar('\xbc'); you will get the double tofu, because the console interprets each byte separately as an illegal character. This is probably what the C++ streams do.

See UTF-8 output on Windows console for a lengthy discussion.

For my own project, I finally implemented a std::stringbuf that does the conversion to Windows-1252. If you really need full Unicode output, however, this will not help you.
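The conversion itself is not shown in this answer. As a rough idea of what it involves, here is a minimal sketch under stated assumptions: the function name utf8_to_cp1252 is hypothetical, and it only maps ASCII, U+00A0 to U+00FF (which have the same values in Windows-1252) and the euro sign. Everything else, including malformed input, becomes '?'. It does not reject overlong encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to Windows-1252 for a limited set of
// code points (ASCII, U+00A0..U+00FF, euro sign). Anything else becomes '?'.
std::string utf8_to_cp1252(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t length = 1;
        char32_t code_point = lead;
        if ((lead & 0xE0) == 0xC0)      { length = 2; code_point = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; code_point = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; code_point = lead & 0x07; }
        else if (lead >= 0x80)          { code_point = 0xFFFD; }  // stray byte

        // Fold in the continuation bytes (10xxxxxx) of a multi-byte sequence.
        for (std::size_t k = 1; k < length; ++k) {
            const bool missing = i + k >= utf8.size();
            const unsigned char next =
                missing ? 0 : static_cast<unsigned char>(utf8[i + k]);
            if (missing || (next & 0xC0) != 0x80) {
                code_point = 0xFFFD;  // truncated or malformed sequence
                length = k;           // resume right after the bytes consumed
                break;
            }
            code_point = (code_point << 6) | (next & 0x3F);
        }
        i += length;

        if (code_point < 0x80 || (code_point >= 0xA0 && code_point <= 0xFF)) {
            out += static_cast<char>(code_point);  // same value in Windows-1252
        } else if (code_point == 0x20AC) {
            out += '\x80';                         // euro sign
        } else {
            out += '?';                            // not representable here
        }
    }
    return out;
}
```

With this sketch, the German umlauts from the question survive the conversion, while the Greek letters are replaced by question marks, which is the limitation mentioned above.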

An alternative approach would be replacing cout's streambuf, using fputs for the actual output:

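    #include <cstdio>
    #include <iostream>
    #include <sstream>

    #include <Windows.h>

    // String buffer that hands its whole content to fputs on every flush
    class MBuf: public std::stringbuf {
    public:
        int sync() override {
            fputs( str().c_str(), stdout );
            str( "" );
            return 0;
        }
    };

    int main() {
        SetConsoleOutputCP( CP_UTF8 );
        setvbuf( stdout, nullptr, _IONBF, 0 );
        MBuf buf;
        // Keep the original buffer so it can be restored before buf is destroyed
        std::streambuf* original = std::cout.rdbuf( &buf );
        std::cout << u8"Greek: αβγδ\n" << std::flush;
        std::cout.rdbuf( original );
    }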

I turned off output buffering here to prevent it from interfering with unfinished UTF-8 byte sequences.

mkluwe answered Sep 18 '22 00:09