I have Microsoft Visual Studio 2010 on Windows 7, 64-bit. (In the project properties, "Character Set" is set to "Not Set"; however, every setting leads to the same output.)
Source code:
#include <cstdio>
#include <iostream>
using namespace std;

bool set_codepage(); // defined in a separate .cpp, see *1
char const charTest[] = "árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP\n";

int main()
{
    cout << charTest;
    printf(charTest);
    if (set_codepage()) // SetConsoleOutputCP(CP_UTF8); // *1
        cerr << "DEBUG: set_codepage(): OK" << endl;
    else
        cerr << "DEBUG: set_codepage(): FAIL" << endl;
    cout << charTest;
    printf(charTest);
}
*1: Including windows.h messes things up, so I include it from a separate .cpp file.
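For reference, a plausible shape of that helper, assuming it simply wraps SetConsoleOutputCP (the function name comes from the question; the body is a sketch):

// set_codepage.cpp -- windows.h is isolated in this translation unit
// so its macros (min/max, etc.) don't leak into the rest of the project.
#define WIN32_LEAN_AND_MEAN
#include <windows.h>

bool set_codepage()
{
    // CP_UTF8 == 65001; SetConsoleOutputCP returns nonzero on success.
    return SetConsoleOutputCP(CP_UTF8) != 0;
}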
The compiled binary contains the string as a correct UTF-8 byte sequence. If I set the console to UTF-8 with chcp 65001 and issue type main.cpp, the string displays correctly.
Test (console set to use Lucida Console font):
D:\dev\user\geometry\Debug>chcp
Active code page: 852
D:\dev\user\geometry\Debug>listProcessing.exe
├írv├şzt┼▒r┼Ĺ t├╝k├Ârf├║r├│g├ęp ├üRV├ŹZT┼░R┼É T├ťK├ľRF├ÜR├ôG├ëP
├írv├şzt┼▒r┼Ĺ t├╝k├Ârf├║r├│g├ęp ├üRV├ŹZT┼░R┼É T├ťK├ľRF├ÜR├ôG├ëP
DEBUG: set_codepage(): OK
��rv��zt��r�� t��k��rf��r��g��p ��RV��ZT��R�� T��K��RF��R��G��P
árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP
What is the explanation behind this? Can I somehow get cout to work like printf does?
ATTACHMENT
Many say that the Windows console does not support UTF-8 characters at all. I'm a Hungarian guy in Hungary; my Windows is set to English (except date formats, which are set to Hungarian), and Cyrillic letters are still displayed correctly alongside Hungarian letters:
(My default console codepage is CP852)
Some background on the terms: Unicode is a character set, a list of characters with unique numbers (code points), while UTF-8 and UTF-16 are encodings of that set. An ANSI code page is a single-byte encoding covering one particular alphabet; UTF-8 is a variable-length encoding (one to four bytes per character, ASCII-compatible in its one-byte range) that can represent every Unicode character; UTF-16 encodes each character as two or four bytes. UTF-8 is one of the most widely used encodings today.
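As a concrete illustration of the variable length, here is a small sketch that prints the UTF-8 byte counts of some characters from the test string (escape sequences are used so the source-file encoding doesn't matter):

#include <cstring>
#include <iostream>

int main()
{
    // In UTF-8, ASCII stays one byte while accented letters grow:
    // 'a' is 1 byte, 'á' (U+00E1) and 'ű' (U+0171) are 2 bytes each.
    std::cout << std::strlen("a") << '\n';          // prints 1
    std::cout << std::strlen("\xC3\xA1") << '\n';   // "á" -> prints 2
    std::cout << std::strlen("\xC5\xB1") << '\n';   // "ű" -> prints 2
}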
The difference here is in how the C++ runtime and the C library handle the system locale.
To achieve the same result with std::cout, you can try the std::ios::imbue method together with std::locale.
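A minimal sketch of that approach (".65001" is MSVC's locale name for the UTF-8 code page; older runtimes, including VS2010's, may reject it and throw, and whether a narrow stream actually converts anything is implementation-defined):

#include <iostream>
#include <locale>
#include <stdexcept>

int main()
{
    try {
        // The locale constructor throws std::runtime_error
        // if the named locale is unsupported.
        std::locale utf8(".65001");
        std::cout.imbue(utf8);
    } catch (std::runtime_error const& e) {
        std::cerr << "UTF-8 locale not supported: " << e.what() << '\n';
    }
    std::cout << "árvíztűrő tükörfúrógép\n";
}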
But the main issue with UTF-8 and C++ is described here:
C++03 offers two kinds of string literals. The first kind, contained within double quotes, produces a null-terminated array of type const char. The second kind, defined as L"", produces a null-terminated array of type const wchar_t, where wchar_t is a wide-character. Neither literal type offers support for string literals with UTF-8, UTF-16, or any other kind of Unicode encodings.
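In code, the two C++03 literal kinds look like this; the Unicode prefixes shown in comments were only added in C++11 and are not available in VS2010:

const char    narrow[] = "text";  // narrow literal, execution character set
const wchar_t wide[]   = L"text"; // wide literal (UTF-16 code units on Windows)

// C++11 later added explicit Unicode literals, unavailable in C++03/VS2010:
//   u8"text"  ->  const char[]      (UTF-8)
//   u"text"   ->  const char16_t[]  (UTF-16)
//   U"text"   ->  const char32_t[]  (UTF-32)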
So in any case it is all implementation-specific and thus non-portable, because none of the standard C++ output streams understands UTF-8.