
Explanation needed for a UTF-8 vs. cpp case

I have Microsoft Visual Studio 2010 on Windows 7 64-bit. (In the project properties, "Character set" is set to "Not set"; however, every setting leads to the same output.)

Source code:

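  #include <cstdio>
  #include <iostream>
  using namespace std;

  // Defined in a separate cpp (see *1); the bool return type is assumed here.
  bool set_codepage();

  int main()
  {
    char const charTest[] = "árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP\n";
    cout << charTest;   // C++ stream output
    printf(charTest);   // C stdio output
    if(set_codepage()) // SetConsoleOutputCP(CP_UTF8); // *1
      cerr << "DEBUG: set_codepage(): OK" << endl;
    else
      cerr << "DEBUG: set_codepage(): FAIL" << endl;
    cout << charTest;
    printf(charTest);
  }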

*1: Including windows.h messes up things, so I'm including it from a separate cpp.
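
The footnote above can be sketched as a separate translation unit, so that windows.h never reaches main.cpp. This is only a minimal sketch, not necessarily the helper actually used here: the bool return type and the non-Windows fallback are assumptions, while SetConsoleOutputCP and CP_UTF8 are the Win32 names quoted in the source comment.

```cpp
// set_codepage.cpp (hypothetical file name)
#ifdef _WIN32
#include <windows.h>
#endif

// Switch the console output code page to UTF-8.
// Returns true on success. Outside Windows there is no console code page,
// so this sketch simply reports false (an assumption made for portability).
bool set_codepage() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return false;
#endif
}
```

main.cpp then needs only the declaration `bool set_codepage();` and no Windows headers.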

The compiled binary contains the string as a correct UTF-8 byte sequence. If I set the console to UTF-8 with chcp 65001 and run type main.cpp, the string displays correctly.
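
One way to check the byte sequence from inside the program, rather than by inspecting the binary, is to dump the bytes as hex. This is a small sketch; the helper name to_hex is made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Render each byte of `text` as two lowercase hex digits, separated by spaces.
std::string to_hex(const std::string& text) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char b = static_cast<unsigned char>(text[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
    return out;
}
```

For example, to_hex("\xc3\xa1") gives "c3 a1", which is the UTF-8 encoding of á (U+00E1).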

Test (console set to use Lucida Console font):

D:\dev\user\geometry\Debug>chcp
Active code page: 852

D:\dev\user\geometry\Debug>listProcessing.exe
├írv├şzt┼▒r┼Ĺ t├╝k├Ârf├║r├│g├ęp ├üRV├ŹZT┼░R┼É T├ťK├ľRF├ÜR├ôG├ëP
├írv├şzt┼▒r┼Ĺ t├╝k├Ârf├║r├│g├ęp ├üRV├ŹZT┼░R┼É T├ťK├ľRF├ÜR├ôG├ëP
DEBUG: set_codepage(): OK
��rv��zt��r�� t��k��rf��r��g��p ��RV��ZT��R�� T��K��RF��R��G��P
árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP

What is the explanation behind that? Can I somehow ask cout to work as printf?

ATTACHMENT

Many say that the Windows console does not support UTF-8 characters at all. I'm a Hungarian guy in Hungary, my Windows is set to English (except the date formats, which are set to Hungarian), and Cyrillic letters are still displayed correctly alongside Hungarian letters:

[Image: Hungarian and Cyrillic letters on the console at the same time]

(My default console codepage is CP852)

Notinlist asked Sep 22 '12 15:09



1 Answer

The difference here is in how the C++ runtime and the C library handle the system locale.

To achieve the same result with std::cout, you can try the std::ios::imbue method together with std::locale.
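
A minimal sketch of that suggestion follows. The helper names are made up for illustration, and the fallback to the classic "C" locale is an assumption. Whether imbuing actually changes what the console shows depends on the implementation and on the console code page; narrow char data written to a stream is typically passed through byte for byte.

```cpp
#include <locale>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Imbue `out` with the user's preferred locale (the "" locale).
// Returns false and falls back to the classic "C" locale if that
// locale cannot be constructed on this system.
bool imbue_user_locale(std::ostream& out) {
    try {
        out.imbue(std::locale(""));
        return true;
    } catch (const std::runtime_error&) {
        out.imbue(std::locale::classic());
        return false;
    }
}

// Write `text` through a stream imbued as above and return the bytes
// that came out, to show what the stream did with them.
std::string write_through_locale(const std::string& text) {
    std::ostringstream out;
    imbue_user_locale(out);
    out << text;
    return out.str();
}
```

With std::cout, the call would be imbue_user_locale(std::cout) before the first output.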

But the main issue with UTF-8 and C++ is described here:

C++03 offers two kinds of string literals. The first kind, contained within double quotes, produces a null-terminated array of type const char. The second kind, defined as L"", produces a null-terminated array of type const wchar_t, where wchar_t is a wide-character. Neither literal type offers support for string literals with UTF-8, UTF-16, or any other kind of Unicode encodings.

So in any case it is all implementation-specific and thus non-portable, because none of the standard C++ output streams understands UTF-8.
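
To illustrate what "does not understand UTF-8" means in practice: a char string is handled as plain bytes, so its size is a byte count, not a character count. The sketch below counts code points by hand; the function name is made up for illustration, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes `text` is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& text) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(text[i]);
        if ((b & 0xC0) != 0x80) ++n;  // lead byte or ASCII byte
    }
    return n;
}
```

For "árvíz" written as the bytes "\xc3\xa1rv\xc3\xadz", std::string::size() reports 7 while count_code_points reports 5.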

Sergei Nikulov answered Oct 06 '22 00:10