C++ unicode characters printing

Tags: c++, unicode, cout

I need to print some unicode characters on the Linux terminal using iostream. Strange things happen though. When I write:

cout << "\u2780";

I get: ➀, which is almost exactly what I want. However, if I write:

cout << '\u2780';

I get: 14851712.

The problem is, I don't know the exact character to be printed at compile-time. Therefore I'd like to do something like:

int x;
// some calculations...
cout << (char)('\u2780' + x);

Which prints a garbage character. Using wcout or wchar_t instead doesn't work either. How do I get the characters to print correctly?

From what I found around the Internet, it seems relevant that I'm using the g++ 4.7.2 compiler straight from the Debian Wheezy repository.

Sventimir asked Jun 05 '13 16:06



2 Answers

The program prints an integer because of C++11 §2.14.3/1:

A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

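The execution character set is, roughly, what a single char can represent. With g++ a char is one byte, and U+2780 takes three bytes in UTF-8, so the literal falls under this clause.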

What you got is 14851712, or in hexadecimal e29e80, which is the UTF-8 representation of U+2780. Putting UTF-8, a multibyte encoding, into an int is insane and stupid, but that's what you get from a "conditionally supported, implementation-defined" feature.

To get a UTF-32 value, use U'\u2780'. The leading U specifies the char32_t type and UTF-32 encoding (one 32-bit code unit per code point, so no surrogate pairs). The \u inside the quotes introduces a universal-character-name containing the code point. To get a value supposedly compatible with wcout, use L'\u2780', but wchar_t is not guaranteed to hold a Unicode value at run time, nor to be wider than two bytes (it is 16 bits on Windows).

As for reliably manipulating and printing the Unicode codepoint, as other answers have noted, the C++ standard hasn't quite gotten there yet. Joni's answer is the best way, yet it still assumes that the compiler and the user's environment are using the same locale, which often isn't true.

You can also specify UTF-8 strings in the source using u8"\u2780" and force the runtime environment to UTF-8 using something like std::locale::global( std::locale( "en_US.UTF-8" ) );. But that still has rough edges. Joni suggests using the C interface std::setlocale from <clocale> instead of the C++ interface std::locale::global from <locale>, which is a workaround to the C++ interface being broken in GCC on OS X and perhaps other platforms. The issues are platform-sensitive enough that your Linux distro might well have put a patch into their own GCC package.

Potatoswatter answered Sep 26 '22 23:09


The Unicode character \u2780 is outside the range of the char datatype. The compiler should have warned you about it (at least my g++ 4.7.3 does):

test.cpp:6:13: warning: multi-character character constant [-Wmultichar]

If you want to work with characters like U+2780 as single units, you'll have to use the wide-character type wchar_t or, if you are lucky enough to be able to work with C++11, char32_t or char16_t. Note that one 16-bit unit is not enough to represent the full range of Unicode characters.

If that's not working for you, it's probably because the default "C" locale doesn't support non-ASCII output. To fix that, call setlocale at the start of the program. That way you can output the full range of characters supported by the user's locale (which may or may not include all of the characters you use):

#include <clocale>
#include <iostream>

using namespace std;

int main() {
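    // Switch from the default "C" locale to the user's locale.
    setlocale(LC_ALL, "");
    // Wide output is converted to the locale's encoding on the way out.
    wcout << L'\u2780' << L'\n';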
    return 0;
}

Joni answered Sep 24 '22 23:09