How to use Unicode in C++?

Tags: c++, string, unicode

Assuming a very simple program that:

asks for a name.
stores the name in a variable.
displays the variable's content on the screen.

It's so simple that it is the first thing one learns.

But my problem is that I don't know how to do the same thing if I enter the name using Japanese characters.

So, if you know how to do this in C++, please show me an example (that I can compile and test)

Thanks.

user362981 : Thanks for your help. I compiled the code that you wrote without problems, then the console window appeared, but I cannot enter any Japanese characters in it (using an IME). Also, if I change a word in your code ("hello") to one that contains Japanese characters, it does not display them either.

Svisstack : Also thanks for your help. But when I compile your code I get the following error:

    warning: deprecated conversion from string constant to 'wchar_t*'
    error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
    error: at this point in file
    warning: deprecated conversion from string constant to 'wchar_t*'

750

asked Jun 09 '10 23:06

Dox

1 Answer

You're going to get a lot of answers about wide characters. Wide characters, and specifically wchar_t, do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can use an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:

With the wchar_t wide character type, ANSI/ISO C provides for inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.

and that

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.

So, it's implementation-defined. Here are two implementations: On Linux, wchar_t is 4 bytes wide and represents text in the UTF-32 encoding (regardless of the current locale), in whichever byte order, BE or LE, is native to your system. Windows, however, has a 2-byte-wide wchar_t, which holds UTF-16 code units. Completely different.

A better path: learn about locales, as you'll need to know that. For example, because I have my environment set up to use UTF-8 (Unicode), the following program will use Unicode:

    #include <clocale>
    #include <iostream>
    #include <string>

    int main() {
        // Adopt the locale from the user's environment (e.g. LANG=en_US.UTF-8).
        std::setlocale(LC_ALL, "");
        std::cout << "What's your name? ";
        std::string name;
        std::getline(std::cin, name);
        std::cout << "Hello there, " << name << "." << std::endl;
        return 0;
    }

...

    $ ./uni_test
    What's your name? 佐藤 幹夫
    Hello there, 佐藤 幹夫.
    $ echo $LANG
    en_US.UTF-8

But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.

Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)

On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8). Likewise, if you output a UTF-8 "è" (0xC3 0xA8), I'll see (ISO-8859-2) "Ă¨" (0xC3 0xA8). This barfing of incorrect characters has been called Mojibake.

Often, you're just shuffling data around, and the encoding doesn't matter much. It typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or from a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)

Sadly, this is about the state of Unicode support in both C and C++. You have to remember: these languages are really system-agnostic and don't bind to any particular way of doing things. That includes character sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.

In the end, it's not all that complicated really: know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin/std::cin and stdout/std::cout were sometimes in UTF-8, sometimes in ISO-8859-2.

177

answered Sep 19 '22 20:09

Thanatos
