
How can I convert string like "\u94b1" to one real character in C++?

Tags: c++, unicode

We know that in a string literal, "\u94b1" is converted to a single character, in this case the Chinese character '钱'. But if a string literally holds the 6 characters '\', 'u', '9', '4', 'b', '1', how can I convert them to that character manually?

For example:

string s1;
string s2 = "\u94b1";
cin >> s1;            //here I input \u94b1
cout << s1 << endl;   //here output \u94b1
cout << s2 << endl;   //and here output 钱

I want to convert s1 so that cout << s1 << endl; will also output 钱.

Any suggestions?

Eric Zheng asked Jun 01 '16 07:06



2 Answers

In fact the conversion is a little more complicated.

string s2 = "\u94b1";

is in fact the equivalent of:

char cs2[] = { '\xe9', '\x92', '\xb1', 0 }; string s2 = cs2;

That means you are initializing it with the 3 bytes that make up the UTF-8 representation of 钱 (assuming your compiler's execution character set is UTF-8). You can examine s2.c_str() to check that.
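
One way to examine those bytes is to print them in hexadecimal. This is only a minimal sketch; hex_bytes is a hypothetical helper name, not something from the standard library:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the bytes of s as space-separated
// two-digit lowercase hex values.
std::string hex_bytes(const std::string& s) {
    const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
    return out;
}
```

Called on a string holding the bytes 0xe9, 0x92, 0xb1 it returns "e9 92 b1". Called on the 6-character s1 from the question it returns six values instead, starting with 5c for the backslash.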


So to process the 6 raw characters '\', 'u', '9', '4', 'b', '1', you must first extract the code point from s1, which after reading holds the same characters as the literal "\\u94b1". That part is easy: skip the first two characters and read the rest as hexadecimal:

unsigned int ui;
std::istringstream is(s1.substr(2));   // needs <sstream>; skips the leading "\u"
is >> std::hex >> ui;

ui is now 0x94b1.

Now, provided you have a C++11 compliant system, you can convert it with std::codecvt_utf8 from <codecvt> (note that this header is deprecated since C++17):

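wchar_t wc = static_cast<wchar_t>(ui);
std::codecvt_utf8<wchar_t> conv;
const wchar_t *wnext;
char *next;
char cbuf[4] = {0}; // initialize the buffer to 0 to have a terminating null
std::mbstate_t state = std::mbstate_t(); // the conversion state must start zeroed
// write at most 3 bytes so that the terminating null is never overwritten
conv.out(state, &wc, &wc + 1, wnext, cbuf, cbuf + 3, next);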

cbuf now contains the 3 bytes representing 钱 in UTF-8 and a terminating null, and you can finally do:

string s3 = cbuf;
cout << s3 << endl;
Serge Ballesta answered Nov 03 '22 11:11



You do this by writing code that checks whether the string contains a backslash, a letter u, and four hexadecimal digits, and converts those digits to a Unicode code point. A std::string is just a sequence of bytes, and your environment probably expects them to be UTF-8, so you then translate that code point into 1, 2, or 3 UTF-8 bytes.
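
The steps just described could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the names encode_utf8_bmp, hex_value and unescape_u are made up for this example, the output is assumed to be UTF-8, and surrogate values (D800 to DFFF) are not treated specially.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode a code point from the Basic Multilingual Plane (<= 0xFFFF)
// as 1, 2, or 3 UTF-8 bytes.
std::string encode_utf8_bmp(unsigned int cp) {
    assert(cp <= 0xFFFF);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replace every backslash + 'u' + four hex digits with the UTF-8 bytes
// of that code point; everything else is copied unchanged.
std::string unescape_u(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        if (in[i] == '\\' && i + 5 < in.size() && in[i + 1] == 'u') {
            unsigned int cp = 0;
            matched = true;
            for (std::size_t k = 2; k < 6; ++k) {
                int v = hex_value(in[i + k]);
                if (v < 0) { matched = false; break; }
                cp = cp * 16 + static_cast<unsigned int>(v);
            }
            if (matched) {
                out += encode_utf8_bmp(cp);
                i += 6;
            }
        }
        if (!matched) out += in[i++];  // not an escape: copy one character
    }
    return out;
}
```

With this, unescape_u(s1) for the s1 in the question yields the three bytes 0xe9 0x92 0xb1, so cout << unescape_u(s1) should print 钱 on a UTF-8 terminal. An incomplete escape such as \u94 is left as it is.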

For extra points, figure out how to enter code points outside the basic plane.
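
One possible approach to the extra-points part, assuming the input uses a UTF-16 surrogate pair written as two \uXXXX escapes (the convention in JSON and Java; C++ source code itself uses \UXXXXXXXX instead): combine the two halves into one code point and encode it as 4 UTF-8 bytes. The names combine_surrogates and encode_utf8 are hypothetical, and returning U+FFFD for invalid input is just one design choice.

```cpp
#include <cassert>
#include <string>

// Combine a high surrogate (D800..DBFF) and a low surrogate (DC00..DFFF)
// into a code point in the range 0x10000..0x10FFFF.
unsigned long combine_surrogates(unsigned int hi, unsigned int lo) {
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000UL
         + ((static_cast<unsigned long>(hi) - 0xD800UL) << 10)
         + (static_cast<unsigned long>(lo) - 0xDC00UL);
}

// Encode any Unicode scalar value as 1 to 4 UTF-8 bytes.
// Lone surrogates and values above 0x10FFFF become U+FFFD.
std::string encode_utf8(unsigned long cp) {
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL)) {
        cp = 0xFFFDUL;  // replacement character
    }
    std::string out;
    if (cp < 0x80UL) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800UL) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000UL) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the pair D83D, DE00 combines to U+1F600, which encodes as the bytes f0 9f 98 80.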

gnasher729 answered Nov 03 '22 10:11
