 

How does C++ char distinguish ASCII and UNICODE

Tags:

c++

I am currently writing a C++ program that handles both the Latin alphabet and Korean characters.

However, I learned that the size of char in C++ is only 1 byte. This means that in order to handle foreign characters or Unicode, it may take two or more chars to represent one character.

string s = string("a가b나c다");
cout<< s.length();

prints 9

But my question is: how does the C++ runtime distinguish between the two different types of characters?

For example, if I make a char array of size 9, how does it know whether it holds 9 ASCII characters or 4 Unicode characters + 1 ASCII character?

And then I figured out this:

    char c;
    int a;
    const char* cp = "가나다라마바사아";   // string literals need const char* in modern C++
    for (int i = 0; cp[i] != '\0'; i++) {  // stop at the terminator instead of a hard-coded 20
        c = a = cp[i];
        cout << "\n c val : " << c;
        cout << "\n a val : " << a;
    }

It ONLY prints out negative values for a:

 c val :
 a val : -80
 c val :
 a val : -95
 c val :
 a val : -77
 c val :
 a val : -86
 c val :
 a val : -76
 c val :
 a val : -39

From which I can infer that non-ASCII characters only use negative values? But isn't this quite a waste?

My question in summary: does C++ distinguish ASCII chars and Unicode chars only by checking whether they are negative?


Answer in summary: the parser decides whether to treat 1–4 chars as one glyph by looking at the first few bits of each char, so to some extent my assumption was valid.

Asked Nov 16 '17 by NamHo Lee


1 Answer

how does the C++ runtime distinguish between the two different types of characters?

It doesn't. The compiler decided how to encode your string at compile time, using its execution character set. (Judging by the 9-byte length and the byte values you printed, yours appears to be a two-byte Korean encoding such as EUC-KR rather than UTF-8, where each hangul syllable would take three bytes; either way, the principle is the same.)

How does it know whether it holds 9 ASCII characters or 4 Unicode characters + 1 ASCII character?

Again, it doesn't. Your string contains 9 char values (excluding any termination character). The number of actual "characters" (or "glyphs") that represents can only be determined by parsing the string. If you know it's UTF-8, you parse it accordingly.

From which I can infer that non-ASCII characters only use negative values? But isn't this quite a waste?

No. Well, sort of. If you're interested, read a primer on Unicode (specifically UTF-8). You could read the actual standard, but it's enormous. Wikipedia should be sufficient for a better understanding.

You'll see that every byte of a multi-byte sequence has its high bit set. This makes it possible to parse multi-byte values correctly. It's not really that wasteful, because the standard is arranged such that wider encodings are generally reserved for less common values.

The reason it outputs negatives is that char is signed on your platform. If you cast to unsigned char, you'll see the values are simply greater than 127. When you read more about how UTF-8 is encoded, you'll understand why.

My question in summary: does C++ distinguish ASCII chars and Unicode chars only by checking whether they are negative?

My answer in summary: No. "Negative" is a numeric system. You are probably accustomed to 2's-complement. Encode, or encode not: there is no "negative".

Answered Sep 18 '22 by paddy