
How can char store two numbers?

Tags: c, char, cyrillic

The situation is this: I have the Cyrillic symbol "б". Running the following code:

#include <stdio.h>

int main(void) {
    char c;
    scanf("%c", &c);
    printf("%d\n", c);
    return 0;
}

It prints -48. But when I inspect the variable c in the debugger, it shows the following: -48 '\320'.

So how does this work? Is this a pointer to a 2-length array? Or how is it able to store two numbers?

V. Dalechyn asked Jan 27 '23 01:01


1 Answer

A char variable may be used to store either a small integer (see note 1) or a character (more properly, a code unit) in some not-so-well-defined, generally ASCII-based encoding. Here the debugger is just trying to be helpful by displaying two (arguably) meaningful representations of the content of c.


Let's imagine for a moment that you had typed a instead of б; in that case, the debugger would show something like

c = {char} 97 'a'

because the actual number stored in c is 97, and, decoded as ASCII, it corresponds to the letter a.

Unfortunately, the idea that you can fit every possible character in a single 8-bit char value is completely flawed, so the most widespread encoding used nowadays (UTF-8), which happens to be the one in use on your machine, requires multiple code units (≈bytes) to represent a single code point (≈logical character). In particular, б (U+0431) is represented as a sequence of two bytes, 0xD0 followed by 0xB1.

C knows nothing about UTF-8 or code points; if you pass %c to scanf, it reads a single byte, regardless of whether that byte is enough to represent a full UTF-8 code point. So only the first of those two bytes was read, and c contains just the 0xD0 value; the 0xB1 is still in the input buffer, waiting to be read.

Coming back to the value displayed by the debugger: first of all, note that on your platform (as, unfortunately, on many platforms) char is signed. Hence the 0xD0 byte, interpreted as a signed value, is -48: 0xD0 = 208, which is above the maximum of 127 and therefore "wraps around" to 208 - 256 = -48.

As for '\320': the debugger would like to display the ASCII representation of that value; however, the byte 0xD0 is outside the ASCII character range (see note 2), so it is displayed as an escape sequence instead. You may be familiar with '\n' for the newline character or '\0' for the NUL character; in general, in C a \ followed by one to three octal digits means the byte with the corresponding octal value. 0320 is indeed octal for 208 (3 × 64 + 2 × 8 + 0), which is decimal for 0xD0.

So, no mystery here: c still contains a single value (which is just "half" of your character); what you are seeing are just two (equally inconvenient) representations of its content.


Notes

  1. On most platforms, [-128, 127] or [0, 255], depending on the signedness of char (which, unfortunately, is implementation-defined).
  2. Indeed, UTF-8 extends ASCII by using only bytes with the high bit set (unused by ASCII) for its multibyte sequences; this ensures that they cannot be mistaken for ASCII text.

Matteo Italia answered Feb 07 '23 16:02