 

Unicode stored in C char

Tags: c, unicode, ascii

I'm learning the C language on Linux and I've come across a slightly weird situation.

As far as I know, the standard C char data type is ASCII-based, 1 byte (8 bits). That should mean it can hold only ASCII characters.

In my program I use char input[], which is filled by the getchar function, as in this pseudocode:

char input[20];
int z, i;
for (i = 0; i < 20; i++) {
    z = getchar();
    input[i] = z;
}

The weird thing is that it works not only for ASCII characters, but for any character I imagine, such as @&@{čřžŧ¶'`[łĐŧđж←^€~[←^ø{&}čž on the input.

My question is: how is this possible? It seems to be one of C's many beautiful exceptions, but I would really appreciate an explanation. Is it a matter of the OS, the compiler, or some hidden additional super-feature of the language?

Thanks.

asked Apr 04 '12 by Miroslav Mares



2 Answers

There is no magic here: the C language gives you access to the raw bytes as they are stored in the computer's memory. If your terminal is using UTF-8 (which is likely), non-ASCII characters take more than one byte in memory. When you display them again, it is your terminal's code that converts these sequences into a single displayed character.

Just change your code to print the strlen of the strings, and you will see what I mean.
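For example, a minimal sketch (assuming a UTF-8 locale, where "č" is stored as the two bytes 0xc4 0x8d):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "č";        /* one displayed character    */
    printf("%zu\n", strlen(s)); /* prints 2: two bytes stored */
    return 0;
}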

To properly handle UTF-8 non-ASCII characters in C you have to use a library that handles them for you, such as GLib, Qt, or many others.
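For instance, a sketch using GLib's g_utf8_strlen, which counts characters rather than bytes (assuming the input is valid UTF-8; compile against glib-2.0):

#include <glib.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "čřž";
    printf("bytes: %zu, characters: %ld\n",
           strlen(s), g_utf8_strlen(s, -1)); /* 6 bytes, 3 characters */
    return 0;
}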

answered Oct 03 '22 by jsbueno


ASCII is a 7-bit character set, in C normally represented by an 8-bit char. If the highest bit in an 8-bit byte is set, it is not an ASCII character.

Also notice that you are not guaranteed ASCII as the base, though many ignore other scenarios. In other words, if you want to check whether a "primitive" byte is an alphabetic character, and you take heed of all systems, you cannot simply say:

is_alpha = (c > 0x40 && c < 0x5b) || (c > 0x60 && c < 0x7b); 

Instead you'll have to use ctype.h and say:

isalpha(c); 

The only exception, AFAIK, is the digits: on most tables, at least, they have contiguous values. (The C standard in fact guarantees that '0' through '9' are contiguous.)

Thus this works:

char ninec  = '9';
char eightc = '8';

int nine  = ninec  - '0';
int eight = eightc - '0';

printf("%d\n", nine);
printf("%d\n", eight);

But this is not guaranteed to be 'a':

alpha_a = 0x61;

Systems not based on ASCII, e.g. those using EBCDIC, run C just fine, but there they (mostly) use 8 bits instead of 7, and e.g. 'A' can be coded as decimal 193, not 65 as it is in ASCII.
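A quick way to see which base your own system uses; the numeric value of 'A' comes from the execution character set:

#include <stdio.h>

int main(void) {
    /* Prints 65 on an ASCII-based system, 193 on EBCDIC. */
    printf("%d\n", 'A');
    return 0;
}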


For ASCII, however, bytes with decimal values 128-255 (8 bits in use) are extended and not part of the ASCII set. E.g. ISO-8859 uses this range.

What is often done is also to combine two or more bytes into one character. So if you print two bytes after each other that are defined as, say, UTF-8's 0xc3 0x98 == Ø, then you'll get this character.
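For example, this minimal sketch emits those two raw bytes; on a UTF-8 terminal they render as the single glyph Ø:

#include <stdio.h>

int main(void) {
    putchar(0xc3); /* lead byte         */
    putchar(0x98); /* continuation byte */
    putchar('\n');
    return 0;
}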

This again depends on which environment you are in. On many systems/environments, printing ASCII values gives the same result across character sets, systems, etc. But printing bytes > 127, or double-byte characters, gives a different result depending on the local configuration.

E.g.:

Mr. A running the program gets

Jasŋ€

While Mr. B gets

Jasπß

This is perhaps especially relevant to the ISO-8859 series and Windows-1252, single-byte representations of extended characters, etc.

  • ASCII_printable_characters; notice they are 7, not 8, bits.
  • ISO_8859-1 and ISO_8859-15, widely used sets, with ASCII as core.
  • Windows-1252, legacy of Windows.

  • UTF-8#Codepage_layout. In UTF-8 you have ASCII, then you have special sequences of bytes.
    • Each sequence starts with a byte > 127 (127 being the last ASCII value),
    • followed by a given number of bytes which all start with the bits 10.
    • In other words, you will never find an ASCII byte in a multi-byte UTF-8 representation.

That is, the first byte of a UTF-8 sequence, if not ASCII, tells how many bytes this character has. You could also say that ASCII characters say no more bytes follow, because the highest bit is 0.

E.g. if a file is interpreted as UTF-8:

c = fgetc(fp);

if c  < 128, 0x80, then ASCII
if c == 194, 0xC2, then one more byte follows, interpret to symbol
if c == 226, 0xE2, then two more bytes follow, interpret to symbol
...
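As a sketch in C, the test is plain bit masking on the lead byte (this assumes well-formed UTF-8; real code must also validate the continuation bytes):

#include <stdio.h>

/* Number of continuation bytes implied by a UTF-8 lead byte. */
static int utf8_extra_bytes(unsigned char c) {
    if (c < 0x80)           return 0; /* 0xxxxxxx: ASCII, stands alone   */
    if ((c & 0xE0) == 0xC0) return 1; /* 110xxxxx: one byte follows      */
    if ((c & 0xF0) == 0xE0) return 2; /* 1110xxxx: two bytes follow      */
    if ((c & 0xF8) == 0xF0) return 3; /* 11110xxx: three bytes follow    */
    return -1;                        /* 10xxxxxx: a stray "follow" byte */
}

int main(void) {
    printf("%d\n", utf8_extra_bytes(0xC4)); /* prints 1: "č" starts with 0xC4 */
    return 0;
}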

As an example, let's look at one of the characters you mention. In a UTF-8 terminal:

$ echo -n "č" | xxd

Should yield:

0000000: c48d ..

In other words, "č" is represented by the two bytes 0xc4 and 0x8d. Add -b to the xxd command and we get the binary representation of the bytes. We dissect them as follows:

 ___  byte 1 ___     ___ byte 2 ___
|               |   |              |
0xc4 : 1100 0100    0x8d : 1000 1101
       |                    |
       |                    +-- all "follow" bytes start with 10, rest: 00 1101
       |
       + 11 -> 2 bits set = two-byte symbol, the "bits set" sequence
               ends with 0. (here 3 bits are used: 110) : rest 0 0100

Rest bits combined: xxx0 0100 xx00 1101 => 00100001101
                       \____/   \_____/
                         |        |
                         |        +--- From last byte
                         +------------ From first byte

This gives us: 00100001101 in binary = 269 in decimal = 0x10D => Unicode code point U+010D == "č".

This number can also be used in HTML as &#269; == č
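The same combination can be done in C with shifts and masks; a minimal sketch, for the two-byte case only:

#include <stdio.h>

int main(void) {
    /* Two-byte UTF-8: the lead byte contributes its low 5 bits,
       the continuation byte its low 6 bits. */
    unsigned char b1 = 0xC4, b2 = 0x8D;
    unsigned int cp = ((unsigned)(b1 & 0x1F) << 6) | (unsigned)(b2 & 0x3F);
    printf("U+%04X\n", cp); /* prints U+010D */
    return 0;
}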

Common to this and lots of other code systems is that an 8-bit byte is the base.


Often it is also a question of context. As an example, take GSM SMS, with ETSI GSM 03.38/03.40 (3GPP TS 23.038, 3GPP 23038). There we also find a 7-bit character table, the 7-bit GSM default alphabet, but instead of being stored as 8 bits each, the characters are stored as 7 bits.1 This way you can pack more characters into a given number of bytes: a standard SMS of 160 characters becomes 1280 bits, or 160 bytes, as ASCII, but only 1120 bits, or 140 bytes, as SMS.

1 Not without exception (there is more to the story).

E.g. here is a simple example of bytes saved as septets (7-bit), C8329BFD06 in SMS UDP format, converted to ASCII:

                                _________
7 bit UDP represented          |         +--- Alphas have same bits as ASCII
as 8 bit hex                   '0.......'
C8329BFDBEBEE56C32              1100100 d * Prev last 6 bits + pp 1
 | | | | | | | | +- 00 110010 -> 1101100 l * Prev last 7 bits
 | | | | | | | +--- 0 1101100 -> 1110010 r * Prev 7 + 0 bits
 | | | | | | +----- 1110010 1 -> 1101111 o * Last 1 + prev 6
 | | | | | +------- 101111 10 -> 1010111 W * Last 2 + prev 5
 | | | | +--------- 10111 110 -> 1101111 o * Last 3 + prev 4
 | | | +----------- 1111 1101 -> 1101100 l * Last 4 + prev 3
 | | +------------- 100 11011 -> 1101100 l * Last 5 + prev 2
 | +--------------- 00 110010 -> 1100101 e * Last 6 + prev 1
 +----------------- 1 1001000 -> 1001000 H * Last 7 bits
                                '------'
                                   |
                                   +----- GSM Table as binary

And 9 bytes "unpacked" become 10 characters.
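A sketch of that unpacking in C, assuming (as in the table above) that the GSM default alphabet coincides with ASCII for these letters:

#include <stdio.h>

int main(void) {
    /* The 9 packed octets dissected above. */
    unsigned char packed[] = {0xC8, 0x32, 0x9B, 0xFD, 0xBE,
                              0xBE, 0xE5, 0x6C, 0x32};
    unsigned int carry = 0; /* leftover high bits of the previous octet */
    int carry_bits = 0;     /* number of bits waiting in carry          */

    for (size_t i = 0; i < sizeof packed; i++) {
        /* Low bits of this octet complete the next septet. */
        putchar(((packed[i] << carry_bits) | carry) & 0x7F);
        carry = packed[i] >> (7 - carry_bits);
        if (++carry_bits == 7) { /* seven leftover bits form a full septet */
            putchar(carry & 0x7F);
            carry = 0;
            carry_bits = 0;
        }
    }
    putchar('\n'); /* prints: HelloWorld */
    return 0;
}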

answered Oct 03 '22 by Morpfh