
Char C question about encoding signed/unsigned

Tags: c, char, utf-8

I read that C does not define whether a char is signed or unsigned, and the GCC documentation says that it can be signed on x86 and unsigned on PowerPC and ARM.

Okay, I'm writing a program with GLib, which defines char as gchar (nothing more than an alias, just for standardization).

My question is: what about UTF-8? Does it use more than one block of memory?

Say that I have a variable:

unsigned char *string = "My string with UTF8 encoding ~> çã";

If I declare my variable as unsigned, will I have only 127 values (so my program will need more blocks of memory to store the string), or does UTF-8 use negative values too?

Sorry if I can't explain it correctly, but I think that it is a bit complex.

NOTE: Thanks for all the answers.

I don't understand how it is interpreted normally.

I think that, as with ASCII, if I have both signed and unsigned chars in my program, the strings will have different values, which leads to confusion; imagine that with UTF-8.

drigoSkalWalker asked Mar 26 '10 15:03



2 Answers

I've had a couple of requests to explain a comment I made.

The fact that a char type can default to either a signed or unsigned type can be significant when you're comparing characters and expect a certain ordering. In particular, UTF-8 sets the high bit of a byte (assuming that char is an 8-bit type, which is true on the vast majority of platforms) to indicate that the byte belongs to a code point that requires more than one byte to be represented.

A quick and dirty example of the problem:

#include <stdio.h>

int main(void)
{
    /* 0xf0 does not fit in a signed char; the conversion is
       implementation-defined and typically yields -16 */
    signed char flag = (signed char) 0xf0;
    unsigned char uflag = 0xf0;  /* 240 */

    if (flag < (signed char) 'z') {
        printf("flag is smaller than 'z'\n");
    }
    else {
        printf("flag is larger than 'z'\n");
    }

    if (uflag < (unsigned char) 'z') {
        printf("uflag is smaller than 'z'\n");
    }
    else {
        printf("uflag is larger than 'z'\n");
    }
    return 0;
}

On most projects that I work on, the unadorned char type is typically avoided in favor of using a typedef that explicitly specifies an unsigned char. Something like the uint8_t from stdint.h or

typedef unsigned char u8;

Generally, dealing with an unsigned char type seems to work well and have few problems. The one area where I have seen occasional problems is when using something of that type to control a loop:

while (uchar_var-- >= 0) {
    // infinite loop: an unsigned value is never negative, and 0 wraps around to 255
}
Michael Burr answered Sep 29 '22 11:09



Two things:

  1. Whether a char type is signed or unsigned won't affect your ability to translate UTF-8-encoded strings to and from whatever display string type you're using (WCHAR or whatnot). Don't worry about it, in other words: the UTF-8 bytes are just bytes, and whatever you're using as an encoder/decoder will do the right thing.

  2. Some of your confusion may be that you're trying to do this:

    unsigned char *string = "This is a UTF8 string";
    

    Don't do this: you're mixing different concepts. A UTF-8 encoded string is just a sequence of bytes. C string literals (as above) were not really designed to represent this; they're designed to represent "ASCII-encoded" strings. Although in some cases (like mine here) they end up being the same thing, in the example in your question they may not be, and in other cases they certainly won't be. Load your Unicode strings from an external resource. In general I'd be wary of embedding non-ASCII characters in a .c source file; even if the compiler knows what to do with them, other software in your toolchain may not.

Ben Zotto answered Sep 29 '22 11:09
