I read that C does not define whether a char is signed or unsigned, and the GCC page says it can be signed on x86 and unsigned on PowerPC and ARM.
Okay, I'm writing a program with GLib, which defines char as gchar (nothing more than that, just a standardized alias).
My question is: what about UTF-8? Does it use more than one byte of memory?
Say I have a variable:
unsigned char *string = "My string with UTF-8 encoding ~> çã";
See, if I declare my variable as unsigned, will I have only 127 values available (so my program would have to store more bytes of memory), or do the UTF-8 values just become negative too?
Sorry if I can't explain this correctly, but I think it is a bit complex.
NOTE: Thanks for all the answers.
I don't understand how it is normally interpreted.
I think that, as with ASCII, if I have signed and unsigned chars in my program, the strings will have different values, which leads to confusion; imagine that with UTF-8.
I've had a couple of requests to explain a comment I made.
The fact that a char type can default to either a signed or an unsigned type can be significant when you're comparing characters and expect a certain ordering. In particular, UTF-8 uses the high bit (assuming that char is an 8-bit type, which is true on the vast majority of platforms) to indicate that a character's code point requires more than one byte to be represented.
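As a small illustration of why that high bit matters, here is a minimal sketch (utf8_strlen is a hypothetical helper written for this answer, not a library function) that counts code points by distinguishing lead bytes from continuation bytes. It only works because each byte is inspected as an unsigned value in the range 0 to 255:

#include <stdio.h>

/* Continuation bytes in UTF-8 have the form 10xxxxxx (0x80-0xBF);
   every byte outside that range starts a new code point. The cast
   to unsigned char matters: where plain char is signed, a byte
   like 0xC3 reads as a negative number and the range tests below
   would go wrong. */
static size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char) *s;
        if (b < 0x80 || b >= 0xC0)  /* ASCII byte or lead byte */
            count++;
    }
    return count;
}

int main(void)
{
    /* "çã" spelled out as its UTF-8 bytes: 0xC3 0xA7, 0xC3 0xA3 */
    printf("%zu\n", utf8_strlen("\xC3\xA7\xC3\xA3"));  /* prints 2 */
    return 0;
}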
A quick and dirty example of the problem:
#include <stdio.h>

int main(void)
{
    /* 0xf0 does not fit in a signed char; on typical two's
       complement platforms the stored value comes out as -16. */
    signed char flag = 0xf0;
    /* The same bit pattern stored unsigned is 240. */
    unsigned char uflag = 0xf0;

    if (flag < (signed char) 'z') {
        printf("flag is smaller than 'z'\n");
    }
    else {
        printf("flag is larger than 'z'\n");
    }

    if (uflag < (unsigned char) 'z') {
        printf("uflag is smaller than 'z'\n");
    }
    else {
        printf("uflag is larger than 'z'\n");
    }

    return 0;
}
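On a typical two's complement platform, this prints "flag is smaller than 'z'" but "uflag is larger than 'z'": -16 < 122 while 240 > 122, so the very same bit pattern, 0xf0, sorts on opposite sides of 'z' depending on signedness.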
On most projects that I work on, the unadorned char type is avoided in favor of a typedef that explicitly specifies unsigned char: something like uint8_t from stdint.h, or

typedef unsigned char u8;
Generally, dealing with an unsigned char type seems to work well and cause few problems. The one area where I have seen occasional trouble is using a variable of that type to control a loop:

while (uchar_var-- >= 0) {
    // infinite loop: an unsigned value is always >= 0,
    // so this condition can never be false
}
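The usual fix is to make the comparison strict so the loop can terminate once the counter reaches zero. A minimal sketch (count is a placeholder name):

#include <stdio.h>

int main(void)
{
    unsigned char count = 5;

    /* Testing > 0 instead of >= 0 terminates correctly: the body
       runs with count values 4, 3, 2, 1, 0, and then the loop
       exits when the test sees 0. */
    while (count-- > 0) {
        printf("%u\n", (unsigned) count);
    }
    return 0;
}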
Two things:
Whether a char type is signed or unsigned won't affect your ability to translate UTF-8-encoded strings to and from whatever display string type you're using (WCHAR or whatnot). Don't worry about it, in other words: the UTF-8 bytes are just bytes, and whatever you're using as an encoder/decoder will do the right thing.
Some of your confusion may be that you're trying to do this:
unsigned char *string = "This is a UTF8 string";
Don't do this: you're mixing different concepts. A UTF-8 encoded string is just a sequence of bytes. C string literals (as above) were not really designed to represent this; they were designed to represent "ASCII-encoded" strings. Although in some cases (like mine here) they end up being the same thing, in your example in the question they may not, and in other cases they certainly won't. Load your Unicode strings from an external resource. In general, I'd be wary of embedding non-ASCII characters in a .c source file; even if the compiler knows what to do with them, other software in your toolchain may not.
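If you do need a UTF-8 string inside a source file, one portable option (a sketch of one approach, not the only one) is to spell out the encoded bytes with hex escapes so the file itself stays pure ASCII; it also makes the "just bytes" point above concrete:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The question's example string, with "çã" written as its
       UTF-8 bytes (0xC3 0xA7 for ç, 0xC3 0xA3 for ã) so that no
       non-ASCII characters appear in the source file. */
    const char *s = "My string with UTF-8 encoding ~> \xC3\xA7\xC3\xA3";

    /* strlen counts bytes, not characters: the two accented
       letters contribute four bytes between them. */
    printf("%zu bytes\n", strlen(s));
    return 0;
}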