The question "What is an unsigned char?" does a great job of discussing char vs. unsigned char vs. signed char in C.
However, it doesn't directly address what should be used for non-ASCII text. Thus if I have an array of bytes that represents text in some arbitrary character set like UTF-8 or Big5 (or sometimes ASCII), should I use an array of char or unsigned char?
I'm leaning towards using char because otherwise gcc gives me warnings about signedness of pointers when the array is ASCII and I use strlen. But I would like to know what is correct.
If your values are in the range [0,255] you should use unsigned char, but if they are in [-128,127] you should use signed char. Suppose you use the first range (unsigned char): then the result of 100+100 fits in the variable. With signed char, 200 is out of range, so storing it gives you an unexpected value.
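The 100+100 case can be sketched as follows, assuming an 8-bit char. The helper names are hypothetical and only for illustration. Note that the addition itself is carried out in int; the difference only shows up when the result is stored back into a char type.

```c
#include <assert.h>
#include <limits.h>

/* Assumption for this sketch: char is 8 bits wide. */
static_assert(CHAR_BIT == 8, "sketch assumes an 8-bit char");

/* Hypothetical helper: add two values and store the sum in an unsigned char.
   The sum is computed as int; converting it to unsigned char is well defined
   and wraps modulo UCHAR_MAX + 1. */
unsigned char store_sum_unsigned(int a, int b)
{
    return (unsigned char)(a + b);
}

/* Hypothetical helper: returns 1 if a + b can be stored in a signed char
   without going out of range, 0 otherwise. */
int sum_fits_signed_char(int a, int b)
{
    int sum = a + b;
    return sum >= SCHAR_MIN && sum <= SCHAR_MAX;
}
```

With these helpers, store_sum_unsigned(100, 100) keeps the value 200, while sum_fits_signed_char(100, 100) reports that the same sum does not fit in a signed char.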
The values above 127 are commonly known as extended ASCII. A signed char (or a plain char on platforms where char is signed) cannot hold them as positive values; they show up as negative numbers instead. An unsigned char can hold the extended part, as its range is 0 to 255.
unsigned char is a character data type that uses all 8 bits for the value, with no sign bit (unlike signed char). This means the range of unsigned char is 0 to 255.
An unsigned type can only represent positive values (and zero), whereas a signed type can represent both positive and negative values (and zero). In the case of an 8-bit char, this means that an unsigned char variable can hold a value in the range 0 to 255, while a signed char has the range -128 to 127.
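The ranges above can be checked against the limits macros from limits.h. This is only a sketch: the static assertions spell out the assumptions (8-bit char, two's complement), and plain_char_is_signed is a hypothetical helper name.

```c
#include <assert.h>
#include <limits.h>

/* Assumptions for this sketch: 8-bit char and two's complement.
   Compilation fails on a platform where they do not hold. */
static_assert(CHAR_BIT == 8, "sketch assumes an 8-bit char");
static_assert(UCHAR_MAX == 255, "unsigned char covers 0 to 255");
static_assert(SCHAR_MIN == -128 && SCHAR_MAX == 127,
              "signed char covers -128 to 127");

/* Hypothetical helper: returns 1 if plain char is signed on this platform,
   0 if it is unsigned. The C standard leaves this choice to the implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```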
Use plain char to represent characters. Use signed char when you want a signed integer type that covers values from -127 to +127. Use unsigned char when you want an unsigned integer type that covers values from 0 to 255.
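One way to combine these rules for non-ASCII text is to keep the text in plain char (so it works with the string.h functions without signedness warnings) and convert to unsigned char only at the point where the numeric byte value matters. A minimal sketch, with count_non_ascii_bytes as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes with the high bit set (values 128..255)
   in a NUL-terminated string stored as plain char. */
size_t count_non_ascii_bytes(const char *text)
{
    assert(text != NULL);

    size_t count = 0;
    for (const char *p = text; *p != '\0'; ++p) {
        /* Convert to unsigned char before comparing, so the check behaves
           the same whether plain char is signed or unsigned. */
        unsigned char byte = (unsigned char)*p;
        if (byte >= 0x80) {
            ++count;
        }
    }
    return count;
}
```

The same conversion to unsigned char is what the ctype.h functions such as isalpha expect for their argument.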
The question you are asking is probably much broader than you expect.
To answer it directly, most implementations use a byte buffer as the underlying storage. In those terms, the standard uint8_t typedef is your best bet. That is primarily because most character sets use a variable number of bytes to store characters, so processing individual bytes is essential in the encoding and decoding process. It also simplifies conversion between different endiannesses.
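As an illustration of why a uint8_t buffer makes byte-order handling simple, here is a sketch that assembles a 16-bit unit (for example a UTF-16 code unit) from two bytes in either order. The function names are hypothetical, and the code does not depend on the host's own endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a 16-bit unit stored most significant byte first. */
uint16_t read_u16_be(const uint8_t *bytes)
{
    assert(bytes != NULL);
    return (uint16_t)(((uint16_t)bytes[0] << 8) | bytes[1]);
}

/* Hypothetical helper: read a 16-bit unit stored least significant byte first. */
uint16_t read_u16_le(const uint8_t *bytes)
{
    assert(bytes != NULL);
    return (uint16_t)(((uint16_t)bytes[1] << 8) | bytes[0]);
}
```

Because uint8_t is unsigned, the shifts and ORs never see a negative value, which is the usual pitfall when the same code is written against a signed char buffer.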
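In general it's incorrect to use strlen to count characters in anything other than ASCII or another single-byte code page (0-255 range). For a multi-byte encoding like Big5, UTF-8 or Shift-JIS it returns the number of bytes, not the number of characters, and for UTF-16 it stops at the first zero byte, which can occur in the middle of the text.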
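To make the difference concrete for UTF-8, the sketch below contrasts the byte length reported by strlen with a code point count. Both function names are hypothetical, and the counter assumes the input is valid UTF-8; it does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes before the terminating NUL.
   This is exactly what strlen reports. */
size_t utf8_byte_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string
   by skipping continuation bytes, which have the bit pattern 10xxxxxx.
   Assumes the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" ("héllo" in UTF-8), the byte length is 6 while the code point count is 5.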