
C: char vs. unsigned char for non-ASCII text data

The question "What is an unsigned char?" does a great job of discussing char vs. unsigned char vs. signed char in C.

However, it doesn't directly address what should be used for non-ASCII text. Thus if I have an array of bytes that represents text in some arbitrary character set like UTF-8 or Big5 (or sometimes ASCII), should I use an array of char or unsigned char?

I'm leaning towards using char because otherwise gcc gives me warnings about signedness of pointers when the array is ASCII and I use strlen. But I would like to know what is correct.

Craig S. Anderson asked Oct 24 '14 03:10

2 Answers

Use plain char to represent characters. Use signed char when you want a small signed integer type that covers at least -127 to +127. Use unsigned char when you want a small unsigned integer type that covers at least 0 to 255.
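As a minimal sketch of that advice (the helper name count_non_ascii_bytes is hypothetical, not from the question), the text can stay in plain char so that strlen and the other string functions accept it without signedness warnings, and each byte is converted to unsigned char only at the point where its numeric value is inspected:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the bytes outside the 7-bit ASCII range
   in a NUL-terminated string stored as plain char. */
size_t count_non_ascii_bytes(const char *text)
{
    size_t count = 0;
    size_t len = strlen(text);  /* text is plain char, so no pointer signedness warning */

    for (size_t i = 0; i < len; i++) {
        /* Convert before comparing: the value is then 0..255
           whether plain char is signed or unsigned on this platform. */
        unsigned char byte = (unsigned char)text[i];
        if (byte >= 0x80)
            count++;
    }
    return count;
}
```

Without the conversion, a comparison such as text[i] >= 0x80 would never be true on a platform where plain char is signed, because those bytes read back as negative values.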

Dr. Debasish Jana answered Sep 20 '22 06:09


The question you are asking is probably much broader than you expect.

To answer it directly: most implementations use a "byte" as the underlying buffer element. In those terms, the standard uint8_t typedef (from <stdint.h>) is your best bet. That is primarily because most character sets use a variable number of bytes to store a character, so processing individual bytes is essential when encoding and decoding. It also simplifies conversion between different byte orders ("endianness").
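To illustrate what byte-level processing looks like, here is a small sketch, assuming UTF-8 as the encoding. The function name utf8_sequence_length is made up for this example. It reads the lead byte of a character as a uint8_t and reports how many bytes the encoded character occupies:

```c
#include <stdint.h>

/* Hypothetical helper: returns the length in bytes (1..4) of the UTF-8
   sequence that starts with the given lead byte, or 0 if the byte cannot
   start a sequence. It only looks at the bit pattern; it does not reject
   overlong forms or lead bytes beyond the Unicode range. */
int utf8_sequence_length(uint8_t lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx continuation byte, or invalid */
}
```

Because uint8_t is unsigned, the masks and comparisons work on the values 0 to 255 directly. On mainstream platforms uint8_t is a typedef for unsigned char, so a char buffer can be read through it after a cast.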

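In general, it's incorrect to use strlen to count characters in anything other than ASCII or another single-byte code page (0-255 range): strlen counts bytes up to the first zero byte. For multi-byte encodings like Big5, UTF-8 or Shift-JIS the result is a byte count, not a character count, and for UTF-16 it is wrong altogether because the encoded data can contain zero bytes.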
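The gap between bytes and characters can be shown with a short sketch, assuming the buffer holds valid NUL-terminated UTF-8. The name utf8_count_code_points is hypothetical. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx), which gives the number of code points:

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated string that
   is assumed to be valid UTF-8. Continuation bytes (10xxxxxx) are skipped,
   so each multi-byte character is counted once, at its lead byte. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the euro sign encoded as the three bytes "\xE2\x82\xAC", strlen reports 3 while this function reports 1. Note that a code point is still not always what a user perceives as one character (combining marks are separate code points), so this is only a first approximation.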

Petr Abdulin answered Sep 18 '22 06:09
