
Why do the string functions in C work on arrays with char instead of unsigned char?

Tags:

c

string

In the C standard library string functions, the elements of strings are char. Is there a good reason why char was chosen instead of unsigned char?

Using unsigned char for 8-bit strings has some, albeit small, advantages:

  • it is more intuitive: we usually memorize ASCII codes as unsigned values, and when working on binary data we prefer the unsigned range 0x00 to 0xFF rather than dealing with negative numbers, so with plain char we have to cast (see the sketch after this list).
  • working with unsigned integers might be faster/more effective, or generate smaller code on some processors.
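
A minimal sketch of the first point, assuming a typical implementation where plain char is signed and 8 bits wide: to get the 0x00 to 0xFF view of bytes stored in a char array, a cast to unsigned char is needed.

#include <stdio.h>

int main(void)
{
    char buf[] = "\xC3\xA9";   /* two bytes above 0x7F (UTF-8 for an accented letter) */

    /* If plain char is signed, buf[0] holds a negative value; passing it to
       %02X without a cast would sign-extend it and print something like
       FFFFFFC3.  Casting to unsigned char restores the 0x00..0xFF view. */
    printf("%02X %02X\n", (unsigned char)buf[0], (unsigned char)buf[1]);
    return 0;
}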
asked by vsz, Aug 24 '12


6 Answers

C provides three different character types:

  • char represents a character (which C also calls a "byte").
  • unsigned char represents a byte-sized pattern of bits, or an unsigned integer.
  • signed char represents a byte-sized signed integer.

It is implementation-defined whether char is a signed or an unsigned type, so I think the question amounts to either "why does char exist at all as this maybe-signed type?" or "why doesn't C require char to be unsigned?".
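
As an aside, the choice an implementation makes is visible through <limits.h>; a minimal sketch (not part of the original answer) that reports it:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* CHAR_MIN is 0 when plain char is an unsigned type, and negative
       (typically -128) when it is signed. */
#if CHAR_MIN < 0
    puts("plain char is signed on this implementation");
#else
    puts("plain char is unsigned on this implementation");
#endif
    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    return 0;
}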

The first thing to know is that Ritchie added the "char" type to the B language in 1971, and C inherited it from there. Prior to that, B was word-oriented rather than byte-oriented (so says the man himself, see "The Problems of B".)

With that history in mind, the answer to both of my questions might simply be that early versions of C didn't have unsigned types.

Once char and the string-handling functions were established, changing them all to unsigned char would be a serious breaking change (i.e. almost all existing code would stop working), and one of the ways C has tried to cultivate its user-base over the decades is by mostly avoiding catastrophic incompatible changes. So it would be surprising for C to make that change.

Given that char is going to be the character type, and that (as you observe) it makes a lot of sense for it to be unsigned, but that plenty of implementations already existed in which char was signed, I suppose that making the signedness of char implementation-defined was a workable compromise -- existing code would continue working. Provided that it was using char only as a character and not for arithmetic or order comparisons, it would also be portable to implementations where char is unsigned.

Unlike some of C's age-old implementation-defined variations, this one is not merely theoretical: implementers really do still choose signed characters (Intel, for example). The C standard committee cannot help but observe that some people seem to stick with signed characters for some reason. Whatever those people's reasons are, current or historical, C has to allow it because existing C implementations rely on it being allowed. So forcing char to be unsigned is far lower on the list of achievable goals than forcing int to be 2's complement, and C hasn't even done that.

A supplementary question is "why does Intel still specify char to be signed in its ABIs?", to which I don't know an answer but I'd guess that they've never had an opportunity to do otherwise without massive disruption. Maybe they even like them.

answered by Steve Jessop


Good question. As the standard does not define char to be either unsigned or signed (this is left to the implementation), I guess that the preference for char came from two angles:

  • char takes less time to type than unsigned char, making the prototypes of the string manipulation functions nicer to read and use.
  • Since the original ASCII spec was 7-bit, it didn't matter for the C spec's sake whether the valid values are in the range 0 to 127 or 0 to 255. Standardization of 8-bit character sets occurred much later.
answered by Dan Aloni


The signedness of char is implementation-defined.

A cleaner solution to the problem you're describing would be to mandate that plain char must be unsigned.

The reason plain char may be either signed or unsigned is partly historical, and partly related to performance.

Very early versions of C didn't have unsigned types. Since ASCII only covers the range 0 to 127, it was assumed that there was no particular disadvantage in making char a signed type. Once that decision was made, some programmers might have written code that depends on that, and later compilers kept char as a signed type to avoid breaking such code.

Quoting a C Reference Manual from 1975, 3 years before the publication of K&R1:

Characters (declared, and hereinafter called, char) are chosen from the ASCII set; they occupy the right-most seven bits of an 8-bit byte. It is also possible to interpret chars as signed, 2’s complement 8-bit numbers.

EBCDIC requires 8-bit unsigned char, but apparently EBCDIC-based machines weren't yet supported at that time.

As for performance, values of type char are implicitly converted, in many contexts, to int (assuming that int can represent all values of type char, which is usually the case). This is done via the "integer promotions". For example, this:

char ch = '0';
ch ++;

doesn't just perform an 8-bit increment. It converts the value of ch from char to int, adds 1 to the result, and converts the sum back from int to char to store it in ch. (The compiler can generate any code that provably achieves the same effect.)

Converting an 8-bit signed char to a 32-bit signed int requires sign extension. Converting an 8-bit unsigned char to a 32-bit signed int requires zero-filling the high-order 24 bits of the target. (The actual widths of these types may vary.) Depending on the CPU, one of these operations may be faster than the other. On some CPUs, making plain char signed might result in faster generated code.

(I don't know what the magnitude of this effect is.)
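
A small sketch of the difference, assuming a typical two's-complement machine with an 8-bit char and a 32-bit int:

#include <stdio.h>

int main(void)
{
    signed char   sc = (signed char)0x80;  /* bit pattern 1000 0000 */
    unsigned char uc = 0x80;               /* same bit pattern */

    /* When promoted to int, the signed char is sign-extended and the
       unsigned char is zero-filled, so the same bits yield different
       int values: typically -128 and 128. */
    printf("%d %d\n", sc, uc);
    return 0;
}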

answered by Keith Thompson


No, there is no good reason. Nor is there any good reason why the signedness of char is implementation-defined. There exists no symbol table of any kind that uses negative number indexing.

I think all of this originates from the incorrect, weird assumption that there are 8-bit integers and then there are "characters", where "characters" are some sort of magical, mysterious thing.

This is just one of many irrational flaws in the C standard, inherited from the days when dinosaurs walked the earth. The mysterious signedness of char adds nothing to the language, except perhaps a potential for signedness-related bugs caused by implicit integer promotions.
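
A minimal sketch of such a promotion-related bug, assuming an implementation where plain char is signed and 8 bits wide:

#include <stdio.h>

int main(void)
{
    char ch = '\xFF';   /* all bits set */

    /* If plain char is signed, ch is promoted to the int -1, so the
       comparison with 0xFF (the int 255) fails; if plain char is
       unsigned, ch is promoted to 255 and the comparison succeeds. */
    if (ch == 0xFF)
        puts("matched 0xFF");
    else
        puts("did not match 0xFF");
    return 0;
}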

EDIT:

Likely they let char be signed because they wanted it to behave just like the other integer types: short, int, and long, which are all guaranteed by the standard to always be signed by default.

working with unsigned integers might be faster/more effective, or generate smaller code on some processors.

The type you end up with isn't exactly intuitive. Whenever you use char as an operand in an expression, it gets promoted to int. Similarly, character constants such as 'a' and '\n' are of type int, not char. The C language forces the compiler to promote the types according to the implicit promotion rules (known as "integer promotions" and "the usual arithmetic conversions"/"balancing").
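
A small sketch of the character-constant point, assuming a typical implementation where int is 4 bytes:

#include <stdio.h>

int main(void)
{
    /* In C (unlike C++), a character constant such as 'a' has type int,
       so its size is sizeof(int), not sizeof(char). */
    printf("sizeof 'a'   = %zu\n", sizeof 'a');    /* typically 4 */
    printf("sizeof(char) = %zu\n", sizeof(char));  /* always 1 */
    return 0;
}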

Once that promotion is done, the compiler may optimize the type into the one that is most effective, if it can prove that the optimization doesn't change the result.

If you have this code:

char a = 'a';
char b = 'b';
char c = a + b;

there are many obscure things going on between the lines. First of all, the literals 'a' and 'b' get silently truncated from int into signed/unsigned char. Then in the expression a + b, both a and b are implicitly promoted by the integer promotion rules into int types. The addition is performed on two int. Then the result is silently truncated back into a signed/unsigned char.

If the compiler can prove that optimization does not affect any of the above obscurities, it may replace it all with sane, 8-bit operations.
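
A sketch of the example above with the intermediate values made visible, assuming an implementation with a signed 8-bit char:

#include <stdio.h>

int main(void)
{
    char a = 'a';     /* the int constant 'a' (97) converted to char */
    char b = 'b';     /* the int constant 'b' (98) converted to char */
    char c = a + b;   /* both promoted to int, added, converted back to char */

    /* 97 + 98 = 195 does not fit in a signed 8-bit char, so the stored
       value is implementation-defined (commonly -61); with an unsigned
       char it would simply be 195. */
    printf("a + b as int: %d, stored in char: %d\n", a + b, c);
    return 0;
}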

answered by Lundin


Because the standard doesn't define char as signed char.

answered by David Ranieri


There are three related types:

  • signed char, designed to store small signed integers
  • unsigned char, designed to store small unsigned integers
  • char, designed to store characters

I think that what you really want to know is: why isn't char an unsigned type?

There was a time when C didn't have unsigned types [1]. char was described as signed (see page 4), but even at that time "the sign propagation feature disappears in other implementations", so it already behaved as signed in some places and as unsigned in others. And I think the implementations' choices simply reflected what was easiest for them (for instance, on the PDP-11, for which the first C implementation was made, MOVB did sign extension, and I don't remember there being a way to move a byte into a word without getting the sign extension).

Nowadays, most implementations I know of use a signed char. The only ones I know of with an unsigned char are those from IBM, where support for EBCDIC mandates it (character codes for the characters in the basic character set have to be positive, and EBCDIC has most of them above 128).

[1] Pointers were used instead...

answered by AProgrammer