In the C standard library functions, the elements of strings are chars. Is there a good reason why char was chosen instead of unsigned char? Using unsigned char for 8-bit strings would have some, albeit small, advantages.
C provides three different character types:
char represents a character (which C also calls a "byte").
unsigned char represents a byte-sized pattern of bits, or an unsigned integer.
signed char represents a byte-sized signed integer.
It is implementation-defined whether char is a signed or an unsigned type, so I think the question amounts to either "why does char exist at all as this maybe-signed type?" or "why doesn't C require char to be unsigned?".
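To make that implementation-defined choice visible, here is a minimal sketch (not from the original answer) that uses CHAR_MIN from <limits.h> to report whether plain char is signed on a given implementation:
#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* CHAR_MIN is negative exactly when plain char is a signed type. */
    if (CHAR_MIN < 0)
        printf("plain char is signed here (range %d..%d)\n", CHAR_MIN, CHAR_MAX);
    else
        printf("plain char is unsigned here (range %d..%d)\n", CHAR_MIN, CHAR_MAX);
    return 0;
}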
The first thing to know is that Ritchie added the "char" type to the B language in 1971, and C inherited it from there. Prior to that, B was word-oriented rather than byte-oriented (so says the man himself, see "The Problems of B".)
With that done, the answer to both of my questions might be that early versions of C didn't have unsigned types.
Once char and the string-handling functions were established, changing them all to unsigned char would be a serious breaking change (i.e. almost all existing code would stop working), and one of the ways C has tried to cultivate its user base over the decades is by mostly avoiding catastrophic incompatible changes. So it would be surprising for C to make that change.
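As an illustration of the breakage (a sketch using a hypothetical my_strlen, not anything from the standard library): char * and unsigned char * are not compatible pointer types, so retyping the string functions would force a diagnostic and a cast at essentially every existing call site.
#include <stddef.h>

size_t my_strlen(const unsigned char *s);   /* hypothetical retyped prototype */

size_t my_strlen(const unsigned char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

void caller(void)
{
    char buf[] = "hello";
    /* my_strlen(buf); */                               /* constraint violation: incompatible pointer types */
    size_t n = my_strlen((const unsigned char *)buf);   /* every caller now needs a cast */
    (void)n;
}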
Given that char is going to be the character type, and that (as you observe) it makes a lot of sense for it to be unsigned, but that plenty of implementations already existed in which char was signed, I suppose that making the signedness of char implementation-defined was a workable compromise -- existing code would continue working. Provided that it was using char only as a character and not for arithmetic or order comparisons, it would also be portable to implementations where char is unsigned.
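Here is a sketch of the kind of arithmetic/order use that is not portable (the byte 0xE9 is just an illustrative Latin-1 'é', not something from the answer): the comparison below changes meaning with the signedness of char, and the <ctype.h> functions need an unsigned char cast to stay well-defined for such values.
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char c = (char)0xE9;    /* an 8-bit character code above 127 */

    if (c < 0)              /* order comparison: result depends on char's signedness */
        printf("char is signed on this implementation\n");

    /* The ctype functions take an int that must be representable as
       unsigned char (or be EOF), so a plain char must be cast first. */
    if (isalpha((unsigned char)c))
        printf("0xE9 is alphabetic in the current locale\n");

    return 0;
}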
Unlike some of C's age-old implementation-defined variations, implementers do still choose signed characters (Intel). The C standard committee cannot help but observe that some people seem to stick with signed characters for some reason. Whatever those people's reasons are, current or historical, C has to allow it because existing C implementations rely on it being allowed. So forcing char to be unsigned is far lower on the list of achievable goals than forcing int to be 2's complement, and C hasn't even done that.
A supplementary question is "why does Intel still specify char to be signed in its ABIs?", to which I don't know an answer, but I'd guess that they've never had an opportunity to do otherwise without massive disruption. Maybe they even like them.
Good question. As the standard does not define char to be either unsigned or signed (this is left to the implementation), I guess that the preference for char came from two angles:
char takes less time to type than unsigned char, making the prototypes of the string manipulation functions nicer to read and use.
The signedness of char is implementation-defined.
A cleaner solution to the problem you're describing would be to mandate that plain char must be unsigned.
The reason plain char may be either signed or unsigned is partly historical, and partly related to performance.
Very early versions of C didn't have unsigned types. Since ASCII only covers the range 0 to 127, it was assumed that there was no particular disadvantage in making char a signed type. Once that decision was made, some programmers might have written code that depends on that, and later compilers kept char as a signed type to avoid breaking such code.
Quoting a C Reference Manual from 1975, 3 years before the publication of K&R1:
Characters (declared, and hereinafter called, char) are chosen from the ASCII set; they occupy the right-most seven bits of an 8-bit byte. It is also possible to interpret chars as signed, 2's complement 8-bit numbers.
EBCDIC requires 8-bit unsigned char, but apparently EBCDIC-based machines weren't yet supported at that time.
As for performance, values of type char are implicitly converted, in many contexts, to int (assuming that int can represent all values of type char, which is usually the case). This is done via the "integer promotions". For example, this:
char ch = '0';
ch ++;
doesn't just perform an 8-bit increment. It converts the value of ch from char to int, adds 1 to the result, and converts the sum back from int to char to store it in ch. (The compiler can generate any code that provably achieves the same effect.)
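Spelled out with explicit casts, the increment above is equivalent to the following sketch of what the abstract machine does (a real compiler will typically just emit an 8-bit add):
char ch = '0';
ch = (char)((int)ch + 1);   /* promote to int, add 1, convert back to char */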
Converting an 8-bit signed char to a 32-bit signed int requires sign extension. Converting an 8-bit unsigned char to a 32-bit signed int requires zero-filling the high-order 24 bits of the target. (The actual widths of these types may vary.) Depending on the CPU, one of these operations may be faster than the other. On some CPUs, making plain char signed might result in faster generated code.
(I don't know what the magnitude of this effect is.)
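To see the two conversions produce different values, here is a sketch (the byte 0xE7 is just an arbitrary value with the high bit set): the same bit pattern becomes a negative int under sign extension and a positive one under zero extension.
#include <stdio.h>

int main(void)
{
    char c = (char)0xE7;    /* a byte with the high bit set */
    int promoted = c;       /* conversion to int: sign- or zero-extended */

    /* Typically prints -25 where char is signed, 231 where char is unsigned. */
    printf("%d\n", promoted);
    return 0;
}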
No, there is no good reason. Nor is there any good reason why the signedness of char is implementation-defined. There exists no symbol table of any kind that uses negative indexing.
I think all of this originates from the incorrect, weird assumption that there are 8-bit integers and then there are "characters", where "characters" are some sort of magical, mysterious thing.
This is just one of many irrational flaws in the C standard, inherited from the days when dinosaurs walked the earth. The mysterious signedness of char adds nothing to the language, except perhaps a potential for signedness-related bugs caused by implicit integer promotions.
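A sketch of the kind of promotion bug alluded to here (the byte 0x80 is just an illustrative value with the high bit set): on a typical two's complement implementation with signed char, the comparison fails because the char promotes to -128 while the literal is the int 128.
#include <stdio.h>

int main(void)
{
    char c = (char)0x80;

    /* c is promoted to int before the comparison: -128 where char is
       signed (so the test is false), 128 where char is unsigned (true). */
    if (c == 0x80)
        printf("matched\n");
    else
        printf("did not match\n");
    return 0;
}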
EDIT:
Likely they let char be signed because they wanted it to behave just like the other integer types: short, int, long, which are all guaranteed by the standard to always be signed by default.
Working with unsigned integers might be faster or more effective, or generate smaller code, on some processors.
What type you actually end up with isn't exactly intuitive. Whenever you use char as an operand in an expression, it will always get promoted to int. Similarly, constant character literals 'a', '\n' etc. are of type int, not char. The C language forces the compiler to promote the types according to the implicit promotion rules (known as "integer promotions" and "the usual arithmetic conversions"/"balancing").
Once that promotion is done, the compiler may optimize the type into the one that is most effective, if it can prove that the optimization doesn't change the result.
If you have this code:
char a = 'a';
char b = 'b';
char c = a + b;
there are many obscure things going on between the lines. First of all, the literals 'a' and 'b' get silently truncated from int into signed/unsigned char. Then in the expression a + b, both a and b are implicitly promoted by the integer promotion rules into int. The addition is performed on two ints. Then the result is silently truncated back into a signed/unsigned char.
If the compiler can prove that optimization does not affect any of the above obscurities, it may replace it all with sane, 8-bit operations.
Because the standard doesn't define char as signed char.
There are three related types:
signed char, designed to store small signed integers
unsigned char, designed to store small unsigned integers
char, designed to store characters
I think that what you really want to know is: why isn't char an unsigned type?
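Here is a small sketch (C11, using _Generic purely for illustration) showing that these really are three distinct types, even though plain char has the same representation as one of the other two:
#include <stdio.h>

#define TYPE_NAME(x) _Generic((x),        \
        char:          "char",            \
        signed char:   "signed char",     \
        unsigned char: "unsigned char",   \
        default:       "other")

int main(void)
{
    char c = 'A';
    signed char sc = 'A';
    unsigned char uc = 'A';

    /* Prints: char / signed char / unsigned char */
    printf("%s / %s / %s\n", TYPE_NAME(c), TYPE_NAME(sc), TYPE_NAME(uc));
    return 0;
}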
There was a time when C didn't have unsigned types[1]. char was described as signed (see page 4), but even at that time, "the sign propagation feature disappears in other implementations", so it already behaved as signed in some places and unsigned in others. And I think that the implementations' choices simply reflected whatever was easiest for them (for instance on the PDP-11, for which the first C implementation was made, MOVB did the sign extension, and I don't remember there being a way to move a byte to a word without getting the sign extension).
Nowadays, most implementations I know use signed char. The only ones I know of which have unsigned char are those from IBM, where the support of EBCDIC mandates it (character codes for the characters in the basic character set have to be positive, and EBCDIC has most of them above 128).
[1] Pointers were used instead...