I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes. (In fact, I don't think of the char datatype as a character, but a byte).
But, if I understand, string literals are signed chars (actually they're not, but see the update below), and the function fgetc() returns unsigned chars casted into int. So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters? Why does reading characters from a file have a different convention than literals?
I ask because I have some code in C that does string comparison between string literals and the contents of files, but having a signed char * vs an unsigned char * might really make my code error-prone.
Update 1
OK, as a few people pointed out (in answers and comments), string literals are in fact char arrays, not signed char arrays. That means I really should use char * for string literals, and not think about whether they are signed or unsigned. This makes me perfectly happy (until I have to start making conversions/comparisons with unsigned chars).
However, the important question remains: how do I read characters from a file and compare them to a string literal? The crux of it is the conversion from the int read using fgetc(), which explicitly reads an unsigned char from the file, to the char type, which is allowed to be either signed or unsigned.
Allow me to provide a more detailed example.
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *someFile = fopen("ThePathToSomeRealFile.html", "r");
    assert(someFile);
    char substringFromFile[25];
    memset((void*)substringFromFile, 0, sizeof(substringFromFile));
    //Alright, the real example is to read the first few characters from the file
    //And then compare them to the string I expect
    const char *expectedString = "<!DOCTYPE";
    for( size_t counter = 0; counter < strlen(expectedString); ++counter )
    {
        //Read it as an integer, because the function returns an `int`
        const int oneCharacter = fgetc(someFile);
        if( ferror(someFile) )
            return EXIT_FAILURE;
        if( oneCharacter == EOF || feof(someFile) )
            break;

        assert(counter < sizeof(substringFromFile)/sizeof(*substringFromFile));
        //HERE IS THE PROBLEM:
        //I know the data contained in oneCharacter must be an unsigned char
        //Therefore, this is valid
        const unsigned char uChar = (unsigned char)oneCharacter;
        //But then how do I assign it to the char?
        substringFromFile[counter] = (char)oneCharacter;
    }
    //and ultimately here's my goal
    int headerIsCorrect = strncmp(substringFromFile, expectedString, 9);
    if(headerIsCorrect == 0)
        return EXIT_SUCCESS;
    //else
    return EXIT_FAILURE;
}
Essentially, I know my fgetc() function is returning something that (after some error checking) is code-able as an unsigned char. I know that char may or may not be an unsigned char. That means, depending on the implementation of the C standard, a cast to char may involve no reinterpretation at all. However, in the case that the system is implemented with a signed char, I have to worry about values that can be coded by an unsigned char but aren't code-able by char (i.e. those values in the range (INT8_MAX, UINT8_MAX]).
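To make that concrete, here is a tiny sketch of what worries me (the value of the char after conversion is implementation-defined when char is signed):

#include <stdio.h>

int main(void)
{
    //A byte value above INT8_MAX, as fgetc() might deliver it
    const int fromFgetc = 0xAB;                           //171
    char asChar = (char)fromFgetc;                        //implementation-defined if char is signed
    unsigned char asUnsigned = (unsigned char)fromFgetc;  //always 171

    //On a typical signed-char implementation this prints -85 and 171
    printf("%d %d\n", (int)asChar, (int)asUnsigned);
    return 0;
}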
tl;dr
The question is this: should I (1) copy the underlying data read by fgetc() (by casting pointers - don't worry, I know how to do that), or (2) cast down from unsigned char to char (which is only safe if I know that the values can't exceed INT8_MAX, or if those values can be ignored for whatever reason)?
The historical reasons are (as I've been told, I don't have a reference) that the char type was poorly specified from the beginning.
Some implementations used "consistent integer types" where char, short, int and so on were all signed by default. This makes sense because it makes the types consistent with each other.
Other implementations used unsigned characters, since there never existed any symbol tables with negative indices (that would be stupid) and since they saw a need for more than 128 characters (a very valid concern). By the time C got standardized properly, it was too late to change this; too many different compilers, and programs written for them, were already out on the market. So the signedness of char was made implementation-defined, for backwards-compatibility reasons.
The signedness of char does not matter if you only use it to store characters/strings. It only matters when you decide to involve the char type in arithmetic expressions or use it to store integer values - this is a very bad idea.
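A minimal sketch of the kind of surprise this causes (the result of the first test depends on whether char is signed on your compiler):

#include <stdio.h>

int main(void)
{
    char c = (char)0xA0;      //storing the integer value 160 in a char - the bad idea above
    unsigned char u = 0xA0;

    //With a signed char, c is promoted to a negative int here,
    //so the test fails even though the byte value is 160
    if (c > 127)
        printf("char: greater than 127\n");
    else
        printf("char: NOT greater than 127\n");

    //With unsigned char the arithmetic behaves as expected
    if (u > 127)
        printf("unsigned char: greater than 127\n");
    return 0;
}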
So if you want to store characters/strings, use char (or wchar_t). If you want to store integer values, use uint8_t or int8_t.
But, if I understand, string literals are signed chars
No, string literals are char arrays.
the function fgetc() returns unsigned chars casted into int
It returns the character read, as an unsigned char converted to an int. The return type is int because the value may also be EOF, which is an integer constant and not a character constant.
having a signed char * vs unsigned char * might really make my code error prone.
No, not really. Formally, this rule from the standard applies:
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.
There exists no case where casting from a pointer to signed char to a pointer to unsigned char, or vice versa, would cause any alignment issues or other problems.
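For example, this sketch (the helper name dump_bytes is just for illustration) inspects the bytes of an ordinary char string through an unsigned char pointer, which is perfectly well-defined:

#include <stdio.h>

//Print the bytes of a string as hex, viewed through an unsigned char pointer
static void dump_bytes(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    while (*p != '\0')
        printf("%02X ", *p++);
    printf("\n");
}

int main(void)
{
    dump_bytes("<!DOCTYPE");
    return 0;
}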
I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes.
If you're going to do comparisons or assign char values to other integer types, it should bother you.
But, if I understand, string literals are signed chars
They are of type char[], so if char is the same type as unsigned char on a given implementation, all string literals are effectively unsigned char[].
the function fgetc() returns unsigned chars casted into int.
That's correct, and it is required in order to avoid undesired sign extension.
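A small sketch of what sign extension would otherwise do: if fgetc() handed back a sign-extended signed char, a 0xFF byte would typically be indistinguishable from EOF:

#include <stdio.h>

int main(void)
{
    signed char byteAsSigned = (signed char)0xFF;  //typically -1
    int signExtended = byteAsSigned;               //promoted to -1
    int asFgetcReturnsIt = (unsigned char)0xFF;    //255, which is what fgetc() actually gives you

    printf("%d %d\n",
           signExtended == EOF,        //usually 1: collides with EOF
           asFgetcReturnsIt == EOF);   //0: cannot collide with EOF
    return 0;
}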
So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters?
For portability I'd advise following the practice adopted by various libc implementations: use char, but cast to unsigned char (char* to unsigned char*) before processing. This way implicit integer promotions won't turn characters in the range 0x80 to 0xff into negative numbers of wider types.
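This is also the form the <ctype.h> functions expect their argument in. A sketch of the pattern (the helper icmp() is just illustrative), comparing two strings case-insensitively while converting every character through unsigned char first:

#include <ctype.h>
#include <stdio.h>

//Case-insensitive string comparison; every character goes through
//unsigned char before processing, so bytes >= 0x80 never become
//negative ints under the integer promotions
static int icmp(const char *a, const char *b)
{
    size_t i = 0;
    for (;; ++i)
    {
        int ca = tolower((unsigned char)a[i]);
        int cb = tolower((unsigned char)b[i]);
        if (ca != cb || a[i] == '\0')
            return ca - cb;
    }
}

int main(void)
{
    printf("%d\n", icmp("<!doctype", "<!DOCTYPE") == 0);  //prints 1
    return 0;
}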
In short: (signed char)a < (signed char)b is NOT always equivalent to (unsigned char)a < (unsigned char)b. Here is an example.
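A minimal sketch of such a case (assuming 8-bit bytes; a byte like 0xE9 sorts before 'A' under the signed interpretation and after it under the unsigned one):

#include <stdio.h>

int main(void)
{
    char a = (char)0xE9;  //0xE9 is 'é' in Latin-1, for example
    char b = 'A';         //0x41

    //Signed interpretation: 0xE9 becomes a negative value (typically -23),
    //so it compares LESS than 'A'
    printf("signed:   %d\n", (signed char)a < (signed char)b);      //usually 1

    //Unsigned interpretation: 0xE9 is 233, which compares GREATER than 'A'
    printf("unsigned: %d\n", (unsigned char)a < (unsigned char)b);  //0
    return 0;
}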
Why does reading characters from a file have a different convention than literals?
getc() needs a way to return EOF such that it couldn't be confused with any real char.
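Which is why the result of getc()/fgetc() should be kept in an int until after the EOF check; a sketch of the canonical loop (reusing the file name from the question):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("ThePathToSomeRealFile.html", "r");
    if (!f)
        return 1;

    int c;  //int, not char: EOF must stay distinguishable from every byte value
    while ((c = fgetc(f)) != EOF)
        putchar(c);

    //With `char c;` instead, a 0xFF byte could end the loop early (signed char),
    //or the loop might never terminate (unsigned char, where c != EOF is always true)
    fclose(f);
    return 0;
}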