
What is the best way to represent characters in C?

I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes. (In fact, I don't think of the char datatype as a character, but as a byte.)

But, if I understand correctly, string literals are signed chars (actually they're not, but see the update below), and the function fgetc() returns unsigned chars cast to int. So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters? Why does reading characters from a file have a different convention than literals?

I ask because I have some code in C that does string comparison between string literals and the contents of files, and having a signed char * vs. an unsigned char * might really make my code error-prone.

Update 1

Ok, as a few people pointed out (in answers and comments), string literals are in fact char arrays, not signed char arrays. That means I really should use char * for string literals, and not think about whether they are signed or unsigned. This makes me perfectly happy (until I have to start making conversions/comparisons with unsigned chars).

However, the important question remains: how do I read characters from a file and compare them to a string literal? The crux is the conversion from the int read by fgetc(), which explicitly reads an unsigned char from the file, to the char type, which is allowed to be either signed or unsigned.

Allow me to provide a more detailed example.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *someFile = fopen("ThePathToSomeRealFile.html", "r");
    assert(someFile);

    char substringFromFile[25];
    memset((void*)substringFromFile, 0, sizeof(substringFromFile));

    //Alright, the real example is to read the first few characters from the file
    //And then compare them to the string I expect
    const char *expectedString = "<!DOCTYPE";

    //Note: sizeof(expectedString) would give the size of a pointer,
    //so the length has to come from strlen()
    for( size_t counter = 0; counter < strlen(expectedString); ++counter )
    {
        //Read it as an integer, because the function returns an `int`
        const int oneCharacter = fgetc(someFile);
        if( ferror(someFile) )
            return EXIT_FAILURE;
        if( oneCharacter == EOF || feof(someFile) )
            break;

        assert(counter < sizeof(substringFromFile)/sizeof(*substringFromFile));

        //HERE IS THE PROBLEM:
        //I know the data contained in oneCharacter must be an unsigned char
        //Therefore, this is valid
        const unsigned char uChar = (unsigned char)oneCharacter;
        //But then how do I assign it to the char?
        substringFromFile[counter] = (char)uChar;
    }
    fclose(someFile);

    //and ultimately here's my goal
    int headerIsCorrect = strncmp(substringFromFile, expectedString, 9);

    if(headerIsCorrect == 0)
        return EXIT_SUCCESS;
    //else
    return EXIT_FAILURE;
}

Essentially, I know my fgetc() call is returning something that (after some error checking) is representable as an unsigned char. I know that char may or may not be an unsigned char. That means, depending on the implementation of the C standard, a cast to char may involve no reinterpretation at all. However, if the system implements char as signed, I have to worry about values that are representable by an unsigned char but not by a char (i.e. those values in the range (INT8_MAX, UINT8_MAX]).

tl;dr

The question is this: should I (1) copy the underlying data read by fgetc() (by casting pointers - don't worry, I know how to do that), or (2) convert down from unsigned char to char (which is only safe if I know that the values can't exceed INT8_MAX, or if those values can be ignored for whatever reason)?
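To make option (2) concrete, here is a minimal sketch (assuming the common two's-complement representation; the int-to-char narrowing below is formally implementation-defined when char is signed, but in practice reinterprets the bit pattern):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* Check that every byte value survives the unsigned-char -> char ->
       unsigned-char round trip. The first conversion is implementation-
       defined when char is signed; on two's-complement platforms it
       simply reinterprets the bit pattern. */
    for (int byte = 0; byte <= UCHAR_MAX; ++byte)
    {
        char stored = (char)byte;
        unsigned char back = (unsigned char)stored;
        if (back != byte)
            printf("round trip failed for %d\n", byte);
    }
    puts("round trip checked");
    return 0;
}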

asked Oct 19 '25 by xaviersjs


2 Answers

The historical reasons are (as I've been told, I don't have a reference) that the char type was poorly specified from the beginning.

Some implementations used "consistent integer types" where char, short, int and so on were all signed by default. This makes sense because it makes the types consistent with each other.

Other implementations used unsigned char for characters, since there never existed any symbol tables with negative indices (that would be stupid) and since they saw a need for more than 128 characters (a very valid concern).

By the time C got standardized properly, it was too late to change this, too many different compilers and programs written for them were already out on the market. So the signedness of char was made implementation-defined, for backwards compatibility reasons.

The signedness of char does not matter if you only use it to store characters/strings. It only matters when you involve the char type in arithmetic expressions or use it to store integer values - this is a very bad idea, as the sketch after the list below shows.

  • For characters/string, always use char (or wchar_t).
  • For any other form of 1 byte large data, always use uint8_t or int8_t.
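A small illustration of that pitfall: the branch taken below depends entirely on the implementation's choice of signedness for char.

#include <stdio.h>

int main(void)
{
    /* (char)0xFF is implementation-defined when char is signed;
       on two's-complement platforms it yields -1. */
    char c = (char)0xFF;

    /* c is promoted to int before the comparison, so the outcome
       depends on whether char is signed on this platform. */
    if (c == 0xFF)
        puts("char is unsigned here: c promoted to 255");
    else
        puts("char is signed here: c was sign-extended to -1");
    return 0;
}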

But, if I understand correctly, string literals are signed chars

No, string literals are char arrays.

the function fgetc() returns unsigned chars cast to int

Yes - per the standard, it returns the character as an unsigned char converted to an int. It is int because the return value must also be able to hold EOF, which is an integer constant and not a character constant.
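A minimal sketch of the usual read loop that follows from this (the file name is made up for illustration):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("input.txt", "r");  /* hypothetical file name */
    if (f == NULL)
        return 1;

    int c;  /* int, not char: must hold every unsigned char value plus EOF */
    while ((c = fgetc(f)) != EOF)
    {
        /* Safe: after the EOF check, c is in [0, UCHAR_MAX]. */
        unsigned char byte = (unsigned char)c;
        putchar(byte);
    }
    fclose(f);
    return 0;
}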

having a signed char * vs unsigned char * might really make my code error prone.

No, not really. Formally, this rule from the standard applies:

A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.

There exists no case where casting from pointer to signed char to pointer to unsigned char, or vice versa, would cause any alignment issues or other problems.
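For instance, this kind of round trip is always well-defined:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    char text[] = "hello";

    /* Reinterpret the same bytes as unsigned char; the alignment of
       char, signed char and unsigned char is identical, so this is safe. */
    unsigned char *bytes = (unsigned char *)text;
    printf("first byte: %d\n", bytes[0]);

    /* Converting back compares equal to the original pointer. */
    char *back = (char *)bytes;
    assert(back == text);
    return 0;
}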

answered Oct 21 '25 by Lundin


I know that a char is allowed to be signed or unsigned depending on the implementation. This doesn't really bother me if all I want to do is manipulate bytes.

If you're going to do comparisons or assign char values to other integer types, it should bother you.

But, if I understand correctly, string literals are signed chars

They are of type char[], so if char is equivalent to unsigned char on a given implementation, then all string literals on that implementation are effectively unsigned char[].

the function fgetc() returns unsigned chars cast to int.

That's correct, and it is required to avoid undesired sign extension.

So if I want to manipulate characters, is it preferred style to use signed, unsigned, or ambiguous characters?

For portability I'd advise following the practice adopted by various libc implementations: use char, but cast to unsigned char (char * to unsigned char *) before processing. This way implicit integer promotions won't turn characters in the range 0x80 to 0xFF into negative numbers of wider types.
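This is also what the <ctype.h> functions require: their argument must be representable as an unsigned char (or be EOF), so portable code casts before calling them. A short sketch of the pattern:

#include <ctype.h>
#include <stdio.h>

/* Uppercase a string in place. The cast to unsigned char avoids
   undefined behavior when a byte >= 0x80 is stored in a signed char. */
static void upcase(char *s)
{
    for (; *s != '\0'; ++s)
        *s = (char)toupper((unsigned char)*s);
}

int main(void)
{
    char word[] = "hello, world";
    upcase(word);
    puts(word);
    return 0;
}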

In short: (signed char)a < (signed char)b is NOT always equivalent to (unsigned char)a < (unsigned char)b. Here is an example:
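(The original answer linked to an external example; the sketch below makes the same point, assuming 8-bit, two's-complement char.)

#include <stdio.h>

int main(void)
{
    /* 0x80 is -128 as a signed char but 128 as an unsigned char
       (assuming 8-bit char); 0x7F is 127 either way. */
    unsigned char a = 0x80;
    unsigned char b = 0x7F;

    printf("as signed char:   %d\n", (signed char)a < (signed char)b);     /* prints 1 */
    printf("as unsigned char: %d\n", (unsigned char)a < (unsigned char)b); /* prints 0 */
    return 0;
}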

Why does reading characters from a file have a different convention than literals?

getc() needs a way to return EOF such that it can't be confused with any real char.

answered Oct 21 '25 by xaizek