Perhaps I'm overthinking this, as it seems like it should be a lot easier. I want to take a value of type int, such as is returned by fgetc(), and record it in a char buffer if it is not an end-of-file code. E.g.:
char buf;
int c = fgetc(stdin);
if (c < 0) {
/* handle end-of-file */
} else {
buf = (char) c; /* not quite right */
}
However, if the platform has signed default chars then the value returned by fgetc() may be outside the range of char, in which case casting or assigning it to (signed) char produces implementation-defined behavior (right?). Surely, though, there is tons of code out there that does exactly the equivalent of the example. Is it all relying on implementation-defined behavior and/or assuming 7-bit data?
It looks to me like if I want to be certain that the behavior of my code is defined by C to be what I want, then I need to do something like this:
buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
I think that produces defined, correct behavior whether default chars are signed or unsigned, and regardless even of the size of char. Is that right? And is it really necessary to do that to ensure portability?
fgetc() returns values in the range of unsigned char, or EOF. EOF is always < 0. Whether the system's char is signed or unsigned makes no difference.
C11dr 7.21.7.1 2
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).
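In other words, a non-EOF return value is always a non-negative value in the range 0 to UCHAR_MAX, so the portable idiom keeps the result in an int for the EOF test. A minimal sketch (assuming input on stdin, which is not part of the question's code):

#include <stdio.h>

int main(void) {
    int c;                               /* int, so EOF stays distinguishable */
    while ((c = fgetc(stdin)) != EOF) {  /* non-EOF values are 0..UCHAR_MAX */
        putchar(c);                      /* echo each byte back out */
    }
    return 0;
}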
The concern I have about the code below is that it looks to be 2's complement dependent and implies that the range of unsigned char and the range of char are equally wide. Both of these assumptions are certainly nearly always true today.
buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
[Edit per OP comment]
Let's assume fgetc() returns no more distinct characters than can fit in the range CHAR_MIN to CHAR_MAX. Then (c - (UCHAR_MAX + 1)) would be more portable if replaced with (c - CHAR_MAX + CHAR_MIN), because we do not know that (c - (UCHAR_MAX + 1)) is in range when c is CHAR_MAX + 1. (For example, with CHAR_MAX == 127 and UCHAR_MAX == 255, c == 128 gives 128 - 256 == -128, which is out of range on a system whose char minimum is -127.)
A system could exist that has a signed char range of -127 to +127 and an unsigned char range of 0 to 255 (5.2.4.2.1). But as fgetc() gets a character, it seems either to have dealt entirely in unsigned char or to have already limited itself to the smaller signed char range before converting to unsigned char and returning that value to the user. OTOH, if fgetc() returned 256 different characters, conversion to the narrower signed char range would not be portable regardless of the formula.
Practically, it's simple - the obvious cast to char always works.
But you're asking about portability...
I can't see how a truly portable solution could work. This is because the guaranteed range of char is -127 to 127, which is only 255 different values. So how could you translate the 256 possible return values of fgetc (excluding EOF) to a char without losing information?
The best I can think of is to use unsigned char and avoid char.
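For example (a minimal sketch, assuming the bytes are being collected into a fixed-size buffer from stdin):

#include <stdio.h>

int main(void) {
    unsigned char buf[256];   /* unsigned char can hold every non-EOF value */
    size_t n = 0;
    int c;
    while (n < sizeof buf && (c = fgetc(stdin)) != EOF) {
        buf[n++] = (unsigned char) c;   /* exact; no implementation-defined conversion */
    }
    printf("stored %zu bytes\n", n);
    return 0;
}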
With thanks to those who responded, and having now read the relevant portions of the C99 standard, I have come to agree with the somewhat surprising conclusion that storing an arbitrary non-EOF value returned by fgetc() as type char without loss of fidelity is not guaranteed to be possible. In large part, that arises from the possibility that char cannot represent as many distinct values as unsigned char can.
For their part, the stdio functions guarantee that if data are written to a (binary) stream and subsequently read back, then the read-back data will compare equal to the original data. That turns out to have much narrower implications than I at first thought, but it does mean that fputs() must output a distinct value for each distinct char it successfully outputs, and that whatever conversion fgets() applies to store input bytes as type char must accurately reverse the conversion, if any, by which fputs() would produce the input byte as its output. As far as I can tell, however, fputs() and fgets() are permitted to fail on any input they don't like, so it is not certain that fputs() maps every possible char value to an unsigned char.
Moreover, although fputs() and fgets() operate as if by performing sequences of fputc() and fgetc() calls, respectively, it is not specified what conversions they might perform between char values in memory and the underlying unsigned char values on the stream. If a platform's fputs() uses the standard integer conversion for that purpose, however, then the correct back-conversion is as I proposed:
int c = fgetc(stream);
char buf;
if (c >= 0) buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
That arises directly from the integer conversion rules, which specify that integer values are converted to an unsigned type by adding or subtracting the integer multiple of <target type>_MAX + 1 needed to bring the result into the range of the target type, supported by the constraints on the representation of integer types. Its correctness for that purpose does not depend on the specific representation of char values or on whether char is treated as signed or unsigned.
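As a quick illustration (a test sketch added here for clarity, assuming the forward conversion performed on output is the ordinary (unsigned char) integer conversion), the formula recovers every char value after a round trip through unsigned char:

#include <assert.h>
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* For every char value, convert to unsigned char (the assumed
       on-stream representation) and back via the proposed formula.
       Because c originates from a char here, the result is always
       in char's range, so the final cast loses nothing. */
    for (int ch = CHAR_MIN; ch <= CHAR_MAX; ch++) {
        int c = (unsigned char) ch;   /* forward conversion: 0..UCHAR_MAX */
        char back = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);
        assert(back == (char) ch);    /* back-conversion recovers the value */
    }
    puts("round trip OK for all char values");
    return 0;
}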
However, if char cannot represent as many distinct values as unsigned char, or if there are char values that fputs() refuses to output (e.g. negative ones), then there are possible values of c that could not have resulted from a char conversion in the first place. No back-conversion argument is applicable to such bytes, and there may not even be a meaningful sense of char values corresponding to them. In any case, whether the given conversion is the correct reverse conversion for data written by fputs() seems to be implementation-defined. It is certainly implementation-defined whether buf = (char) c will have the same effect, though it does on very many systems.
Overall, I am struck by just how many details of C I/O behavior are implementation defined. That was an eye-opener for me.