
Undefined behaviour when using iostream read and signed char

Tags:

c++

My question is similar to this but a bit more specific. I am writing a function to read a 32-bit unsigned integer, stored in little-endian byte order, from an istream. In C something like this would work:

#include <stdio.h>
#include <inttypes.h>

uint_least32_t foo(FILE* file)
{
    unsigned char buffer[4];
    fread(buffer, sizeof(buffer), 1, file);

    uint_least32_t ret = buffer[0];
    ret |= (uint_least32_t) buffer[1] << 8;
    ret |= (uint_least32_t) buffer[2] << 16;
    ret |= (uint_least32_t) buffer[3] << 24;
    return ret;
}

But if I try to do something similar using an istream, I run into what I think is undefined behaviour:

#include <cstdint>
#include <istream>

std::uint_least32_t bar(std::istream& file)
{
    char buffer[4];
    file.read(buffer, sizeof(buffer));

    // The casts to unsigned char are to prevent sign extension on systems where
    // char is signed.
    std::uint_least32_t ret = (unsigned char) buffer[0];
    ret |= (std::uint_least32_t) (unsigned char) buffer[1] << 8;
    ret |= (std::uint_least32_t) (unsigned char) buffer[2] << 16;
    ret |= (std::uint_least32_t) (unsigned char) buffer[3] << 24;
    return ret;
}

I believe it is undefined behaviour on systems where char is signed, the representation is not two's complement, and char cannot represent the value -128, because such a char cannot hold 256 distinct values. foo works even if char is signed, because section 7.21.8.1 of the C11 standard (draft N1570) says that fread stores the data in an array of unsigned char, not char, and unsigned char must be able to represent every value from 0 to 255 inclusive.

Does bar really cause undefined behaviour when it tries to read the byte 0x80, and if so, is there a workaround that still uses std::istream?

Edit: The undefined behaviour I am referring to is caused by istream::read writing into buffer, not by the cast from buffer to unsigned char. For example, on a sign+magnitude machine where char is signed, 0x80 is negative zero, but negative zero and positive zero must always compare equal according to the standard. If that is the case, there are only 255 distinct signed char values and a char cannot represent every possible byte. The casts themselves will work, because converting a negative signed value to unsigned always adds UCHAR_MAX + 1 (section 4.7 of the draft C++11 standard, N3242).
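The conversion rule mentioned in the edit can be illustrated with a minimal sketch. The helper names to_byte and check_to_byte are hypothetical and not part of the original question; the sketch only shows that the cast gives the same byte value whether plain char is signed or unsigned.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper (not from the original post): map a plain char to the
// byte value it stands for, in the range 0..UCHAR_MAX.
inline unsigned to_byte(char c)
{
    // Conversion to unsigned char is modular, so a negative c becomes
    // c + (UCHAR_MAX + 1); non-negative values are unchanged.
    return static_cast<unsigned char>(c);
}

// Small self-check of the conversion rule.
inline void check_to_byte()
{
    assert(to_byte('\0') == 0u);
    assert(to_byte('A') == static_cast<unsigned>('A'));
    // static_cast<char>(-1) is -1 if char is signed and UCHAR_MAX if char is
    // unsigned; either way the resulting byte value is UCHAR_MAX.
    assert(to_byte(static_cast<char>(-1)) == static_cast<unsigned>(UCHAR_MAX));
}
```

This only covers the cast. Whether istream::read can store every byte value in a char in the first place is the separate question asked above.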

asked Nov 17 '14 07:11 by qbt937




1 Answer

I think I have the answer: bar does not cause undefined behaviour.

In the accepted answer of this question, R.. says:

On a non-twos-complement system, signed char will not be suitable for accessing the representation of an object. This is because either there are two possible signed char representations which have the same value (+0 and -0), or one representation that has no value (a trap representation). In either case, this prevents you from doing most meaningful things you might do with the representation of an object. For example, if you have a 16-bit unsigned integer 0x80ff, one or the other byte, as a signed char, is going to either trap or compare equal to 0.

Note that on such an implementation (non-twos-complement), plain char needs to be defined as an unsigned type for accessing the representations of objects via char to work correctly. While there's no explicit requirement, I see this as a requirement derived from other requirements in the standard.

This would seem to be the case because section 3.9 paragraph 2 of C++11 (draft N3242) says:

For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.
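As a rough illustration of the guarantee quoted above, the sketch below copies the bytes of an integer into an array of plain char and back with std::memcpy. The function names round_trip and check_round_trip are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy the underlying bytes of a trivially copyable
// value into an array of plain char, then copy them back into a new object.
inline std::uint_least32_t round_trip(std::uint_least32_t value)
{
    char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);

    std::uint_least32_t copy = 0;
    std::memcpy(&copy, bytes, sizeof copy);
    return copy;
}

// The quoted paragraph requires the original value to survive the round
// trip, including values whose bytes have the high bit set.
inline void check_round_trip()
{
    assert(round_trip(0u) == 0u);
    assert(round_trip(0x80FF0080u) == 0x80FF0080u);
}
```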

If char were signed and had multiple object representations for some value (such as 0 in sign+magnitude), then an object copied into a char array and back might not have the same value afterwards, because the char array could change to a different object representation. That would contradict the quote above, so char must be unsigned if the machine's signed char has multiple object representations for the same value representation (e.g. on a sign+magnitude machine both 0x80 and 0x00 would represent 0). This means that bar has defined behaviour, because the only case where it would be undefined requires char to be signed with an odd representation that would not satisfy the above quote from the standard.
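To try the conclusion on a particular implementation, here is a small sketch that feeds bar a stream containing the byte 0x80. It restates bar from the question with includes and std:: qualification added. The helper bar_from_bytes is hypothetical, and the sketch assumes 8-bit bytes and, like the original, does not check whether the read succeeded.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// bar from the question, restated so this sketch is self-contained.
inline std::uint_least32_t bar(std::istream& file)
{
    char buffer[4];
    file.read(buffer, sizeof(buffer));

    // Cast each char to unsigned char first to avoid sign extension.
    std::uint_least32_t ret = static_cast<unsigned char>(buffer[0]);
    ret |= static_cast<std::uint_least32_t>(static_cast<unsigned char>(buffer[1])) << 8;
    ret |= static_cast<std::uint_least32_t>(static_cast<unsigned char>(buffer[2])) << 16;
    ret |= static_cast<std::uint_least32_t>(static_cast<unsigned char>(buffer[3])) << 24;
    return ret;
}

// Hypothetical helper: run bar over an in-memory sequence of bytes.
inline std::uint_least32_t bar_from_bytes(const std::string& bytes)
{
    std::istringstream in(bytes);
    return bar(in);
}

// Builds the little-endian byte sequence 80 00 00 00 and decodes it.
inline std::uint_least32_t decode_0x80_example()
{
    std::string bytes;
    bytes.push_back(static_cast<char>(0x80));
    bytes.append(3, '\0');
    return bar_from_bytes(bytes);
}
```

A passing check on one platform does not prove the standard-level argument. It only shows that the byte 0x80 survives the trip through char on that implementation.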

answered Oct 13 '22 00:10 by qbt937