Assuming IEEE-754 conformance, is a float guaranteed to be preserved when transported through a double?
In other words, will the following assert always be satisfied?
#include <assert.h>

int main() {
    float f = some_random_float();
    assert(f == (float)(double)f);
}
Assume that f could acquire any of the special values defined by IEEE, such as NaN and Infinity.
According to IEEE, is there a case where the assert will be satisfied, but the exact bit-level representation is not preserved after the transportation through double?
The code snippet is valid in both C and C++.
Yes, a double can represent all values a float can. Here is why: both formats consist of a sign, an exponent and a mantissa. The difference between float and double is that double has more space for the exponent (11 bits instead of 8) and for the mantissa (52 stored bits instead of 23).
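To make the "more space" argument concrete, here is a minimal sketch that checks it against the limits in <float.h>. The function name check_double_covers_float is made up for illustration, and the check assumes both types use the same radix (FLT_RADIX), as they do on IEEE-754 platforms.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: asserts that double has at least the precision
   and exponent range of float, which is what makes every float value
   representable as a double. */
static void check_double_covers_float(void)
{
    /* At least as many significand digits. */
    assert(DBL_MANT_DIG >= FLT_MANT_DIG);

    /* At least as large a maximum exponent. */
    assert(DBL_MAX_EXP >= FLT_MAX_EXP);

    /* At least as small a minimum normal exponent. */
    assert(DBL_MIN_EXP <= FLT_MIN_EXP);

    /* The lowest bit position of a subnormal float (2^-149 under IEEE)
       is still within reach of double (2^-1074 under IEEE). */
    assert(DBL_MIN_EXP - DBL_MANT_DIG <= FLT_MIN_EXP - FLT_MANT_DIG);
}
```

Calling check_double_covers_float() once at program start should pass silently on an IEEE-754 implementation.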
Scalars of type float are stored using four bytes (32 bits). The format follows the IEEE-754 standard. The mantissa holds the significant binary digits of the floating-point number, and the exponent holds the power of two by which they are scaled.
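As a rough illustration of that layout, the sketch below splits a float into its sign, exponent and mantissa fields. The struct float_fields and the function decode_float are names invented for this example, and the code assumes float is IEEE-754 binary32 (1 sign bit, 8 exponent bits biased by 127, 23 stored mantissa bits).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a binary32 value. */
struct float_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 stored fraction bits */
};

/* Split a float into its fields. Assumes float is IEEE-754 binary32. */
static struct float_fields decode_float(float f)
{
    uint32_t bits;
    struct float_fields out;

    assert(sizeof bits == sizeof f);  /* sanity check on the assumption */
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to read the bits */

    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}
```

For example, decode_float(1.0f) gives sign 0, exponent 127 (that is, 2^0 after removing the bias) and mantissa 0.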
float is mostly used in graphics libraries, where its smaller size gives higher processing throughput at the cost of range and precision. double is mostly used for general calculations, to reduce the errors that build up when decimal values are rounded off. float can still be used, but only when its roughly 6 to 9 significant decimal digits are enough.
The double type in C stores high-precision floating-point numbers (about 15 to 17 significant decimal digits). It can also hold values of much larger magnitude than float. A double typically occupies 64 bits, twice the size of a float, and the name stands for double precision.
You don't even need to assume IEEE. C89 says in 3.1.2.5:
The set of values of the type float is a subset of the set of values of the type double
And every other C and C++ standard says equivalent things. As far as I know, NaNs and infinities are "values of the type float", albeit values with some special-case rules when used as operands.
The fact that the float -> double -> float conversion restores the original value of the float follows (in general) from the fact that numeric conversions all preserve the value if it's representable in the destination type.
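If you want to convince yourself empirically, here is a sketch that walks over a sample of float bit patterns and checks the round trip. The helper names and the stride parameter are my own for illustration, and it assumes float is IEEE-754 binary32. NaN inputs are counted as preserved when the result is still a NaN, since == can never succeed for them.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Build a float from a raw bit pattern (assumes IEEE-754 binary32). */
static float float_from_bits(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Returns nonzero if f survives float -> double -> float as a value.
   A NaN counts as preserved when the result is also a NaN. */
static int round_trip_preserves_value(float f)
{
    double d = (double)f;
    float back = (float)d;

    if (isnan(f))
        return isnan(back);
    return back == f;
}

/* Count failures over every stride-th bit pattern.
   stride must be nonzero; stride == 1 checks all 2^32 patterns. */
static unsigned long count_round_trip_failures(uint32_t stride)
{
    unsigned long failures = 0;
    uint64_t bits;

    for (bits = 0; bits <= UINT32_MAX; bits += stride) {
        if (!round_trip_preserves_value(float_from_bits((uint32_t)bits)))
            failures++;
    }
    return failures;
}
```

A stride of 1 is exhaustive but takes a while. A larger odd stride gives a quick spot check that still touches normals, subnormals, infinities and NaNs.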
Bit-level representations are a slightly different matter. Imagine that there's a value of float that has two distinct bitwise representations. Then nothing in the C standard prevents the float -> double -> float conversion from switching one to the other. In IEEE that won't happen for "actual values" unless there are padding bits, but I don't know whether IEEE rules out a single NaN having distinct bitwise representations. NaNs don't compare equal to themselves anyway, so there's also no standard way to tell whether two NaNs are "the same NaN" or "different NaNs" other than maybe converting them to strings. The issue may be moot.
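One way to look at the bit-level question directly is to compare the object representations with memcpy instead of ==. This is only a sketch: bits_of and round_trip_preserves_bits are hypothetical names, and it assumes a 32-bit float with no padding bits. I would only expect it to return 1 reliably for non-NaN inputs. For NaNs the result can legitimately differ; for example, a signaling NaN is typically returned as a quiet NaN after a conversion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the object representation of a float into an integer.
   Assumes float is 32 bits wide with no padding bits. */
static uint32_t bits_of(float f)
{
    uint32_t b;
    assert(sizeof b == sizeof f);
    memcpy(&b, &f, sizeof b);
    return b;
}

/* Returns 1 if the bit pattern of f is unchanged after
   float -> double -> float, and 0 otherwise. */
static int round_trip_preserves_bits(float f)
{
    float back = (float)(double)f;
    return bits_of(back) == bits_of(f);
}
```

Unlike the == comparison, this distinguishes -0.0f from 0.0f, so it is a stricter check for ordinary values.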
One thing to watch out for is non-conforming modes of compilers, in which they keep super-precise values "under the covers", for example intermediate results left in floating-point registers and reused without rounding. I don't think that would cause your example code to fail, but as soon as you're doing floating-point == it's the kind of thing you start worrying about.
From C99:
6.3.1.5 Real floating types
1 When a float is promoted to double or long double, or a double is promoted to long double, its value is unchanged.
2 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type, if the value being converted can be represented exactly in the new type, it is unchanged...
I think this guarantees that a float->double->float conversion preserves the original float value.
The standard also defines the macros INFINITY and NAN in 7.12 Mathematics <math.h>:
4 The macro INFINITY expands to a constant expression of type float representing positive or unsigned infinity, if available; else to a positive constant of type float that overflows at translation time.
5 The macro NAN is defined if and only if the implementation supports quiet NaNs for the float type. It expands to a constant expression of type float representing a quiet NaN.
So there's provision for such special values, and the conversions may just work for them as well (including negative infinity and negative zero).
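A small sketch of how one might check that for the special values, using the classification macros from <math.h> rather than ==, so that NaN and the sign of zero are handled. The names through_double and check_special_values are invented for this example, and it assumes the implementation defines NAN (that is, it supports quiet NaNs for float).

```c
#include <assert.h>
#include <math.h>

/* Send a float through double and back. */
static float through_double(float f)
{
    return (float)(double)f;
}

/* Hypothetical helper: asserts that each special value keeps its
   class (infinity, zero, NaN) and its sign after the round trip. */
static void check_special_values(void)
{
    float pos_inf   = through_double(INFINITY);
    float neg_inf   = through_double(-INFINITY);
    float neg_zero  = through_double(-0.0f);
    float quiet_nan = through_double(NAN);

    assert(isinf(pos_inf) && !signbit(pos_inf));
    assert(isinf(neg_inf) && signbit(neg_inf));
    assert(neg_zero == 0.0f && signbit(neg_zero));
    assert(isnan(quiet_nan));  /* == would be false for a NaN */
}
```

Note that the assert from the question would still fire for a NaN input, because NaN == NaN is false even when the NaN itself came through intact.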