Is a float guaranteed to be preserved when transported through a double in C/C++?

Tags: c++, c, floating-point, double, ieee-754

Assuming IEEE-754 conformance, is a float guaranteed to be preserved when transported through a double?

In other words, will the following assert always be satisfied?

    #include <assert.h>

    int main() {
        float f = some_random_float();
        assert(f == (float)(double)f);
    }

Assume that f could acquire any of the special values defined by IEEE, such as NaN and Infinity.

According to IEEE, is there a case where the assert will be satisfied, but the exact bit-level representation is not preserved after the transportation through double?

The code snippet is valid in both C and C++.

asked Feb 08 '13 13:02

Kristian Spangsege

2 Answers

You don't even need to assume IEEE. C89 says in 3.1.2.5:

The set of values of the type float is a subset of the set of values of the type double

And every other C and C++ standard says equivalent things. As far as I know, NaNs and infinities are "values of the type float", albeit values with some special-case rules when used as operands.

The fact that the float -> double -> float conversion restores the original value of the float follows (in general) from the fact that numeric conversions all preserve the value if it's representable in the destination type.
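The value-preservation argument can be checked empirically. The following is a minimal sketch, not a proof: it assumes `float` is 32 bits wide, and the helper names (`float_from_bits`, `value_survives_round_trip`, `count_failures`) are made up for illustration. NaN inputs are only checked for still being NaN, since `==` is never true for a NaN.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float.
   Assumption: float is exactly 32 bits wide on this platform. */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Returns 1 if f keeps its value across float -> double -> float.
   A NaN never compares equal to anything, so for NaN inputs the
   check is only that the result is still a NaN. */
static int value_survives_round_trip(float f) {
    float back = (float)(double)f;
    if (isnan(f)) {
        return isnan(back) ? 1 : 0;
    }
    return f == back;
}

/* Tests every stride-th bit pattern and returns how many fail.
   A stride of 1 covers all 2^32 patterns. */
static unsigned long count_failures(uint32_t stride) {
    unsigned long failures = 0;
    uint64_t bits;
    if (stride == 0) {
        stride = 1; /* avoid an endless loop */
    }
    for (bits = 0; bits <= UINT32_MAX; bits += stride) {
        if (!value_survives_round_trip(float_from_bits((uint32_t)bits))) {
            failures++;
        }
    }
    return failures;
}
```

Calling `count_failures(1)` sweeps every pattern; a larger stride gives a quick spot check. On a conforming implementation the expected count is zero, in line with the reasoning above.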

Bit-level representations are a slightly different matter. Imagine that there's a value of float that has two distinct bitwise representations. Then nothing in the C standard prevents the float -> double -> float conversion from switching one to the other. In IEEE that won't happen for "actual values" unless there are padding bits, but I don't know whether IEEE rules out a single NaN having distinct bitwise representations. NaNs don't compare equal to themselves anyway, so there's also no standard way to tell whether two NaNs are "the same NaN" or "different NaNs" other than maybe converting them to strings. The issue may be moot.
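To look at representations rather than values, the bytes can be copied out with `memcpy` and compared directly. This is a sketch under the assumption of a 32-bit `float` without padding bits; `bits_of`, `float_with_bits` and `round_trip_bits` are illustrative names, not part of any standard API. In the IEEE binary32 encoding a NaN is any pattern with an all-ones exponent and a nonzero fraction, so many distinct NaN patterns exist, and nothing here claims they come back unchanged: a signaling pattern such as `0x7FA00000` may return as a quiet NaN on some platforms.

```c
#include <stdint.h>
#include <string.h>

/* Copy the object representation of a float into an integer.
   Assumption: float is exactly 32 bits wide with no padding bits. */
static uint32_t bits_of(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Build a float from a raw 32-bit pattern (same assumption). */
static float float_with_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Send a bit pattern through float -> double -> float and report
   the bit pattern that comes out the other side. */
static uint32_t round_trip_bits(uint32_t bits) {
    float f = float_with_bits(bits);
    float back = (float)(double)f;
    return bits_of(back);
}
```

Comparing `round_trip_bits(p)` with `p` for ordinary numbers, zeros and infinities should show identical patterns on IEEE hardware; for NaN patterns the comparison is the experiment, and the outcome depends on the platform and compiler.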

One thing to watch out for is non-conforming modes of compilers, in which they keep super-precise values "under the covers", for example intermediate results left in floating-point registers and reused without rounding. I don't think that would cause your example code to fail, but as soon as you're doing floating-point == it's the kind of thing you start worrying about.
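A related, standard-sanctioned form of extra precision is reported by the C99 macro `FLT_EVAL_METHOD` from `<float.h>`. The sketch below only describes what the macro says; it does not detect non-conforming compiler modes, and the function name `describe_eval_method` is invented for this example.

```c
#include <float.h>

/* Describes how the implementation says it evaluates floating-point
   expressions, based on the C99 macro FLT_EVAL_METHOD. */
static const char *describe_eval_method(void) {
#ifdef FLT_EVAL_METHOD
    switch (FLT_EVAL_METHOD) {
    case 0:
        return "operations use the range and precision of their own type";
    case 1:
        return "float and double operations are evaluated as double";
    case 2:
        return "all operations are evaluated as long double";
    default:
        return "indeterminable or implementation-defined";
    }
#else
    return "FLT_EVAL_METHOD is not defined by this implementation";
#endif
}
```

A value other than 0 is a hint that intermediate results may carry more precision than their type suggests, which is when comparisons with `==` deserve a second look.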

answered Sep 29 '22 21:09

Steve Jessop

From C99:

6.3.1.5 Real floating types
1 When a float is promoted to double or long double, or a double is promoted to long double, its value is unchanged.
2 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type, if the value being converted can be represented exactly in the new type, it is unchanged...

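I think this guarantees that a float->double->float conversion preserves the original float value: the promotion leaves the value unchanged, and the demotion leaves it unchanged because that value is, by construction, exactly representable as a float.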
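The two quoted paragraphs are asymmetric: promotion always preserves the value, while demotion only preserves it when the value fits the narrower type. A small sketch of that asymmetry follows; the function names are hypothetical, and the inputs are assumed to be finite values within the range of `float`.

```c
/* float -> double -> float: double can represent every float value,
   so the demotion back to float is exact. */
static int float_survives_double(float f) {
    return (float)(double)f == f;
}

/* double -> float -> double: exact only when d happens to be
   representable as a float; otherwise the demotion rounds. */
static int double_survives_float(double d) {
    return (double)(float)d == d;
}
```

For example, `float_survives_double(0.1f)` holds, and so does `double_survives_float(0.5)`, but `double_survives_float(0.1)` does not, because the double nearest to 0.1 has more significant bits than a float can store.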

The standard also defines the macros INFINITY and NAN in 7.12 Mathematics <math.h>:

4 The macro INFINITY expands to a constant expression of type float representing positive or unsigned infinity, if available; else to a positive constant of type float that overflows at translation time.
5 The macro NAN is defined if and only if the implementation supports quiet NaNs for the float type. It expands to a constant expression of type float representing a quiet NaN.

So there is provision for such special values, and the conversions may just work for them as well (including negative infinity and negative zero).
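The special values can be probed with the classification macros from `<math.h>`. This is a sketch that assumes the implementation provides `INFINITY`, `NAN`, `fpclassify` and `signbit` (all C99); `class_and_sign_survive` is an illustrative name.

```c
#include <math.h>

/* Returns 1 if x and its round-tripped copy fall into the same
   floating-point class (zero, subnormal, normal, infinite, NaN)
   and carry the same sign bit. */
static int class_and_sign_survive(float x) {
    float back = (float)(double)x;
    return fpclassify(back) == fpclassify(x)
        && !signbit(back) == !signbit(x);
}
```

Calling it with `INFINITY`, `-INFINITY`, `0.0f` and `-0.0f` checks the cases mentioned above. For `NAN` the class check is the meaningful part; whether the sign bit of a NaN is kept is left as something to observe on a given platform rather than a guarantee.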

answered Sep 29 '22 19:09

Alexey Frunze
