Bizarre floating-point behavior with vs. without extra variables, why?


When I run the following code in VC++ 2013 (32-bit, no optimizations):

#include <cmath>
#include <iostream>
#include <limits>

double mulpow10(double const value, int const pow10)
{
    static double const table[] =
    {
        1E+000, 1E+001, 1E+002, 1E+003, 1E+004, 1E+005, 1E+006, 1E+007,
        1E+008, 1E+009, 1E+010, 1E+011, 1E+012, 1E+013, 1E+014, 1E+015,
        1E+016, 1E+017, 1E+018, 1E+019,
    };
    return pow10 < 0 ? value / table[-pow10] : value * table[+pow10];
}

int main(void)
{
    double d = 9710908999.008999;
    int j_max = std::numeric_limits<double>::max_digits10;
    while (j_max > 0 && (
        static_cast<double>(
            static_cast<unsigned long long>(
                mulpow10(d, j_max))) != mulpow10(d, j_max)))
    {
        --j_max;
    }
    double x = std::floor(d * 1.0E9);
    unsigned long long y1 = x;                      // converted from the named double
    unsigned long long y2 = std::floor(d * 1.0E9);  // converted directly from the same expression
    std::cout
        << "x == " << x << std::endl
        << "y1 == " << y1 << std::endl
        << "y2 == " << y2 << std::endl;
}

I get

x  == 9.7109089990089994e+018
y1 == 9710908999008999424
y2 == 9223372036854775808

in the debugger.

I'm mindblown. Can someone please explain to me how the heck y1 and y2 have different values?


Update:

This only seems to happen when compiling with /arch:SSE2 or /arch:AVX, not with /arch:IA32 or /arch:SSE.