
c++ (double)0.700 * int(1000) => 699 (Not the double precision issue)

using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

I tried several different casts for scaledvalue2, but I only got the desired result once I stored the product in a double variable and then converted that to an int. I can't explain why.

I know double precision is an issue (0.7 is actually stored as 0.6999999999999999555910790149937383830547332763671875), but I don't understand why one way works and the other does not.

I would expect both to fail if precision is a problem.

I DON'T NEED a solution to fix it, just a WHY (the problem IS fixed).

#include <cstdio>
#include <sstream>

int main()
{
    double value = 0.7;
    int scaleFactor = 1000;

    // Product stored in a double variable first, then converted to int.
    double doubleScaled = (double)scaleFactor * value;
    int scaledvalue1 = doubleScaled; // = 700

    // Same product converted to int without an intermediate variable.
    int scaledvalue2 = (double)((double)(scaleFactor) * value);  // = 699 ??

    // Product of two literals.
    int scaledvalue3 = (double)(1000.0 * 0.7);  // = 700

    std::ostringstream oss;
    oss << scaledvalue2;
    std::printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
      value,scaleFactor,doubleScaled,scaledvalue1,scaledvalue2,scaledvalue3,oss.str().c_str());

    return 0;
}

or in short:

double value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value);  // =  699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875);  // =  700
// scaledvalue_a = 699
// scaledvalue_b = 700

I can't figure out what is going wrong here.

Output :

convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699]

vendor_id : GenuineIntel
cpu family : 6
model : 54
model name : Intel(R) Atom(TM) CPU N2600 @ 1.60GHz

Ratman asked Nov 19 '22 13:11


2 Answers

This is going to be a bit handwaving; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.

The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.

double f();
double g();
double d = f() + g(); // 1
double dd1 = 1.6 * d; // 2
double dd2 = 1.6 * (f() + g()); // 3

On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations done on the x87 floating-point unit are in fact carried out in an 80-bit extended format (unless you set some switches that kill performance, as Java required), even though double and float are 64 bits and 32 bits, respectively. Builds that use SSE2 arithmetic instead, the usual default for 64-bit x86 targets, work directly in the 64-bit and 32-bit formats and do not show this effect. So for x87 arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.

But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line //1, the compiler must convert the sum back to 64 bits to store it into the variable d. Then the value of dd1, calculated in line //2, must be computed using the value that was stored into d, i.e., a 64-bit value, while the value of dd2, calculated in line //3, can be calculated using f() + g(), i.e., a full 80-bit value. Those extra bits can make a difference, and the value of dd1 might be different from the value of dd2.
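The effect of that forced store can be imitated in portable code by letting long double stand in for the 80-bit register, here with the question's numbers. This is only a sketch: it assumes long double is wider than double (true for GCC on x86, not for every compiler), and the function names are made up for illustration.

```cpp
#include <limits>

// Hypothetical helper: truncate after the product has been rounded to a
// 64-bit double, as happens when it is stored in a double variable.
int truncate_after_store(double value, int scale) {
    long double wide = static_cast<long double>(scale) * value; // stand-in for the register
    double stored = static_cast<double>(wide);                  // rounding forced by the store
    return static_cast<int>(stored);
}

// Hypothetical helper: truncate the wide product directly, with no
// intermediate rounding to double.
int truncate_wide(double value, int scale) {
    long double wide = static_cast<long double>(scale) * value;
    return static_cast<int>(wide);
}

// True when long double carries more significand bits than double,
// which is the assumption this sketch depends on.
bool long_double_is_wider() {
    return std::numeric_limits<long double>::digits >
           std::numeric_limits<double>::digits;
}
```

With value = 0.7 and scale = 1000, truncate_after_store gives 700, while truncate_wide gives 699 wherever long_double_is_wider() is true. That matches the 700 versus 699 split in the question.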

And often the compiler will hang on to the 80-bit value of f() + g() and use that instead of the value stored in d when it calculates the value of dd1. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>
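Whether a particular build promises to evaluate double expressions in a wider format can be checked with the standard FLT_EVAL_METHOD macro from <cfloat>. The helper name below is made up for illustration, and the macro only reports what the compiler claims, so treat the result as a hint, not as proof of what the generated code does.

```cpp
#include <cfloat>
#include <string>

// Hypothetical helper: describe the evaluation method the compiler reports.
std::string describe_eval_method() {
    switch (FLT_EVAL_METHOD) {
        case 0:  return "operations are evaluated in the precision of their type";
        case 1:  return "float and double operations are evaluated as double";
        case 2:  return "all operations are evaluated as long double";
        default: return "evaluation method is indeterminable or implementation-defined";
    }
}
```

With GCC, a build that does its arithmetic on the x87 unit typically reports 2, and a build that uses SSE2 arithmetic typically reports 0.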

For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.

Pete Becker answered Dec 18 '22 21:12


Since the x86 floating-point unit performs its computations in an extended-precision floating-point type (80 bits wide), the result can easily depend on whether the intermediate values were forcibly converted to double (a 64-bit floating-point type). In that respect, in non-optimized code it is not unusual to see compilers treat memory writes to double variables literally, but ignore "unnecessary" casts to double applied to temporary intermediate values.

In your example, the first part involves saving the intermediate result in a double variable

double doubleScaled = (double)scaleFactor * value; 
int scaledvalue1 = doubleScaled; // = 700

The compiler takes it literally and does indeed store the product in a double variable doubleScaled, which unavoidably requires converting the 80-bit product to double. Later that double value is read from memory again and then converted to int type.

The second part

int scaledvalue2 = (double)((double)(scaleFactor) * value);  // = 699 ??

involves conversions that the compiler might see as unnecessary (and they are indeed unnecessary from the point of view of the abstract C++ machine). The compiler ignores them, which means that the final int value is generated directly from the 80-bit product.

The presence of that intermediate conversion to double in the first variant (and its absence in the second one) is what causes that difference.
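The numbers bear this out. The double nearest to 0.7 is slightly below 0.7, so the exact product with 1000 is slightly below 700, and it becomes exactly 700 only when it is rounded to a double. The sketch below checks that arithmetic, again assuming that long double is wider than double; the function names are illustrative only.

```cpp
#include <cmath>

// Hypothetical helper: how far the wide product falls short of 700.
// The literal 0.7 is a double, so it has already been rounded.
long double shortfall_below_700() {
    long double wide = 1000.0L * 0.7;
    return 700.0L - wide;
}

// Half the gap between 700.0 and the next smaller double. A value that is
// closer to 700.0 than this rounds to 700.0 when converted to double.
long double half_step_below_700() {
    double below = std::nextafter(700.0, 0.0);
    return (700.0L - below) / 2;
}

// True when rounding the wide product to double lands exactly on 700.0.
bool rounds_to_700() {
    long double wide = 1000.0L * 0.7;
    return static_cast<double>(wide) == 700.0;
}
```

Where long double is wider than double, the shortfall is about 4.4e-14 and the half step is about 5.7e-14. The wide product is therefore below 700 (truncating it gives 699) yet close enough to round to 700.0 as a double.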

AnT answered Dec 18 '22 23:12