Using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3.
I have tried different typecasts for scaledvalue2,
but only when I stored the product in a double
variable and then converted that to an int
did I get the desired result. I can't explain why.
I know double precision is an issue (0.7 is actually stored as 0.6999999999999999555910790149937383830547332763671875), but I don't understand why one way works and the other doesn't.
I would expect both to fail if precision were the problem.
I DON'T NEED a solution to fix it, just a WHY (the problem IS fixed).
#include <cstdio>
#include <sstream>

int main()
{
    double value = 0.7;
    int scaleFactor = 1000;
    double doubleScaled = (double)scaleFactor * value;
    int scaledvalue1 = doubleScaled; // = 700
    int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
    int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
    std::ostringstream oss;
    oss << scaledvalue2;
    printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
           value,scaleFactor,doubleScaled,scaledvalue1,scaledvalue2,scaledvalue3,oss.str().c_str());
}
Or, in short:
double value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value); // = 699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875); // = 700
// scaledvalue_a = 699
// scaledvalue_b = 700
I can't figure out what is going wrong here.
Output :
convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699]
vendor_id : GenuineIntel
cpu family : 6
model : 54
model name : Intel(R) Atom(TM) CPU N2600 @ 1.60GHz
This is going to be a bit handwaving; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.
The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.
double f();
double g();
double d = f() + g(); // 1
double dd1 = 1.6 * d; // 2
double dd2 = 1.6 * (f() + g()); // 3
On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations done on the x87 floating-point unit in fact use 80 bits of precision (unless you set some switches that kill performance, as Java required), even though double and float are 64 bits and 32 bits, respectively. So for arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.
But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line // 1, the compiler must convert the sum back to 64 bits to store it into the variable d. Then the value of dd1, calculated in line // 2, must be computed using the value that was stored into d, i.e., a 64-bit value, while the value of dd2, calculated in line // 3, can be calculated using f() + g(), i.e., a full 80-bit value. Those extra bits can make a difference, and the value of dd1 might be different from the value of dd2.
And often the compiler will hang on to the 80-bit value of f() + g() and use that instead of the value stored in d when it calculates the value of dd1. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>
For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.
Since the x87 floating-point unit performs its computations in an extended-precision floating-point type (80 bits wide), the result might easily depend on whether the intermediate values were forcefully converted to double (64-bit floating-point type). In that respect, in non-optimized code it is not unusual to see compilers treat memory writes to double variables literally, but ignore "unnecessary" casts to double applied to temporary intermediate values.
In your example, the first part involves saving the intermediate result in a double variable:
double doubleScaled = (double)scaleFactor * value;
int scaledvalue1 = doubleScaled; // = 700
The compiler takes it literally and does indeed store the product in a double variable doubleScaled, which unavoidably requires converting the 80-bit product to double. Later that double value is read from memory again and then converted to int type.
The second part
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
involves conversions that the compiler might see as unnecessary (and they indeed are unnecessary from the point of view of the abstract C++ machine). The compiler ignores them, which means that the final int value is generated directly from the 80-bit product.
The presence of that intermediate conversion to double in the first variant (and its absence in the second one) is what causes that difference.