I know about the approximation issues with floating point numbers, so I understand how 4.5 can get rounded down to 4 if it was approximated as 4.4999999999999991. My question is: why is there a difference between 32 bit and 64 bit when using the same types?
In the code below I have two calculations. In 32 bit, the value of MyRoundValue1 is 4 and the value of MyRoundValue2 is 5. In 64 bit they are both 4. Shouldn't the results be consistent between 32 bit and 64 bit?
{$APPTYPE CONSOLE}
uses
SysUtils; // needed for IntToStr
const
MYVALUE1: Double = 4.5;
MYVALUE2: Double = 5;
MyCalc: Double = 0.9;
var
MyRoundValue1: Integer;
MyRoundValue2: Integer;
begin
MyRoundValue1 := Round(MYVALUE1);
MyRoundValue2 := Round(MYVALUE2 * MyCalc);
WriteLn(IntToStr(MyRoundValue1));
WriteLn(IntToStr(MyRoundValue2));
end.
Floating Point Numbers
Floats generally come in two flavours: “single” and “double” precision. Single precision floats are 32-bits in length while “doubles” are 64-bits. Due to the finite size of floats, they cannot represent all of the real numbers - there are limitations on both their precision and range.
Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
A double precision, floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.797693134862315E+308 to -2.225073858507201E-308, or from 2.225073858507201E-308 to 1.797693134862315E+308.
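As a quick check of these limits, here is a small sketch in Python. Python is used only for illustration, on the assumption that its float is an IEEE 754 double on your platform (true on typical systems, but not something the text above states). It prints the largest finite double, the smallest positive normal double, and the number of mantissa bits.

```python
import sys

# Assumption: Python's float is an IEEE 754 binary64 double on this platform.
largest = sys.float_info.max              # largest finite double
smallest_normal = sys.float_info.min      # smallest positive normal double
mantissa_bits = sys.float_info.mant_dig   # significand bits, including the implicit one

print(largest)
print(smallest_normal)
print(mantissa_bits)
```

The printed limits carry one more significant digit than the figures quoted above, which appear to be truncated.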
32-bit single precision, with an approximate range of 10^-101 to 10^90 and precision of 7 decimal digits.
In x87 this code:
MyRoundValue2 := Round(MYVALUE2 * MyCalc);
Is compiled to:
MyRoundValue2 := Round(MYVALUE2 * MyCalc);
0041C4B2 DD0508E64100     fld qword ptr [$0041e608]
0041C4B8 DC0D10E64100     fmul qword ptr [$0041e610]
0041C4BE E8097DFEFF       call @ROUND
0041C4C3 A3C03E4200       mov [$00423ec0],eax
The default control word for the x87 unit under the Delphi RTL performs calculations to 80 bit precision. So the floating point unit multiplies 5 by the closest 64 bit value to 0.9 which is:
0.90000 00000 00000 02220 44604 92503 13080 84726 33361 81640 625
Note that this value is greater than 0.9. And it turns out that when it is multiplied by 5 and rounded to the nearest 80 bit value, the result is greater than 4.5. Hence Round(MYVALUE2 * MyCalc) returns 5.
On 64 bit, the floating point math is done on the SSE unit, which does not use 80 bit intermediate values. It turns out that 5 times the closest double to 0.9, rounded to double precision, is exactly 4.5. Round uses banker's rounding under the default rounding mode, so an exact tie goes to the nearest even integer. Hence Round(MYVALUE2 * MyCalc) returns 4 on 64 bit.
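If you want to check the arithmetic yourself, here is a minimal sketch in Python using exact rational arithmetic. It is not Delphi and it does not run anything on the x87 unit; it only emulates rounding the exact product to a 64 bit mantissa (x87 extended) and to a 53 bit mantissa (double). The helper round_to_significand is a hypothetical name introduced for this illustration, and it only handles positive values.

```python
from decimal import Decimal
from fractions import Fraction


def round_to_significand(x: Fraction, bits: int) -> Fraction:
    """Round a positive rational to `bits` significant binary digits (ties to even)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)  # spacing of representable values near x
    return round(x / ulp) * ulp          # round() on a Fraction rounds half to even


nearest_double_to_0_9 = Fraction(0.9)    # exact value stored in the double
exact_product = 5 * nearest_double_to_0_9

as_extended = round_to_significand(exact_product, 64)  # x87 extended: 64 bit mantissa
as_double = round_to_significand(exact_product, 53)    # double: 53 bit mantissa

print(Decimal(0.9))                       # exact decimal expansion of the double
print(as_extended > Fraction(9, 2))       # is the extended result above 4.5?
print(as_double == Fraction(9, 2))        # is the double result exactly 4.5?
print(round(as_extended), round(as_double))
```

The exact product is 4.5 + 2^-53. That fits in a 64 bit mantissa, so the extended result stays just above 4.5 and rounds to 5. It is below half a unit in the last place of a double near 4.5, so the double result collapses to exactly 4.5, and the tie rounds to the even integer 4.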
You can persuade the 32 bit compiler to behave the same way as the 64 bit compiler by storing to a double rather than relying on intermediate 80 bit values:
{$APPTYPE CONSOLE}
const
MYVALUE1: Double = 4.5;
MYVALUE2: Double = 5;
MyCalc: Double = 0.9;
var
MyRoundValue1: Integer;
MyRoundValue2: Integer;
d: Double;
begin
MyRoundValue1 := Round(MYVALUE1);
d := MYVALUE2 * MyCalc;
MyRoundValue2 := Round(d);
WriteLn(MyRoundValue1);
WriteLn(MyRoundValue2);
end.
This program produces the same output as your 64 bit program.
Or you can force the x87 unit to use 64 bit intermediates.
{$APPTYPE CONSOLE}
uses
SysUtils;
const
MYVALUE1: Double = 4.5;
MYVALUE2: Double = 5;
MyCalc: Double = 0.9;
var
MyRoundValue1: Integer;
MyRoundValue2: Integer;
begin
Set8087CW($1232); // <-- precision control = double: round intermediates to 64 bit (53 bit mantissa)
MyRoundValue1 := Round(MYVALUE1);
MyRoundValue2 := Round(MYVALUE2 * MyCalc);
WriteLn(MyRoundValue1);
WriteLn(MyRoundValue2);
end.
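To see what the value $1232 actually selects, here is a small Python sketch that decodes the two relevant fields of an x87 control word: precision control (bits 8 to 9) and rounding control (bits 10 to 11). The function name decode_x87_control_word is hypothetical, and the comparison value $1332 is what I understand to be the usual Delphi default on 32 bit, so treat that as an assumption.

```python
def decode_x87_control_word(cw: int) -> dict:
    """Extract the precision-control and rounding-control fields of an x87 control word."""
    precision_names = {
        0: "single (24 bit mantissa)",
        1: "reserved",
        2: "double (53 bit mantissa)",
        3: "extended (64 bit mantissa)",
    }
    rounding_names = {0: "nearest", 1: "down", 2: "up", 3: "toward zero"}
    pc = (cw >> 8) & 0b11    # bits 8-9: precision control
    rc = (cw >> 10) & 0b11   # bits 10-11: rounding control
    return {"precision": precision_names[pc], "rounding": rounding_names[rc]}


forced = decode_x87_control_word(0x1232)    # the value passed to Set8087CW above
assumed_default = decode_x87_control_word(0x1332)  # assumed Delphi default

print(forced)
print(assumed_default)
```

The only difference between the two words is the precision control field, which is why this change makes the x87 unit round each intermediate result to a 53 bit mantissa. Note that the exponent range of the x87 registers is not reduced by this setting, so results can still differ from SSE in overflow and underflow cases.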