Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

simple floating-point numbers lose precision

I'm using Delphi XE2 Update 3. There are precision issue with even the simplest of floating-point numbers (like 3.7). Given this code (a 32-bit console app):

program Project1;

{$APPTYPE CONSOLE}
{$R *.res}

uses System.SysUtils;

var s: Single; d: Double; x: Extended;
begin
  Write('Size of Single  -----  ');  Writeln(SizeOf(Single));
  Write('Size of Double  -----  ');  Writeln(SizeOf(Double));
  Write('Size of Extended  ---  ');  Writeln(SizeOf(Extended));  Writeln;

  s := 3.7;  d := 3.7;  x := 3.7;

  Write('"s" is ');                  Writeln(s);
  Write('"d" is ');                  Writeln(d);
  Write('"x" is ');                  Writeln(x);                 Writeln;

  Writeln('Single Comparison');
  Write('"s > 3.7"  is  ');          Writeln(s > 3.7);
  Write('"s = 3.7"  is  ');          Writeln(s = 3.7);
  Write('"s < 3.7"  is  ');          Writeln(s < 3.7);           Writeln;

  Writeln('Double Comparison');
  Write('"d > 3.7"  is  ');          Writeln(d > 3.7);
  Write('"d = 3.7"  is  ');          Writeln(d = 3.7);
  Write('"d < 3.7"  is  ');          Writeln(d < 3.7);           Writeln;

  Writeln('Extended Comparison');
  Write('"x > 3.7"  is  ');          Writeln(x > 3.7);
  Write('"x = 3.7"  is  ');          Writeln(x = 3.7);
  Write('"x < 3.7"  is  ');          Writeln(x < 3.7);           Readln;
end.

I get this output:

Size of Single  -----  4
Size of Double  -----  8
Size of Extended  ---  10

"s" is  3.70000004768372E+0000
"d" is  3.70000000000000E+0000
"x" is  3.70000000000000E+0000

Single Comparison
"s > 3.7"  is  TRUE
"s = 3.7"  is  FALSE
"s < 3.7"  is  FALSE

Double Comparison
"d > 3.7"  is  TRUE
"d = 3.7"  is  FALSE
"d < 3.7"  is  FALSE

Extended Comparison
"x > 3.7"  is  FALSE
"x = 3.7"  is  TRUE
"x < 3.7"  is  FALSE

You can see extended is the only type that evaluates correctly. I thought precision was only an issue when using a complex floating-point number like 3.14159265358979323846, not something as simple as 3.7. The issue when using single kind of makes sense. But why doesn't double work?

like image 817
James L. Avatar asked May 14 '14 23:05

James L.


People also ask

Why do floating point numbers lose precision?

Floating-point numbers suffer from a loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there is an infinite amount of real numbers, even within a small range like 0.0 to 0.1.

Are floating point numbers precise?

A single-precision, floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.40282347E+38 to -1.17549435E-38, or from 1.17549435E-38 to 3.40282347E+38.

What is the main problem with floating point numbers?

The Problem Since real numbers cannot be represented accurately in a fixed space, when operating with floating-point numbers, the result might not be able to be fully represented with the required precision. This inaccuracy ends up as information lost.

What is the precision of a floating-point number?

1.22 Floating Point Numbers The most commonly used floating point standard is the IEEE standard. According to this standard, floating point numbers are represented with 32 bits (single precision) or 64 bits (double precision).


Video Answer


2 Answers

Required reading: What Every Computer Scientist Should Know About Floating-Point Arithmetic, David Goldberg.

The issue is not one of precision. Rather the issue is one of representability. First of all, let us re-cap that floating point numbers are used to represent real numbers. There are an infinite quantity of real numbers. Of course, the same can be said of integers. But the difference here is that within a particular range, there are a finite number of integers but an infinite number of real numbers. Indeed as was originally shown by Cantor, any finite interval of real numbers contains an uncountable number of real values.

So it is clear that we cannot represent all real numbers on a finite machine. So, which numbers can we represent? Well, that depends on the data type. Delphi floating point data types use binary representation. The single (32 bit) and double (64 bit) types adhere to the IEEE-754 standard. The extended (80 bit) type is an Intel specific type. In binary floating point a representable number has the form k2n where k and n are integers. Note that I am not claiming that all numbers of this form are representable. That is not possible because there are an infinite quantity of such numbers. Rather my point is that all representable numbers are of this form.

Some examples of representable binary floating point numbers include: 1, 0.5, 0.25, 0.75, 1.25, 0.125, 0.375. Your value, 3.7, is not representable as a binary floating point value.

What this means in relation to your code is that none of it is doing what you expect it to do. You are hoping to compare against the value 3.7. But instead you are comparing against the nearest exactly representably value to 3.7. As a matter of implementation detail, this nearest exactly representably value is in the context of extended precision. Which is why it appears that the version using extended does what you expect. However, do not take this to mean that your variable x is equal to 3.7. In fact it is equal to the nearest representable extended precision value to 3.7.

Rob Kennedy's most useful website can show you the closest representable values to a specific number. In the case of 3.7 these are:

3.7 = + 3.70000 00000 00000 00004 33680 86899 42017 73602 98112 03479 76684 57031 25
3.7 = + 3.70000 00000 00000 17763 56839 40025 04646 77810 66894 53125
3.7 = + 3.70000 00476 83715 82031 25

These are presented in the order extended, double, single. In other words these are the values of your variables x, d and s respectively.

If you look at these values, and compare them with the closest extended to 3.7 you will see why your program produces the output that it does. Both the single and double precision values here are greater than the extended. Which is what your program told you.

I don't want to make any blanket recommendations as to how to compare floating point values. The best way to do that always depends very critically on the specific problem. No blanket advice can be usefully given.

like image 146
David Heffernan Avatar answered Oct 27 '22 23:10

David Heffernan


Short answer: 0.7 cannot be represented exactly (binary floating point values are always fractions with denominator that is a power of 2.); the precision of the data type you're storing it in (and the one the compiler selects for the type of the constant you're comparing them to) can affect the representation of that number and have an effect on the comparison.

Moral: Never directly compare two floating point values for equality unless they're exactly the same data type and assigned the same exact value.

Obligatory link: What Every Computer Scientist Should Know About Floating-Point Arithmetic

Another link that might be helpful is to Delphi's Math.SameValue function, that allows you to compare two floating point values for approximate equality depending on a specific allowable delta (difference).

like image 42
Ken White Avatar answered Oct 28 '22 01:10

Ken White