Is there any accuracy gain when casting to double and back when doing float division?

What is the difference between the two following snippets?

    float f1 = some_number;
    float f2 = some_near_zero_number;
    float result;

    result = f1 / f2;

and:

    float f1 = some_number;
    float f2 = some_near_zero_number;
    float result;

    result = (double)f1 / (double)f2;

I am especially interested in very small f2 values which may produce +infinity when operating on floats. Is there any accuracy to be gained?

Some practical guidelines for using this kind of cast would be nice as well.

asked Feb 05 '15 by Piotr Lopusiewicz


2 Answers

I am going to assume IEEE 754 binary floating-point arithmetic, with a 32-bit float and a 64-bit double.

In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps.

Conversion from float to double is exact. For infinite, NaN, or zero divisor inputs it makes no difference. When the result is a finite number, the IEEE 754 standard requires it to be the real-number quotient f1/f2, rounded to the type used in the division.

If the division is done in float, that is the closest float to the exact result. If it is done in double, it is the closest double, with an additional rounding step when the quotient is assigned to result.

For most inputs, the two will give the same answer. Any overflow or underflow that did not happen on the division because it was done in double will happen instead on the conversion.

Even for a simple conversion, if the answer is very close to halfway between two float values, the two rounding steps may pick the wrong float. I had assumed this could also apply to division results. However, Pascal Cuoq, in a comment on this answer, called attention to a very interesting paper, "Innocuous Double Rounding of Basic Arithmetic Operations" by Pierre Roux, which proves that double rounding is harmless for several operations, including division, under conditions implied by the assumptions at the start of this answer.

answered Oct 09 '22 by Patricia Shanahan


If the result of an individual floating-point addition, subtraction, multiplication, or division is immediately stored to a float, there is no accuracy improvement from using double for the intermediate value. When operations are chained together, however, accuracy will often improve with a higher-precision intermediate type, provided one uses it consistently. In Turbo Pascal circa 1986, code like:

    Function TriangleArea(A, B, C: Single): Single;
    Var
      S: Extended;  (* S stands for Semi-perimeter *)
    Begin
      S := (A + B + C) * 0.5;
      TriangleArea := Sqrt((S - A) * (S - B) * (S - C) * S)
    End;

would extend all operands of floating-point operations to type Extended (an 80-bit float), and then convert them back to single or double precision when storing to variables of those types. Very nice semantics for numerical processing. Turbo C of that era behaved similarly, but rather unhelpfully failed to provide any numeric type capable of holding intermediate results. That failure led people to unfairly criticize the concept of a higher-precision intermediate-result type, when the real problem was that languages failed to support it properly.

Anyway, if one were to write the above method into a modern language like C#:

    public static float triangleArea(float a, float b, float c)
    {
        double s = (a + b + c) * 0.5;
        return (float)(Math.Sqrt((s - a) * (s - b) * (s - c) * s));
    }

the code would work well if the compiler happens to promote the operands of the addition to double before performing the computation, but that's something it may or may not do. If the compiler performs the calculation as float, precision may be horrid. When using the above formula to compute the area of an isosceles triangle with long sides of 16777215 and a short side of 4, for example, eager promotion will yield a correct result of 3.355443E+7 while performing the math as float will, depending upon the order of the operands, yield 5.033165E+7 [more than 50% too big] or 16777214.0 [more than 50% too small].

Note that code like the above may work perfectly in some environments but yield completely bogus results in others, and compilers will generally not give any warning about the situation.

Although individual operations on float values that will be immediately stored back to float can be done just as accurately in float as in double, eagerly promoting operands will often help considerably when operations are combined. In some cases, rearranging the operations can avoid the problems that the loss of promotion causes. For example, the above formula uses five additions, four multiplications, and a square root; rewriting it as:

Math.Sqrt((a+b+c)*(b-a+c)*(a-b+c)*(a-c+b))*0.25 

increases the number of additions to eight, but will work correctly even if they are performed at single precision.

answered Oct 09 '22 by supercat