What is the difference between the following two?
float f1 = some_number;
float f2 = some_near_zero_number;
float result;

result = f1 / f2;
and:
float f1 = some_number;
float f2 = some_near_zero_number;
float result;

result = (double)f1 / (double)f2;
I am especially interested in very small f2 values which may produce +infinity when operating on floats. Is there any accuracy to be gained?
Some practical guidelines for using this kind of cast would be nice as well.
double is more precise than float: it is stored in 64 bits, double the number of bits a float uses.
double has roughly twice the precision of float. float is a 32-bit IEEE 754 single-precision floating-point number: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand (plus an implicit leading bit). float gives about 7 decimal digits of precision.
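As a rough illustration, here is a minimal C# sketch (not from the original answers; the constant -6.25f is an arbitrary test value, and a modern .NET console project with System available is assumed) that pulls the three fields apart:

// Inspect the 1/8/23-bit layout of a float.
// BitConverter.SingleToInt32Bits exists on modern .NET;
// on older frameworks BitConverter.GetBytes can be used instead.
int bits = BitConverter.SingleToInt32Bits(-6.25f);

int sign     = (bits >> 31) & 0x1;     // 1 sign bit
int exponent = (bits >> 23) & 0xFF;    // 8 exponent bits (biased by 127)
int mantissa = bits & 0x7FFFFF;        // 23 significand bits (implicit leading 1 not stored)

Console.WriteLine($"{sign} {exponent} 0x{mantissa:X}");  // 1 129 0x480000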
Both double and float can be used to represent floating-point numbers in Java. double is preferred over float when a more precise and accurate result is required: double gives about 15 to 16 decimal digits of precision, while float gives only around 6 to 7.
float and double differ in the number of decimal digits they can hold accurately: float can hold about 7, while double can hold about 15. Let's see an example to demonstrate this.
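A minimal C# sketch (the literal below is an arbitrary illustration, not from the original post):

// A float keeps only about 7 significant decimal digits; a double keeps about 15-16.
float  f = 1.23456789012345678f;
double d = 1.23456789012345678;

Console.WriteLine(f.ToString("G9"));   // e.g. 1.23456788 (digits beyond ~7 are lost)
Console.WriteLine(d.ToString("G16"));  // e.g. 1.234567890123457 (essentially all digits kept)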
I am going to assume IEEE 754 binary floating-point arithmetic, with float being 32 bits and double being 64 bits.
In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps.
Conversion from float to double is exact. For infinite, NaN, or zero divisor inputs it makes no difference. Given a finite-number result, the IEEE 754 standard requires the result to be the real-number quotient f1/f2, rounded to the type being used in the division.
If it is done as a float division, that is the closest float to the exact result. If it is done as a double division, it will be the closest double, with an additional rounding step for the assignment to result.
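As a quick illustration of the two rounding paths, here is a hypothetical C# sketch (the values are mine, not from the answer):

float f1 = 1.0f, f2 = 3.0f;

float direct    = f1 / f2;                   // one rounding, straight to float
float viaDouble = (float)((double)f1 / f2);  // round to double, then round to float

Console.WriteLine(direct == viaDouble);      // True here: both paths land on the same float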
For most inputs, the two will give the same answer. Any overflow or underflow that did not happen on the division because it was done in double will happen instead on the conversion.
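For example (a hypothetical C# sketch with made-up values), a quotient that exceeds float's range overflows either way, just at a different point:

float f1 = 3.0e38f;   // near float's maximum (~3.4e38)
float f2 = 1.0e-5f;   // small divisor; the true quotient is about 3e43

float direct    = f1 / f2;                   // overflows during the float division
float viaDouble = (float)((double)f1 / f2);  // finite as a double, overflows on the cast back to float

Console.WriteLine(direct);     // float.PositiveInfinity
Console.WriteLine(viaDouble);  // float.PositiveInfinity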
For simple conversion, if the answer is very close to halfway between two float values, the two rounding steps may pick the wrong float. I had assumed this could also apply to division results. However, Pascal Cuoq, in a comment on this answer, has called attention to a very interesting paper, Innocuous Double Rounding of Basic Arithmetic Operations by Pierre Roux, claiming proof that double rounding is harmless for several operations, including division, under conditions that are implied by the assumptions I made at the start of this answer.
If the result of an individual floating-point addition, subtraction, multiplication, or division is immediately stored to a float, there will be no accuracy improvement from using double for intermediate values. In cases where operations are chained together, however, accuracy will often be improved by using a higher-precision intermediate type, provided that one is consistent in using it. In Turbo Pascal circa 1986, code like:
Function TriangleArea(A, B, C: Single): Single;
Var
  S: Extended;  (* S stands for Semi-perimeter *)
Begin
  S := (A + B + C) * 0.5;
  TriangleArea := Sqrt((S - A) * (S - B) * (S - C) * S)
End;
would extend all operands of floating-point operations to type Extended (80-bit float), and then convert them back to single or double precision when storing to variables of those types. Very nice semantics for numerical processing. Turbo C of that era behaved similarly, but rather unhelpfully failed to provide any numeric type capable of holding intermediate results; the failure of languages to provide a variable type which could hold intermediate results led to people unfairly criticizing the concept of a higher-precision intermediate-result type, when the real problem was that languages failed to support it properly.
Anyway, if one were to write the above method in a modern language like C#:
public static float triangleArea(float a, float b, float c)
{
    double s = (a + b + c) * 0.5;
    return (float)(Math.Sqrt((s - a) * (s - b) * (s - c) * s));
}
the code would work well if the compiler happens to promote the operands of the addition to double before performing the computation, but that's something it may or may not do. If the compiler performs the calculation as float, precision may be horrid. When using the above formula to compute the area of an isosceles triangle with long sides of 16777215 and a short side of 4, for example, eager promotion will yield a correct result of 3.355443E+7, while performing the math as float will, depending upon the order of the operands, yield 5.033165E+7 [more than 50% too big] or 16777214.0 [more than 50% too small].
Note that even though code like the above may work perfectly in some environments and yield completely bogus results in others, compilers will generally not give any warning about the situation.
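The effect is easy to reproduce. Here is a hypothetical C# sketch (the helper names AreaFloat and AreaDouble are mine, not from the answer) that forces the two behaviours, one keeping every intermediate in float and one explicitly promoting to double:

// Heron's formula with float-only intermediates vs. double intermediates.
static float AreaFloat(float a, float b, float c)
{
    float s = (a + b + c) * 0.5f;             // every intermediate rounded to float
    return (float)Math.Sqrt((s - a) * (s - b) * (s - c) * s);
}

static float AreaDouble(float a, float b, float c)
{
    double s = ((double)a + b + c) * 0.5;     // intermediates carried in double
    return (float)Math.Sqrt((s - a) * (s - b) * (s - c) * s);
}

// With this operand order:
// AreaFloat(16777215f, 16777215f, 4f)   -> 16777214    (badly wrong)
// AreaDouble(16777215f, 16777215f, 4f)  -> 3.355443E+7 (correct)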
Although individual operations on float which are going to be immediately stored to float can be done just as accurately with type float as they could be with type double, eagerly promoting operands will often help considerably when operations are combined. In some cases, rearranging operations may avoid problems caused by the lack of promotion. For example, the above formula uses five additions, four multiplications, and a square root; rewriting the formula as:
Math.Sqrt((a+b+c)*(b-a+c)*(a-b+c)*(a-c+b))*0.25
increases the number of additions to eight, but will work correctly even if they are performed at single precision.
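A brief sketch of that rearranged version (the helper name AreaStable is mine, not from the answer):

static float AreaStable(float a, float b, float c)
{
    // The additions, subtractions, and the four-factor product are all performed in float,
    // yet the result stays accurate because no rounded semi-perimeter is subtracted from
    // a nearly equal side, which is what amplified the error in the Heron version above.
    return (float)(Math.Sqrt((a + b + c) * (b - a + c) * (a - b + c) * (a - c + b)) * 0.25);
}

// AreaStable(16777215f, 16777215f, 4f) -> about 3.355443E+7, accurate to within roughly one float ulp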