What is the difference between the following two?
float f1 = some_number;
float f2 = some_near_zero_number;
float result;

result = f1 / f2;
and:
float f1 = some_number;
float f2 = some_near_zero_number;
float result;

result = (double)f1 / (double)f2;
I am especially interested in very small f2 values which may produce +infinity when operating on floats. Is there any accuracy to be gained?
Some practical guidelines for using this kind of cast would be nice as well.
double is more precise than float: it is stored in 64 bits, double the number of bits a float uses.
double has roughly twice the precision of float. float is a 32-bit IEEE 754 single-precision floating-point number: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand (plus an implicit leading bit). float gives about 7 decimal digits of precision.
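As a rough illustration, here is a minimal C# sketch (not from the original answers; the constant -6.25f is an arbitrary test value, and a modern .NET console project with System available is assumed) that pulls the three fields apart:

// Inspect the 1/8/23-bit layout of a float.
// BitConverter.SingleToInt32Bits exists on modern .NET;
// on older frameworks BitConverter.GetBytes can be used instead.
int bits = BitConverter.SingleToInt32Bits(-6.25f);

int sign     = (bits >> 31) & 0x1;     // 1 sign bit
int exponent = (bits >> 23) & 0xFF;    // 8 exponent bits (biased by 127)
int mantissa = bits & 0x7FFFFF;        // 23 significand bits (implicit leading 1 not stored)

Console.WriteLine($"{sign} {exponent} 0x{mantissa:X}");  // 1 129 0x480000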
Both double and float can be used to represent floating-point numbers in Java. double is preferred over float when a more precise and accurate result is required: double gives about 15 to 16 decimal digits of precision, while float gives only around 6 to 7.
float and double differ in the number of decimal digits they can hold accurately: float can hold about 7, while double can hold about 15. Let's see an example to demonstrate this.
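A minimal C# sketch (the literal below is an arbitrary illustration, not from the original post):

// A float keeps only about 7 significant decimal digits; a double keeps about 15-16.
float  f = 1.23456789012345678f;
double d = 1.23456789012345678;

Console.WriteLine(f.ToString("G9"));   // e.g. 1.23456788 (digits beyond ~7 are lost)
Console.WriteLine(d.ToString("G16"));  // e.g. 1.234567890123457 (essentially all digits kept)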
I am going to assume IEEE 754 binary floating-point arithmetic, with float being 32 bits and double being 64 bits.
In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps.
Conversion from float to double is exact. For infinite, NaN, or zero divisor inputs it makes no difference. Given a finite-number result, the IEEE 754 standard requires the result to be the real-number quotient f1/f2, rounded to the type being used in the division.
If it is done as a float division, that is the closest float to the exact result. If it is done as a double division, it will be the closest double, with an additional rounding step for the assignment to result.
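As a quick illustration of the two rounding paths, here is a hypothetical C# sketch (the values are mine, not from the answer):

float f1 = 1.0f, f2 = 3.0f;

float direct    = f1 / f2;                   // one rounding, straight to float
float viaDouble = (float)((double)f1 / f2);  // round to double, then round to float

Console.WriteLine(direct == viaDouble);      // True here: both paths land on the same float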
For most inputs, the two will give the same answer. Any overflow or underflow that did not happen on the division because it was done in double will happen instead on the conversion.
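For example (a hypothetical C# sketch with made-up values), a quotient that exceeds float's range overflows either way, just at a different point:

float f1 = 3.0e38f;   // near float's maximum (~3.4e38)
float f2 = 1.0e-5f;   // small divisor; the true quotient is about 3e43

float direct    = f1 / f2;                   // overflows during the float division
float viaDouble = (float)((double)f1 / f2);  // finite as a double, overflows on the cast back to float

Console.WriteLine(direct);     // float.PositiveInfinity
Console.WriteLine(viaDouble);  // float.PositiveInfinity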
For simple conversion, if the answer is very close to halfway between two float values, the two rounding steps may pick the wrong float. I had assumed this could also apply to division results. However, Pascal Cuoq, in a comment on this answer, has called attention to a very interesting paper, Innocuous Double Rounding of Basic Arithmetic Operations by Pierre Roux, claiming proof that double rounding is harmless for several operations, including division, under conditions that are implied by the assumptions I made at the start of this answer.
If the result of an individual floating-point addition, subtraction, multiplication, or division is immediately stored to a float, there will be no accuracy improvement from using double for intermediate values. In cases where operations are chained together, however, accuracy will often be improved by using a higher-precision intermediate type, provided that one is consistent in using it. In Turbo Pascal circa 1986, code like:
Function TriangleArea(A, B, C: Single): Single;
Var
  S: Extended;  (* S stands for Semi-perimeter *)
Begin
  S := (A + B + C) * 0.5;
  TriangleArea := Sqrt((S - A) * (S - B) * (S - C) * S)
End;
would extend all operands of floating-point operations to type Extended (80-bit float), and then convert them back to single or double precision when storing to variables of those types. Very nice semantics for numerical processing. Turbo C of that era behaved similarly, but rather unhelpfully failed to provide any numeric type capable of holding intermediate results; the failure of languages to provide a variable type which could hold intermediate results led to people unfairly criticizing the concept of a higher-precision intermediate-result type, when the real problem was that languages failed to support it properly.
Anyway, if one were to write the above method in a modern language like C#:
public static float triangleArea(float a, float b, float c)
{
    double s = (a + b + c) * 0.5;
    return (float)(Math.Sqrt((s - a) * (s - b) * (s - c) * s));
}
the code would work well if the compiler happens to promote the operands of the addition to double before performing the computation, but that's something it may or may not do. If the compiler performs the calculation as float, precision may be horrid. When using the above formula to compute the area of an isosceles triangle with long sides of 16777215 and a short side of 4, for example, eager promotion will yield a correct result of 3.355443E+7, while performing the math as float will, depending upon the order of the operands, yield 5.033165E+7 [more than 50% too big] or 16777214.0 [more than 50% too small].
Note that even though code like the above may work perfectly in some environments and yield completely bogus results in others, compilers will generally not give any warning about the situation.
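The effect is easy to reproduce. Here is a hypothetical C# sketch (the helper names AreaFloat and AreaDouble are mine, not from the answer) that forces the two behaviours, one keeping every intermediate in float and one explicitly promoting to double:

// Heron's formula with float-only intermediates vs. double intermediates.
static float AreaFloat(float a, float b, float c)
{
    float s = (a + b + c) * 0.5f;             // every intermediate rounded to float
    return (float)Math.Sqrt((s - a) * (s - b) * (s - c) * s);
}

static float AreaDouble(float a, float b, float c)
{
    double s = ((double)a + b + c) * 0.5;     // intermediates carried in double
    return (float)Math.Sqrt((s - a) * (s - b) * (s - c) * s);
}

// With this operand order:
// AreaFloat(16777215f, 16777215f, 4f)   -> 16777214    (badly wrong)
// AreaDouble(16777215f, 16777215f, 4f)  -> 3.355443E+7 (correct)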
Although individual operations on float which are going to be immediately stored to float can be done just as accurately with type float as they could be with type double, eagerly promoting operands will often help considerably when operations are combined. In some cases, rearranging operations may avoid problems caused by the lack of promotion. For example, the above formula uses five additions, four multiplications, and a square root; rewriting the formula as:
Math.Sqrt((a+b+c)*(b-a+c)*(a-b+c)*(a-c+b))*0.25
increases the number of additions to eight, but will work correctly even if they are performed at single precision.
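A brief sketch of that rearranged version (the helper name AreaStable is mine, not from the answer):

static float AreaStable(float a, float b, float c)
{
    // The additions, subtractions, and the four-factor product are all performed in float,
    // yet the result stays accurate because no rounded semi-perimeter is subtracted from
    // a nearly equal side, which is what amplified the error in the Heron version above.
    return (float)(Math.Sqrt((a + b + c) * (b - a + c) * (a - b + c) * (a - c + b)) * 0.25);
}

// AreaStable(16777215f, 16777215f, 4f) -> about 3.355443E+7, accurate to within roughly one float ulp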