
C++ Float Division and Precision

I know that 511 divided by 512 actually equals 0.998046875. I also know that the precision of floats is 7 digits. My question is: when I do this math in C++ (GCC), the result I get is 0.998047, which is a rounded value. I'd prefer to get the truncated value, 0.998046. How can I do that?

  float a = 511.0f;
  float b = 512.0f;
  float c = a / b;
Nick Gotch asked May 14 '11 16:05



3 Answers

Well, here's one problem. The value of 511/512, as a float, is exact. No rounding is done. You can check this by asking for more than seven digits:

#include <stdio.h>
int main(int argc, char *argv[])
{
    float x = 511.0f, y = 512.0f;
    printf("%.15f\n", x/y);
    return 0;
}

Output:

0.998046875000000

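A float is stored not as a decimal number but as a binary one. If you divide a number by a power of 2, such as 512, the result will almost always be exact, because only the exponent changes (the exception is a result so small that it underflows). The precision of a float is not simply 7 decimal digits; it is really 24 bits of precision: 23 bits stored explicitly plus one implicit leading bit.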
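A minimal sketch for checking both claims on your own machine, assuming `float` is an IEEE-754 single-precision type. The helper names `float_significand_bits` and `division_is_exact` are made up for this example. The exactness test uses `std::fma`, which computes `quotient * b - a` with a single rounding, so the residual is zero only when the division lost nothing (ignoring overflow and underflow).

```cpp
#include <cmath>
#include <limits>

// Number of binary digits in a float's significand, counting the implicit
// leading bit. This is 24 on IEEE-754 single precision.
constexpr int float_significand_bits() {
    return std::numeric_limits<float>::digits;
}

// Hypothetical helper: true when a / b was computed without any rounding.
// fma evaluates quotient * b - a exactly before its one final rounding,
// so a zero residual means quotient * b equals a exactly.
bool division_is_exact(float a, float b) {
    float quotient = a / b;
    return std::fma(quotient, b, -a) == 0.0f;
}
```

Under those assumptions, `division_is_exact(511.0f, 512.0f)` is true, while `division_is_exact(1.0f, 3.0f)` is false.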

See What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Dietrich Epp answered Oct 14 '22 04:10



I also know that the precision of floats is 7 digits.

No. The most common floating-point format is binary and has a precision of 24 bits. That is somewhere between 6 and 7 decimal digits, but you can't think in decimal if you want to understand how rounding works.

As b is a power of 2, c is exactly representable. It is during the conversion to a decimal representation that rounding occurs. The standard ways of getting a decimal representation don't offer the possibility of using truncation instead of rounding. One way would be to ask for one more digit and ignore it.
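For illustration, here is a sketch of a slightly different route than the extra-digit trick: scale the value, truncate with `std::trunc`, and only then print. The function name `truncate_to_6_decimals` is hypothetical. It relies on the observation that a float (24 significant bits) times 1e6 (14 significant bits once the factor of 64 is taken out) fits in a double's 53-bit significand, so the scaling step itself does not round.

```cpp
#include <cmath>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a float with six decimals, truncated toward
// zero instead of rounded.
std::string truncate_to_6_decimals(float value) {
    // The product needs at most 38 significant bits, so it is exact in double.
    double scaled = static_cast<double>(value) * 1e6;
    // Drop everything past the sixth decimal, then scale back.
    double truncated = std::trunc(scaled) / 1e6;
    char buffer[64];  // large enough for any finite float in fixed notation
    std::snprintf(buffer, sizeof buffer, "%.6f", truncated);
    return buffer;
}
```

With this, `truncate_to_6_decimals(511.0f / 512.0f)` yields `0.998046`. Unlike dropping one extra printed digit, a carry cannot sneak in: `0.9999999f` comes out as `0.999999` rather than `1.000000`.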

But note that the fact that c is exactly representable is a property of its value. Some apparently simpler values (like 0.1) don't have an exact representation in binary FP formats.
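One way to see why 0.1 fails where 0.998046875 succeeds: a fraction has a finite binary expansion only if its denominator, in lowest terms, is a power of two. The sketch below checks just that condition. The name `has_finite_binary_expansion` is invented for this example, the denominator is assumed to be nonzero, and the check says nothing about whether the numerator also fits in 24 bits.

```cpp
#include <cstdint>

// Hypothetical helper: true when numerator / denominator can be written with
// finitely many binary digits. Assumes denominator != 0.
bool has_finite_binary_expansion(std::uint64_t numerator,
                                 std::uint64_t denominator) {
    // Euclid's algorithm for the greatest common divisor.
    std::uint64_t a = numerator;
    std::uint64_t b = denominator;
    while (b != 0) {
        std::uint64_t remainder = a % b;
        a = b;
        b = remainder;
    }
    std::uint64_t reduced = denominator / a;  // denominator in lowest terms
    // A power of two has exactly one bit set.
    return (reduced & (reduced - 1)) == 0;
}
```

Here 1/10 reduces to a denominator of 10, which is not a power of two, whereas 998046875/1000000000 reduces to 511/512.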

AProgrammer answered Oct 14 '22 04:10



Your question is not unique; it has been answered numerous times before. This is not a simple topic, and just because answers are posted doesn't necessarily mean they'll be of good quality. If you browse a little you'll find the really good stuff. And it will take you less time.

I bet someone will -1 me for commenting and not answering.

_____ Edit _____

What is fundamental to understanding floating point is to realize that everything is represented in binary digits. Because most people have trouble grasping this, they try to see it from the point of view of decimal digits.

On the subject of 511/512 you can start by looking at the value 1.0. In floating point this could be expressed as i.000000... * 2^0, where i is the implicit bit (always set to 1), multiplied by 2^0, i.e. equal to 1. Since 511/512 is less than 1 you need to start with the next lower power, -1, giving i.000000... * 2^-1, i.e. 0.5. Notice that the only thing that has changed is the exponent. If we want to express 511 in binary we get nine ones, 111111111, or in floating point with the implicit bit, i.11111111 * 2^8. Dividing by 512 (2^9) only lowers the exponent by 9, giving i.1111111100... * 2^-1.
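The significand/exponent split described above can be inspected with standard library calls. This is only a sketch: the `Decomposed` struct and the `decompose` function are names made up here, and the input is assumed to be a normal, nonzero, finite float.

```cpp
#include <cmath>

// Hypothetical result type for the example.
struct Decomposed {
    float significand;  // magnitude in [1, 2) for normal, nonzero inputs
    int exponent;       // power of two applied to the significand
};

// Splits value into significand * 2^exponent.
Decomposed decompose(float value) {
    int exponent = std::ilogb(value);                    // unbiased exponent
    float significand = std::scalbn(value, -exponent);   // exact rescaling
    return {significand, exponent};
}
```

For `511.0f / 512.0f` this gives an exponent of -1 and a significand of 1.99609375, which is binary 1.11111111 (511/256).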

How does this translate to 0.998046875?

Well, to begin with, the implicit bit represents 0.5 (or 2^-1), the first explicit bit 0.25 (2^-2), the next explicit bit 0.125 (2^-3), then 0.0625, 0.03125 and so on until you've represented the ninth bit (the eighth explicit one). Sum them up and you get 0.998046875. From i.11111111 we find that this number has 9 binary digits of precision and, coincidentally, 9 decimal digits of precision.
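The summation is easy to reproduce. A small sketch follows; `sum_of_leading_ones` is a name chosen for this example, and the arithmetic is done in double, where every term and partial sum is exact for up to 53 bits.

```cpp
#include <cmath>

// Adds 2^-1 + 2^-2 + ... + 2^-bit_count, one term per set bit.
double sum_of_leading_ones(int bit_count) {
    double total = 0.0;
    for (int k = 1; k <= bit_count; ++k) {
        total += std::ldexp(1.0, -k);  // ldexp(1.0, -k) is exactly 2^-k
    }
    return total;
}
```

`sum_of_leading_ones(9)` equals 0.998046875, which is also 1 - 2^-9.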

If you multiply 511/512 by 512 you will get i.1111111100... * 2^8. Here there are the same nine binary digits of precision but only three decimal digits (for 511).

Consider i.11111111111111111111111 (i + 23 ones) * 2^-1. We will get the fraction (2^24 - 1) / 2^24, with 24 binary and 24 decimal digits of precision. Given an appropriate printf format, all 24 decimal digits will be displayed. Multiply it by 2^24 and you still have 24 binary digits of precision but only 8 decimal (for 16777215).

Now consider i.1111100... * 2^2, which comes out to 7.875. i11 is the integer part (binary 111, i.e. 7) and 111 the fraction part (binary 111/1000, i.e. 7/8). That is 6 binary digits of precision and 4 decimal.

Thinking decimal when doing floating-point is utterly detrimental to understanding it. Free yourself!

Olof Forshell answered Oct 14 '22 03:10
