C++ Float Division and Precision

Question

I know that 511 divided by 512 actually equals 0.998046875. I also know that the precision of floats is 7 digits. My question is this: when I do this math in C++ (GCC), the result I get is 0.998047, which is a rounded value. I'd prefer to get the truncated value, 0.998046. How can I do that?

  float a = 511.0f;
  float b = 512.0f;
  float c = a / b;

Dietrich Epp · Accepted Answer

Well, here's one problem. The value of 511/512, as a float, is exact. No rounding is done. You can check this by asking for more than seven digits:

#include <stdio.h>
int main(int argc, char *argv[])
{
    float x = 511.0f, y = 512.0f;
    printf("%.15f\n", x/y);
    return 0;
}

Output:

0.998046875000000

A float is stored not as a decimal number but as a binary one. If you divide a number by a power of 2, such as 512, the result will almost always be exact. The precision of a float is not simply 7 decimal digits; it is really 23 explicitly stored bits (24 if you count the implicit leading bit).

See What Every Computer Scientist Should Know About Floating-Point Arithmetic.

AProgrammer · Answer

I also know that the precision of floats is 7 digits.

No. The most common floating-point format is binary and has a precision of 24 bits. That is somewhere between 6 and 7 decimal digits, but you can't think in decimal if you want to understand how rounding works.

As b is a power of 2, c is exactly representable. It is during the conversion to a decimal representation that rounding occurs. The standard ways of getting a decimal representation don't offer the possibility of using truncation instead of rounding. One way would be to ask for one more digit and ignore it.

But note that the fact that c is exactly representable is a property of its value. Some apparently simpler values (like 0.1) don't have an exact representation in binary FP formats.

Olof Forshell · Answer

Your question is not unique; it has been answered numerous times before. This is not a simple topic, and just because answers are posted doesn't necessarily mean they'll be of good quality. If you browse a little you'll find the really good stuff. And it will take you less time.

I bet someone will -1 me for commenting and not answering.

_____ Edit _____

What is fundamental to understanding floating point is to realize that everything is represented in binary digits. Because most people have trouble grasping this, they try to see it from the point of view of decimal digits.

On the subject of 511/512, you can start by looking at the value 1.0. In floating point this could be expressed as i.000000... * 2^0, where i is the implicit bit (set to 1), multiplied by 2^0, i.e. it equals 1. Since 511/512 is less than 1, you need to start with the next lower power, -1, giving i.000000... * 2^-1, i.e. 0.5. Notice that the only thing that has changed is the exponent. If we want to express 511 in binary we get 9 ones, 111111111, or in floating point with the implicit bit, i.11111111. Dividing by 512 only changes the exponent, to -1, giving i.1111111100... * 2^-1.

How does this translate to 0.998046875?

Well, to begin with, the implicit bit represents 0.5 (or 2^-1), the first explicit bit 0.25 (2^-2), the next explicit bit 0.125 (2^-3), then 0.0625, 0.03125 and so on until you've reached the ninth bit (the eighth explicit one). Sum them up and you get 0.998046875. From i.11111111 we find that this number has 9 binary digits of precision and, coincidentally, 9 decimal digits of precision.

If you multiply 511/512 by 512 you will get i.1111111100... * 2^8. Here there are the same nine binary digits of precision but only three decimal digits (for 511).

Consider i.11111111111111111111111 (i + 23 ones) * 2^-1. We will get a fraction, (2^24 - 1)/2^24, with 24 binary and 24 decimal digits of precision. Given appropriate printf formatting, all 24 decimal digits will be displayed. Multiply it by 2^24 and you still have 24 binary digits of precision but only 8 decimal (for 16777215).

Now consider i.1111100... * 2^2, which comes out to 7.875. i11 is the integer part and 111 the fraction part (binary 111/1000, or 7/8). That is 6 binary digits of precision and 4 decimal.

Thinking decimal when doing floating-point is utterly detrimental to understanding it. Free yourself!

Tags:

c++

math

floating-point

Nick Gotch
