 

Why does adding a small float to a large float just drop the small one?

Say I have:

float a = 3;           // (gdb) p/f a   = 3
float b = 299792458;   // (gdb) p/f b   = 299792448

then

float sum = a + b;     // (gdb) p/f sum = 299792448

I think it has something to do with the mantissa shifting around. Can someone explain exactly what's going on? (These are 32-bit floats.)

asked Mar 05 '14 by mharris7190



3 Answers

32-bit floats only have 24 bits of precision. Thus, a float cannot hold b exactly; it does the best job it can by picking the exponent and mantissa that get as close as possible¹. (That is the nearest representable float to the constant in the source; the default FP rounding mode is round-to-nearest.)
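You can see that first rounding step without gdb; here is a minimal C sketch (the variable name and printout are mine, not from the question):

#include <stdio.h>

int main(void) {
    float b = 299792458;   // int constant rounded to the nearest float
    printf("%.1f\n", b);   // prints 299792448.0, the nearest 24-bit-significand value
    return 0;
}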

When you then take the floating-point representations of b and a and try to add them, the addition operation shifts the small number a's mantissa downward to match b's exponent, to the point where the value (3) falls off the end and you're left with 0. The addition operator therefore ends up adding floating-point zero to b. (This is an over-simplification; low bits can still affect rounding if the mantissas partially overlap.)

In general, the infinite-precision addition result has to get rounded to the nearest float with the current FP rounding mode, and that happened to be equal to b.
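As a small sketch of both points (my own demo; link with -lm for nextafterf): the exact sum 299792451 is within half a float-spacing of b, so it rounds back to b, and the spacing between adjacent floats at that magnitude is 32.

#include <stdio.h>
#include <math.h>

int main(void) {
    float a = 3;
    float b = 299792458;                 // stored as 299792448.0f
    float sum = a + b;                   // exact result 299792451 rounds to nearest float
    printf("sum == b? %d\n", sum == b);  // prints 1
    // 1 ULP (the gap to the next representable float) at this magnitude:
    printf("ULP of b = %.1f\n", nextafterf(b, INFINITY) - b);  // prints 32.0
    return 0;
}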

See also Why adding big to small in floating point introduce more error? for cases where the number does change some, but with large rounding error; it uses decimal significant figures as a way to help understand binary float rounding.


Footnote 1: For numbers that large, the nearest two floats are 32 apart. Modern clang even warns when an int constant in the source gets rounded to a float that represents a different value, unless you already wrote it as a float or double constant (like 299792458.0f), in which case the rounding happens without a warning.
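For example, a recent clang reports something like the following (the exact wording may vary by version):

float b = 299792458;
// warning: implicit conversion from 'int' to 'float' changes value
// from 299792458 to 299792448 [-Wimplicit-const-int-float-conversion]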

That's why the smallest a value that will round sum up to 299792480.0f, instead of down to 299792448.0f, is about 16.000001 for that b value which rounded to 299792448.0f. (Runnable example on the Godbolt compiler explorer.)

The default FP rounding mode rounds to nearest, with even mantissa as a tie-break. 16.0 falls exactly half-way, and thus rounds to a bit-pattern of 0x4d8ef3c2, not up to 0x4d8ef3c3 (https://www.h-schmidt.net/FloatConverter/IEEE754.html is handy for inspecting these). Anything slightly greater than 16 rounds up, because rounding cares about the infinite-precision result; the hardware doesn't actually shift out bits before adding, that was an over-simplification.

The nearest float to 16.000001 has only the low bit set in its mantissa, bit-pattern 0x41800001: about 1.0000001192092896 × 2^4, i.e. 16.0000019... Much smaller and it would round to exactly 16.0f, which is half of b's 32-wide ULP (unit in the last place); that exact tie doesn't change b because b's mantissa is already even.
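A short sketch of that tie-break (my own demo, along the lines of the Godbolt example):

#include <stdio.h>

int main(void) {
    float b = 299792458;               // = 299792448.0f, even mantissa (0x4d8ef3c2)
    printf("%.1f\n", b + 16.0f);       // 299792448.0: exactly half-way, ties to even
    printf("%.1f\n", b + 16.000001f);  // 299792480.0: really 16.0000019f, past the tie
    return 0;
}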


If you avoid early rounding by using double a, b, the smallest value you can add that rounds sum up to 299792480.0f instead of down to 299792448.0f when you do float sum = a + b; is about a = 6.0000001;. That makes sense because the integer value ...58 stays as ...58.0 instead of rounding down to ...48.0f; i.e. the rounding error in float b = ...58 was -10, so a can be that much smaller.

There are two rounding steps this time, though: a + b rounds to the nearest double if that addition isn't exact, then that double rounds to a float. (Or, if FLT_EVAL_METHOD == 2, as when C compiles for 80-bit x87 floating point on 32-bit x86, the + result would round to 80-bit long double, then to float.)
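A sketch of the double version (names mine); note that a plain +6.0 lands exactly on the midpoint 299792464 and still rounds back down to the even mantissa:

#include <stdio.h>

int main(void) {
    double a = 6.0000001;
    double b = 299792458;                // exact as a double
    float sum = a + b;                   // a+b = 299792464.0000001, just past the
                                         // midpoint between the two candidate floats
    printf("%.1f\n", sum);               // 299792480.0
    printf("%.1f\n", (float)(b + 6.0));  // 299792448.0: exact tie, rounds to even
    return 0;
}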

answered by Chris McGrath


All you really need to know about the mechanics of rounding is that the result you get is the closest float to the correct answer (with some extra rules that decide what to do if the correct answer is exactly between two floats). It just so happens that the smaller number you added is less than half the distance between two floats at that scale, so the result is indistinguishable from the larger number you added. This is correct, to within the limits of float precision. If you want a better answer, use a better-precision data type, like double.
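For instance (a minimal sketch, not from the answer itself):

#include <stdio.h>

int main(void) {
    double a = 3;
    double b = 299792458;     // exact: a double has 53 significand bits
    printf("%.1f\n", a + b);  // prints 299792461.0; the small addend survives
    return 0;
}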

answered by hobbs


Floating-point numbers have limited precision. If you're using a float, you only have 32 bits to work with. Some of those bits are reserved for the sign and exponent, so you really get only 24 significant bits (23 of them explicitly stored). The number you give is too large to fit in those bits exactly, so the low bits get rounded away.

To make this a little more intuitive, suppose only 2 bits were available for the significand, so the representable values are m × 2^e with m between 0 and 3. We can represent 0, 1, 2, and 3 without trouble, but past that the exponent has to grow, and each time it does the representable numbers spread out twice as far: we get 4 and 6, but not 5. So, 4 + 1 = 4.
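Here is one way to play with that toy format in C (toy_round is my own invention, quantizing a non-negative value to a 2-bit significand; link with -lm):

#include <stdio.h>
#include <math.h>

// Round x >= 0 to the nearest toy value m * 2^e with m in {0,1,2,3}.
static double toy_round(double x) {
    if (x == 0) return 0;
    int e = (int)floor(log2(x)) - 1;     // scale so x / 2^e lands in [2, 4)
    double m = rint(x / ldexp(1.0, e));  // round to nearest, ties to even
    return m * ldexp(1.0, e);
}

int main(void) {
    printf("%g\n", toy_round(4 + 1));  // prints 4: 5 falls between 4 and 6, ties to even
    return 0;
}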

answered by Scott Lawrence