Integers and float precision

This is more of a numerical analysis question than a programming one, but I suppose some of you will be able to answer it.

In the sum of two floats, is there any precision lost? Why?

In the sum of a float and an integer, is there any precision lost? Why?

Thanks.

nunos asked Dec 07 '22 05:12


2 Answers

In the sum of two floats, is there any precision lost?

If the two floats differ in magnitude and both use the full precision range (about 7 decimal digits), then yes, you will see some loss in the last places.

Why?

This is because floats are stored in the form (sign) (mantissa) × 2^(exponent). If two values with different exponents are added, the smaller value has to be shifted to match the larger exponent, which leaves it fewer digits in the mantissa:

PS> [float]([float]0.00000001 + [float]1)
1
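
The same effect can be sketched in Python. Python's own float is 64-bit, so the helper names below (to_float32, add_float32) are hypothetical ones that round through a 32-bit float with the struct module; the addends are chosen only for illustration.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add_float32(a: float, b: float) -> float:
    """Add two values as 32-bit floats: round the inputs, add, round the result."""
    return to_float32(to_float32(a) + to_float32(b))

# 1e-8 is smaller than half the spacing between 32-bit floats near 1,
# so it disappears completely when aligned to the exponent of 1.
tiny_lost = add_float32(1.0, 1e-8)

# 1e-6 is large enough to survive, but only its leading bits fit.
small_kept = add_float32(1.0, 1e-6)

print(tiny_lost == 1.0)  # True: the small addend vanished
print(small_kept)        # close to, but not exactly, 1.000001
```

The second result is off in the seventh decimal place, which matches the "about 7 decimal digits" figure above.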

In the sum of a float and an integer, is there any precision lost?

Yes. A normal 32-bit integer can represent values exactly that do not fit exactly into a float. A float can still store approximately the same number, but no longer exactly. This only applies to numbers that are large enough, i.e. those that need more than 24 bits.

Why?

Because a float has 24 bits of precision and a 32-bit integer has 32. The float will still retain the magnitude and most of the significant digits, but the last places may differ (the exact sum below would be 2100000150):

PS> [float]2100000050 + [float]100
2100000100
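
A minimal Python sketch of the conversion step, assuming a hypothetical helper to_float32 that rounds through a 32-bit float with the struct module (Python's built-in float is 64-bit, so it would not show the loss by itself):

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

n = 2_100_000_050                 # fits in a 32-bit integer, needs 31 bits
as_float32 = to_float32(float(n))

print(int(as_float32))            # 2100000000
print(n - int(as_float32))        # 50: lost in the conversion, before any addition

# Integers up to 2**24 survive the round trip; 2**24 + 1 is the first that does not.
print(int(to_float32(float(2**24))) == 2**24)          # True
print(int(to_float32(float(2**24 + 1))) == 2**24 + 1)  # False
```

Near 2.1 billion, neighbouring 32-bit floats are 128 apart, so the trailing 50 cannot be stored.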

Joey answered Feb 25 '23 17:02


The precision depends on the magnitude of the original numbers. In floating point, the computer represents a number such as 312 internally in a form of scientific notation. Real hardware uses base 2, but base 10 is easier to read:

3.12000000000 * 10 ^ 2

The number of digits on the left-hand side (the mantissa) is fixed. The exponent also has an upper and a lower bound, and it is what allows very large or very small numbers to be represented.

If you add two numbers of the same magnitude, the result keeps the same precision, because the decimal point doesn't have to move:

312.0 + 643.0 <==>

3.12000000000 * 10 ^ 2 +
6.43000000000 * 10 ^ 2
-----------------------
9.55000000000 * 10 ^ 2

If you try to add a very big and a very small number, you lose precision because the result must be squeezed into the above format. Consider 312 + 12300000000000. First you have to scale the smaller number to line up with the bigger one, then add:

1.23000000000 * 10 ^ 13 +
0.00000000003 * 10 ^ 13
-----------------------
1.23000000003 * 10 ^ 13 <-- precision lost here! (the trailing 12 of 312 is gone)
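
The decimal alignment can be reproduced with Python's decimal module. This is only a sketch of the analogy: the 12-digit precision is an assumption matching the mantissa width drawn above, and the context rounds where the hand calculation truncates (the two agree for these numbers).

```python
from decimal import Context, Decimal

# Assumption for illustration: a format with 12 significant decimal digits,
# like the 3.12000000000 mantissa used in the hand-worked sums.
ctx = Context(prec=12)

# Same magnitude: nothing has to be shifted, so nothing is lost.
same_scale = ctx.add(Decimal("312"), Decimal("643"))

# Very different magnitudes: 312 is aligned to 10^13 and loses its last digits.
mixed_scale = ctx.add(Decimal("312"), Decimal("12300000000000"))

print(same_scale)        # 955
print(mixed_scale)       # 1.23000000003E+13
print(int(mixed_scale))  # 12300000000300, not 12300000000312
```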

Floating point can handle very large or very small numbers, but it can't keep both scales in a single value at the same time.

As for ints and doubles being added, the int gets turned into a double immediately, then the above applies.
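
As a rough check of the int-to-double case, here is a sketch in Python, assuming the usual case where Python's float is a 64-bit double with a 53-bit mantissa. The variable names are illustrative only.

```python
small = 2**31 - 1   # largest signed 32-bit integer
big = 2**53 + 1     # needs 54 bits, one more than a double's mantissa holds

# Every 32-bit integer converts to a double without loss.
print(float(small) == small)   # True

# A 64-bit integer this large does not: the conversion rounds it.
print(float(big) == big)       # False
print(int(float(big)))         # 9007199254740992, i.e. 2**53

# The sum that lost 50 as a 32-bit float is exact as a double.
exact_sum = float(2_100_000_050) + 100.0
print(exact_sum == 2_100_000_150)  # True
```

So with 32-bit ints and doubles, the conversion itself loses nothing; any loss comes from the addition, as described above.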

Matthew Herrmann answered Feb 25 '23 19:02