For IEEE-754 arithmetic, is there a guarantee of 0 or 1 units in the last place accuracy for reciprocals? From that, is there a guaranteed error-bound on the reciprocal of a reciprocal?
Because floating-point numbers have a limited number of digits, they cannot represent all real numbers accurately: when there are more digits than the format allows, the leftover ones are omitted - the number is rounded.
In floating point arithmetic, two extra bits are used to the far right of the significand, called the guard and round bits. At the end of the arithmetic calculation, these bits are rounded off. We always round towards the closer digit (i.e. 0.00-‐0.49 → 0 and 0.51-‐0.99 → 1).
In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation.
Almost every language has a floating-point datatype; computers from PC's to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow.
[Everything below assumes a fixed IEEE 754 binary format, with some form of round-to-nearest as the rounding-mode.]
Since reciprocal (computed as 1/x
) is a basic arithmetic operation, 1
is exactly representable, and the arithmetic operations are guaranteed correctly rounded by the standard, the reciprocal result is guaranteed to be within 0.5
units in the last place of the true value. (This applies to any of the basic arithmetic operations specified in the standard.)
The reciprocal of the reciprocal of a value x
is not guaranteed to be equal to x
, in general. A quick example with the IEEE 754 binary64 format:
>>> x = 1.8
>>> 1.0 / (1.0 / x)
1.7999999999999998
>>> x == 1.0 / (1.0 / x)
False
However, assuming that overflow and underflow are avoided, and that x
is finite, nonzero, and exactly representable, the following results are true:
The value of 1.0 / (1.0 / x)
will differ from x
by no more than 1 unit in the last place.
If the significand of x
(normalised to lie in the range [1.0, 2.0)
as usual) is smaller than sqrt(2)
, then the reciprocal does roundtrip: 1.0 / (1.0 / x) == x
.
Sketch of proof: without loss of generality we can assume that x
is positive, and scale x
by a power of two so that it lies in the range [1.0, 2.0)
. The results above are clearly true in the case that x
is an exact power of two, so let's assume it's not (this will be useful in the second step below). The proof below is given for precision specific to the IEEE 754 binary64 format, but it adapts directly to any IEEE 754 binary format.
Write 1 / x
for the true value of the reciprocal, before rounding, and let y
be the (unique, as it turns out) nearest representable binary64 float to 1 / x
. Then:
since y
is the closest float to 1 / x
, and both y
and 1/x
are in the binade [0.5, 1.0]
, where the spacing between successive floats is exactly 2^-53
, we have |y - 1/x| <= 2^-54
. In fact, we can do a bit better:
we actually have a strict inequality above: |y - 1/x| < 2^-54
. If |y - 1/x|
were exactly equal to 2^-54
, then 1/x
would be exactly representable in arbitrary-precision binary floating-point (since both y
and 2^-54
are). But the only binary floats x
for which 1/x
is exactly representable at some precision are powers of two, and we already excluded this case.
if x < sqrt(2)
then 1 / x > x / 2
, hence (rounding both to the nearest representable float), we have y >= x / 2
, so x / y <= 2
.
now x - 1/y = (y - 1/x) x/y
, and from the bounds on |y - 1/x|
and x/y
(still assuming that x < sqrt(2)
) we get |x - 1/y| < 2^-53
. It follows that x
is the closest representable float to 1/y
, 1/y
rounds to x
and the roundtrip succeeds. This completes the proof of part 2.
in the general case x < 2
, we have x / y < 4
, so |x - 1/y| < 2^-52
. That makes 1/y
at most 1 ulp away from x
, which completes the proof of part 1.
Here's a demonstration of the sqrt(2)
threshold: using Python, we take a million random floats in the range [1.0, 2.0)
, and identify those that don't roundtrip through the reciprocal. All of the samples smaller than sqrt(2)
pass the roundtrip.
>>> import random
>>> samples = [random.uniform(1.0, 2.0) for _ in range(10**6)]
>>> bad = [x for x in samples if 1.0 / (1.0 / x) != x]
>>> len(bad)
171279
>>> min(bad)
1.4150519879892107
>>> import math
>>> math.sqrt(2)
1.4142135623730951
And a demonstration that the maximum error is no more than 1 ulp, in general (for the binary64 format, in the binade [1.0, 2.0), 1 unit in the last place is 2^-52):
>>> samples = [random.uniform(1.0, 2.0) for _ in range(10**6)]
>>> max(abs(x - 1.0 / (1.0 / x)) for x in samples)
2.220446049250313e-16
>>> 2**-52
2.220446049250313e-16
Here's an example with the IEEE 754 binary64 format showing that the restriction about avoiding underflow is necessary:
>>> x = 1.3e308
>>> x_roundtrip = 1.0 / (1.0 / x)
>>> x.hex()
'0x1.72409614c1e6ap+1023'
>>> x_roundtrip.hex()
'0x1.72409614c1e6cp+1023'
Here the x_roundtrip
result differs from the original by two units in the last place, because 1 / x
was smaller than the smallest normal representable float, and so was not represented to the same precision as x
.
Final note: since IEEE 754-2008 covers decimal floating-point types as well, I should mention that the above proof carries over almost verbatim to the decimal case, establishing that for floats with significand smaller than sqrt(10)
, roundtrip occurs, while for general decimal floats (again avoiding overflow and underflow) we're never off by more than one unit in the last place. However, there's some number-theoretic finesse required to show that the key inequality |x - 1/y| < 1/2 10^(1-p)
is always strict: one ends up having to show that the quantity 1 + 16 10^(2p-1)
is never a square number (which is true, but it's probably outside the ambit of this site to include the proof here).
>>> from decimal import Decimal, getcontext
>>> import random
>>> getcontext().prec = 6
>>> samples = [+Decimal(random.uniform(1.0, 10.0)) for _ in range(10**6)]
>>> bad = [x for x in samples if 1 / (1 / x) != x]
>>> min(bad)
Decimal('3.16782')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With