Assume that t, a, b are all double (IEEE Std 754) variables, and that neither a nor b is NaN (but either may be Inf). After t = a - b, do I necessarily have a == b + t?
Absolutely not. One obvious case is a=DBL_MAX, b=-DBL_MAX. Then t=INFINITY, so b+t is also INFINITY.
What may be more surprising is that there are cases where this happens without any overflow. Basically, they're all of the form where a-b is inexact. For example, if a is DBL_EPSILON/4 and b is -1, a-b is 1 (assuming default rounding mode), and (a-b)+b is then 0, not a.
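Both counterexamples can be checked with a small sketch like the one below. The helper name roundtrip_holds is made up for illustration, and the code assumes the default round-to-nearest mode and no value-changing optimizations such as -ffast-math. The volatile qualifiers are there so each intermediate result is actually stored as a double rather than kept in a wider register.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when b + (a - b) compares equal to a.
   volatile forces every intermediate result to be rounded to double,
   so extended-precision registers (e.g. x87) cannot hide the rounding. */
bool roundtrip_holds(double a, double b)
{
    volatile double t = a - b;
    volatile double back = b + t;
    return a == back;
}

/* Overflow case: DBL_MAX - (-DBL_MAX) rounds to +infinity,
   and -DBL_MAX + infinity is still infinity, not DBL_MAX. */
bool overflow_case_holds(void)
{
    return roundtrip_holds(DBL_MAX, -DBL_MAX);
}

/* Inexact case: DBL_EPSILON/4 + 1 rounds to exactly 1,
   and 1 + (-1) is 0, not DBL_EPSILON/4. */
bool inexact_case_holds(void)
{
    return roundtrip_holds(DBL_EPSILON / 4, -1.0);
}
```

Under those assumptions, both overflow_case_holds() and inexact_case_holds() should return false, while a pair whose difference is exact, such as roundtrip_holds(3.0, 1.0), returns true.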
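The reason I mention this second example is that this is the canonical way of forcing rounding to a particular precision in IEEE arithmetic. For instance, if you have a number in the range [0,1) and want to force rounding it to a fixed number of bits after the binary point, you would add and then subtract a large power of two. With 0x1p49, adjacent doubles are 2^-3 apart, so the result is rounded to 3 fractional bits (use 0x1p48 for 4).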
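Here is a minimal sketch of that add-then-subtract trick. The function name round_to_frac_bits is hypothetical, and the code assumes a 53-bit double significand, round-to-nearest mode, an input in [0,1), and no -ffast-math. For a double in [2^e, 2^(e+1)) the spacing between neighbours is 2^(e-52), so adding 2^(52-bits) leaves only the requested number of bits after the binary point. With bits = 3 the constant is 2^49, i.e. 0x1p49.

```c
#include <math.h>

/* Hypothetical helper: round x (assumed to lie in [0, 1)) to `bits`
   bits after the binary point by adding and subtracting 2^(52 - bits).
   Near that magnitude, adjacent doubles are 2^-bits apart, so the
   addition itself performs the rounding (ties go to even). */
double round_to_frac_bits(double x, int bits)
{
    const double big = ldexp(1.0, 52 - bits);  /* 2^(52 - bits), exact */
    volatile double shifted = x + big;         /* low bits are rounded away here */
    return shifted - big;                      /* exact: undoes the shift */
}
```

For example, round_to_frac_bits(0.3, 3) gives 0.25 and round_to_frac_bits(0.7, 3) gives 0.75, since 0.3 and 0.7 lie closest to 2/8 and 6/8 respectively.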