The constants in <float.h>
for Apple clang version 12.0.0 (clang-1200.0.32.2)
don't seem to make sense.
DBL_MIN_EXP
is -1021
and DBL_MAX_EXP
is 1024
. However, that doesn't match what wikipedia says, "exponents range from −1022 to +1023, ..."
Also DBL_MIN_EXP
seems inconsistent with DBL_MIN
which is 2.2250738585072014e-308
which is equal to 2⁻¹⁰²²
sometimes written as 0x1.0000000000000p-1022
. So, we have an exponent smaller than the minimum -1021
.
Likewise, DBL_MIN_10_EXP
is -307
which doesn't sense given that DBL_MIN
has an exponent of e-308
.
The double DBL_MAX_EXP
value of 1024
overflows when used in real code. For example, ldexp(1.0, 1024)
gives inf
.
Here's my C code:
#include <float.h>
#include <stdio.h>
#include <math.h>
#define SHOW_DOUBLE(s) printf("%.17lg \t%s\n", s, #s);
#define SHOW_INT(s) printf("%d \t%s\n", s, #s);
int
main()
{
SHOW_DOUBLE(DBL_MAX);
SHOW_DOUBLE(DBL_MIN);
SHOW_DOUBLE(DBL_EPSILON);
SHOW_INT(DBL_MAX_EXP);
SHOW_INT(DBL_MAX_10_EXP);
SHOW_INT(DBL_MIN_EXP);
SHOW_INT(DBL_MIN_10_EXP);
SHOW_INT(DBL_DIG);
SHOW_INT(DBL_MANT_DIG);
SHOW_INT(FLT_RADIX);
SHOW_INT(FLT_ROUNDS);
printf("%lf\n", ldexp(1.0, 1024));
return 0;
}
And here is the output:
1.7976931348623157e+308 DBL_MAX
2.2250738585072014e-308 DBL_MIN
2.2204460492503131e-16 DBL_EPSILON
1024 DBL_MAX_EXP
308 DBL_MAX_10_EXP
-1021 DBL_MIN_EXP
-307 DBL_MIN_10_EXP
15 DBL_DIG
53 DBL_MANT_DIG
2 FLT_RADIX
1 FLT_ROUNDS
inf
The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y , Z = X * (1/Y) .
Floating-point decimal values generally do not have an exact binary representation due to how the CPU represents floating point data. For this reason, you may experience a loss of precision, and some floating-point operations may produce unexpected results.
Real numbers are represented in C by the floating point types float, double, and long double. Just as the integer types can't represent all integers because they fit in a bounded number of bytes, so also the floating-point types can't represent all real numbers.
A "floating-point constant" is a decimal number that represents a signed real number. The representation of a signed real number includes an integer portion, a fractional portion, and an exponent. Use floating-point constants to represent floating-point values that can't be changed.
The off-by-one is part of the spec. From 5.2.4.2.2 Characteristics of floating types <float.h>, ¶11,
...
- minimum negative integer such that FLT_RADIX raised to one less than that power is a normalized floating-point number, emin
- FLT_MIN_EXP
- DBL_MIN_EXP
- LDBL_MIN_EXP
...
- maximum integer such that FLT_RADIX raised to one less than that power is a representable finite floating-point number, emax
- FLT_MAX_EXP
- DBL_MAX_EXP
- LDBL_MAX_EXP
Emphasis on one less than is mine.
I asked myself the same question and realized that this stems from the fact that IEEE 754 and C use two different normalized forms:
x = sign × significand × 2exponent, encoded as S|E|F where sign = (−1)S; significand =
- 1.F if Emin ≤ E ≤ Emax (normal number),
- 0 if E = Emin − 1 and F = 0 (zero),
- 0.F if E = Emin − 1 and F ≠ 0 (subnormal number),
- ∞ if E = Emax + 1 and F = 0 (infinity),
- NaN if E = Emax + 1 and F ≠ 0 (not a number);
exponent = max(E, Emin) − exponent bias.
x = sign′ × significand′ × 2exponent′, encoded as S|E|F where sign′ = (−1)S; significand′ =
- 0.1F if Emin ≤ E ≤ Emax (normal number),
- 0 if E = Emin − 1 and F = 0 (zero),
- 0.0F if E = Emin − 1 and F ≠ 0 (subnormal number),
- ∞ if E = Emax + 1 and F = 0 (infinity),
- NaN if E = Emax + 1 and F ≠ 0 (not a number);
exponent′ = max(E + 1, Emin + 1) − exponent bias.
Since x = sign × significand × 2exponent = sign′ × significand′ × 2exponent′, the following relations between the IEEE 754 normalized form and the C normalized form hold:
In particular, in binary64 format that is why exponentmin = −1022 in IEEE 754 normalized form and exponent′min = −1021 in C normalized form (DBL_MIN_EXP
), and exponentmax = 1023 in IEEE 754 normalized form and exponent′max = 1024 in C normalized form (DBL_MAX_EXP
).
In Python:
math.ldexp(x, -math.floor(math.log2(x))), math.floor(math.log2(x))
math.frexp(x)
For instance:
>>> import math
>>> x = 12
>>> math.ldexp(x, -math.floor(math.log2(x))), math.floor(math.log2(x))
(1.5, 3)
>>> math.frexp(12)
(0.75, 4)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With