Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Some C Floating Point Constants Don't Make Sense

The constants in <float.h> for Apple clang version 12.0.0 (clang-1200.0.32.2) don't seem to make sense.

DBL_MIN_EXP is -1021 and DBL_MAX_EXP is 1024. However, that doesn't match what wikipedia says, "exponents range from −1022 to +1023, ..."

Also DBL_MIN_EXP seems inconsistent with DBL_MIN which is 2.2250738585072014e-308 which is equal to 2⁻¹⁰²² sometimes written as 0x1.0000000000000p-1022. So, we have an exponent smaller than the minimum -1021.

Likewise, DBL_MIN_10_EXP is -307 which doesn't sense given that DBL_MIN has an exponent of e-308.

The double DBL_MAX_EXP value of 1024 overflows when used in real code. For example, ldexp(1.0, 1024) gives inf.

Here's my C code:

#include <float.h>
#include <stdio.h>
#include <math.h>

#define SHOW_DOUBLE(s)   printf("%.17lg \t%s\n", s, #s);
#define SHOW_INT(s)      printf("%d \t%s\n", s, #s);

int
main()
{
    SHOW_DOUBLE(DBL_MAX);
    SHOW_DOUBLE(DBL_MIN);
    SHOW_DOUBLE(DBL_EPSILON);
    SHOW_INT(DBL_MAX_EXP);
    SHOW_INT(DBL_MAX_10_EXP);
    SHOW_INT(DBL_MIN_EXP);
    SHOW_INT(DBL_MIN_10_EXP);
    SHOW_INT(DBL_DIG);
    SHOW_INT(DBL_MANT_DIG);
    SHOW_INT(FLT_RADIX);
    SHOW_INT(FLT_ROUNDS);
    printf("%lf\n", ldexp(1.0, 1024));
    return 0;
}

And here is the output:

1.7976931348623157e+308 DBL_MAX
2.2250738585072014e-308 DBL_MIN
2.2204460492503131e-16  DBL_EPSILON
1024                    DBL_MAX_EXP
308                     DBL_MAX_10_EXP
-1021                   DBL_MIN_EXP
-307                    DBL_MIN_10_EXP
15                      DBL_DIG
53                      DBL_MANT_DIG
2                       FLT_RADIX
1                       FLT_ROUNDS
inf
like image 381
Raymond Hettinger Avatar asked Oct 16 '20 20:10

Raymond Hettinger


People also ask

Why do computers mess up floating-point math?

The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y , Z = X * (1/Y) .

Why are floating-point numbers inaccurate?

Floating-point decimal values generally do not have an exact binary representation due to how the CPU represents floating point data. For this reason, you may experience a loss of precision, and some floating-point operations may produce unexpected results.

How does C represent floating-point numbers?

Real numbers are represented in C by the floating point types float, double, and long double. Just as the integer types can't represent all integers because they fit in a bounded number of bytes, so also the floating-point types can't represent all real numbers.

Which is a valid floating-point constant?

A "floating-point constant" is a decimal number that represents a signed real number. The representation of a signed real number includes an integer portion, a fractional portion, and an exponent. Use floating-point constants to represent floating-point values that can't be changed.


2 Answers

The off-by-one is part of the spec. From 5.2.4.2.2 Characteristics of floating types <float.h>, ¶11,

...

  • minimum negative integer such that FLT_RADIX raised to one less than that power is a normalized floating-point number, emin
    • FLT_MIN_EXP
    • DBL_MIN_EXP
    • LDBL_MIN_EXP

...

  • maximum integer such that FLT_RADIX raised to one less than that power is a representable finite floating-point number, emax
    • FLT_MAX_EXP
    • DBL_MAX_EXP
    • LDBL_MAX_EXP

Emphasis on one less than is mine.

like image 186
R.. GitHub STOP HELPING ICE Avatar answered Oct 14 '22 05:10

R.. GitHub STOP HELPING ICE


I asked myself the same question and realized that this stems from the fact that IEEE 754 and C use two different normalized forms:

  • IEEE 754 uses a significand between 1 (inclusive) and 2 (exclusive) for normal numbers:

x = sign × significand × 2exponent, encoded as S|E|F where sign = (−1)S; significand =

  • 1.F if EminEEmax (normal number),
  • 0 if E = Emin − 1 and F = 0 (zero),
  • 0.F if E = Emin − 1 and F ≠ 0 (subnormal number),
  • ∞ if E = Emax + 1 and F = 0 (infinity),
  • NaN if E = Emax + 1 and F ≠ 0 (not a number);

exponent = max(E, Emin) − exponent bias.

  • C uses a significand′ between 0.5 (inclusive) and 1 (exclusive) for normal numbers:

x = sign′ × significand′ × 2exponent′, encoded as S|E|F where sign′ = (−1)S; significand′ =

  • 0.1F if EminEEmax (normal number),
  • 0 if E = Emin − 1 and F = 0 (zero),
  • 0.0F if E = Emin − 1 and F ≠ 0 (subnormal number),
  • ∞ if E = Emax + 1 and F = 0 (infinity),
  • NaN if E = Emax + 1 and F ≠ 0 (not a number);

exponent′ = max(E + 1, Emin + 1) − exponent bias.

Since x = sign × significand × 2exponent = sign′ × significand′ × 2exponent′, the following relations between the IEEE 754 normalized form and the C normalized form hold:

  • sign′ = sign;
  • significand′ = significand/2;
  • exponent′ = exponent + 1.

In particular, in binary64 format that is why exponentmin = −1022 in IEEE 754 normalized form and exponent′min = −1021 in C normalized form (DBL_MIN_EXP), and exponentmax = 1023 in IEEE 754 normalized form and exponent′max = 1024 in C normalized form (DBL_MAX_EXP).

In Python:

  • the IEEE 754 normalized significand and exponent can be extracted with:
math.ldexp(x, -math.floor(math.log2(x))), math.floor(math.log2(x))
  • the C normalized significand′ and exponent′ can be extracted with:
math.frexp(x)

For instance:

>>> import math
>>> x = 12
>>> math.ldexp(x, -math.floor(math.log2(x))), math.floor(math.log2(x))
(1.5, 3)
>>> math.frexp(12)
(0.75, 4)
like image 32
Maggyero Avatar answered Oct 14 '22 05:10

Maggyero