
Floating point limits (double) defined with long double suffix L

1. Question:

I have a question about the definitions of DBL_MAX and DBL_MIN on Linux with gcc v4.8.5.
They are defined in float.h as:

#define DBL_MAX     __DBL_MAX__
#define DBL_MIN     __DBL_MIN__

where __DBL_MIN__ and __DBL_MAX__ are compiler specific and can be obtained by:

$ gcc -dM -E - < /dev/null
...
#define __DBL_MAX__ ((double)1.79769313486231570815e+308L)
#define __DBL_MIN__ ((double)2.22507385850720138309e-308L)
...

My question is:
Why are the values defined as long double with the suffix L and then cast back to a double?

2. Question:

Why is __DBL_MIN_10_EXP__ defined as -307 when the exponent used in the DBL_MIN macro above is -308? For the maximum, the exponent is defined as 308, which I can understand because that is the exponent used by the DBL_MAX macro.

#define __DBL_MAX_10_EXP__ 308
#define __DBL_MIN_10_EXP__ (-307)

Not part of the question, just observations I made:

By the way, on Windows with Visual Studio 2015 only the DBL_MAX and DBL_MIN macros are defined, without the compiler-specific redirection to the versions with the underscores. Further, the literals for the minimum positive double value DBL_MIN and the maximum double value DBL_MAX are a little bit greater than the ones from my Linux gcc compiler (compared only against the macros from gcc v4.8.5 above):

#define DBL_MAX        1.7976931348623158e+308
#define DBL_MIN        2.2250738585072014e-308

Moreover, the Microsoft compiler sets the long double limits to the values of a double, so it seems that it doesn't support a real long double implementation.

Andre Kampling asked Jun 28 '17 12:06



2 Answers

Specifying binary floating point numbers in decimal has subtle issues.

Why are the values defined as long double with the suffix L and then cast back to a double?

With typical binary64, the maximum finite value is about 1.798e+308, or exactly:

179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368

The number of significant digits needed to convert to a unique double may be as many as DBL_DECIMAL_DIG (typically 17 and at least 10). In any case, using exponential notation is certainly clear without being overly precise.

/* 1 2345678901234567 */       // Digit count; values sorted ascending
1.79769313486231550856124...   // Next smaller double below DBL_MAX, for reference
1.79769313486231570814527...   // Exact DBL_MAX
1.79769313486231570815e+308L   // gcc
1.7976931348623158e+308        // VS (just a hair closer to exact than to the "next largest")
1.7976931348623159077293...    // Next larger double, if not limited by range

Various compilers may not convert this string exactly as hoped. Some ignore a few of the least significant digits, although this behavior is controlled by the compiler.

Another source of subtle conversion differences, and I expect this is why the L is added, is that the conversion of the text to a double is affected by the processor's floating-point unit, which might not adhere exactly to the IEEE standard. In the worst case, minute errors while converting the text to a double using double math could turn the 1.797...e+308 constant into infinity. When the text is converted to a long double instead, the conversion errors are very small relative to double precision. Converting the long double result to double then rounds to the hoped-for number.

In short, forcing L math ensures the constant is not inadvertently made an infinity.

I would expect the following, which matches neither gcc nor VS, to be sufficient with an FPU that complies with the IEEE 754 standard.

#define __DBL_MAX__ 1.7976931348623157e+308

The cast back to double makes DBL_MAX a double. This meets the expectation of much existing code that DBL_MAX is a double and not a long double. I see no specification that requires this, though.

Why is the DBL_MIN_10_EXP defined with -307 but the minimum exponent is -308?

That is to comply with the definition of DBL_MIN_10_EXP: "... minimum negative integer such that 10 raised to that power is in the range of normalized floating-point numbers". The non-integer answer is between -307 and -308, so the minimum integer in range is -307.

observation part

Although VS treats long double as a distinct type, the same encoding as double is used, thus there is no numeric advantage in using L.

chux - Reinstate Monica answered Sep 20 '22 17:09


I don't know why the L suffix is used.

This site has an overview of IEEE 754 floating point.

The exponent field is 11 bits, with an offset of 1023. However, the field values 0 and 2047 are reserved for special numbers. So this means that the exponent can vary from 2046-1023=1023 down to 1-1023=-1022.

So for the max normalized value we have an exponent of 1023, that is, a scale factor of 2^1023. The max value for the mantissa is just below 2 (1.111 etc. with 52 1s after the point, in binary), which gives ~2*2^1023 = ~1.79e308.

For the min normalized value we have an exponent of -1022, that is, a scale factor of 2^-1022. The min mantissa is exactly 1, giving us a value of 1*2^-1022 = ~2.22e-308. So far so good.

DBL_MIN_10_EXP and DBL_MAX_10_EXP are the min/max exponents of 10 that are normalized. For the max 1e308 is less than ~1.79e308 so the value is 308. For the min, 1e-308 is too small - it is lower than ~2.22e-308. 1e-307 is greater than ~2.22e-308 so the value is -307.

Paul Floyd answered Sep 20 '22 17:09