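According to Wikipedia, the layouts of the different precision data types are (sign / exponent / fraction bits): single precision 1 / 8 / 23, double precision 1 / 11 / 52, quadruple precision 1 / 15 / 112.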
I wrote a small program to output the numerical limits for float, double and long double in C++ (compiled with g++):
#include <iostream>
#include <limits>
#include <string>

// Print the bit size, epsilon, smallest normal value and largest value of T.
template <typename T>
void print(const std::string& name) {
    std::cout << name << " (" << sizeof(T) * 8 << "): "
              << std::numeric_limits<T>::epsilon() << "\t"
              << std::numeric_limits<T>::min() << "\t"
              << std::numeric_limits<T>::max() << std::endl;
}

int main() {
    std::cout.precision(5);
    print<float>("float");
    print<double>("double");
    print<long double>("long double");
    return 0;
}
which outputs (I have run it on multiple machines with the same result)
float (32): 1.1921e-07 1.1755e-38 3.4028e+38
double (64): 2.2204e-16 2.2251e-308 1.7977e+308
long double (128): 1.0842e-19 3.3621e-4932 1.1897e+4932
The upper limits coincide with 2^(2^(e-1)), where e is the number of exponent bits, and for float and double epsilon coincides with 2^(-f), where f is the number of fraction bits. By that logic, however, epsilon for long double should be roughly 1.9259e-34, i.e. 2^(-112).
Does anyone know why it isn't?
How can this be measured? For any format, the machine epsilon is the difference between 1 and the next larger number that can be stored in that format. For single precision, 2^(-23) ≈ 1.19 × 10^(-7), i.e., we can store approximately 7 decimal digits of a number x.
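The definition above can be checked directly. The following is a minimal sketch, assuming a C++11 compiler; `gap_above_one` and `gap_matches_epsilon` are hypothetical helper names introduced only for illustration. It uses `std::nextafter` to find the neighbour of 1 and compares the gap with what `std::numeric_limits` reports.

```cpp
#include <cmath>   // std::nextafter
#include <limits>  // std::numeric_limits

// Gap between 1 and the next representable value above it in type T.
// (Hypothetical helper name, not part of any library.)
template <typename T>
T gap_above_one() {
    const T one = 1;
    const T next = std::nextafter(one, T(2));  // smallest T greater than 1
    return next - one;                         // exact, since both are neighbours
}

// True when the measured gap equals the epsilon the standard library reports.
template <typename T>
bool gap_matches_epsilon() {
    return gap_above_one<T>() == std::numeric_limits<T>::epsilon();
}
```

Calling `gap_matches_epsilon<float>()`, `gap_matches_epsilon<double>()` and `gap_matches_epsilon<long double>()` from `main` should each return true on a typical IEEE-style implementation.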
In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision.
In C, machine epsilon is specified in the standard header <float.h> with the names FLT_EPSILON, DBL_EPSILON, and LDBL_EPSILON. Those three macros give the machine epsilon for the float, double, and long double types, respectively.
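From C++ the same macros are reachable through `<cfloat>`. As a small sketch (the function name `c_macros_match_numeric_limits` is made up for this example), one can confirm that they agree with the `std::numeric_limits` values used in the question:

```cpp
#include <cfloat>  // FLT_EPSILON, DBL_EPSILON, LDBL_EPSILON
#include <limits>  // std::numeric_limits

// Returns true when the C macros and the C++ traits report the same epsilons.
// (Hypothetical helper name, for illustration only.)
bool c_macros_match_numeric_limits() {
    return FLT_EPSILON  == std::numeric_limits<float>::epsilon()
        && DBL_EPSILON  == std::numeric_limits<double>::epsilon()
        && LDBL_EPSILON == std::numeric_limits<long double>::epsilon();
}
```

So printing LDBL_EPSILON would be expected to show the same 1.0842e-19 as the program in the question, not the quadruple-precision value.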
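Double precision (double): base 2, 53-bit significand (one bit is implicit), 2^(-53) ≈ 1.11e-16. This figure uses the rounding-based definition (unit roundoff) and is half of the 2.2204e-16 that std::numeric_limits<double>::epsilon() reports.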
long double is not guaranteed to be implemented as IEEE-754 quadruple precision. The C++ reference reads:
"long double - extended precision floating point type. Does not necessarily map to types mandated by IEEE-754. Usually 80-bit x87 floating point type on x86 and x86-64 architectures."
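If long double is implemented as 80-bit x86 extended precision, then epsilon is 2^(-63) = 1.0842e-19. This is the value you get as the output. The 128 reported by sizeof(T) * 8 counts padding as well: on x86-64 the 80-bit value is stored in 16 bytes.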
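To see which format a given compiler actually uses, it can help to look at the significand width instead of the storage size. Below is a rough sketch, assuming a binary floating-point format; `epsilon_from_digits` and `storage_bits` are hypothetical names chosen for this example. It derives epsilon as 2^(1 - digits) from `std::numeric_limits<T>::digits`.

```cpp
#include <climits>  // CHAR_BIT
#include <cmath>    // std::ldexp
#include <limits>   // std::numeric_limits

// Epsilon predicted from the significand width: 2^(1 - digits).
// digits counts the implicit or explicit leading bit, so it is f + 1.
template <typename T>
T epsilon_from_digits() {
    static_assert(std::numeric_limits<T>::radix == 2,
                  "this sketch assumes a binary format");
    return std::ldexp(T(1), 1 - std::numeric_limits<T>::digits);
}

// Storage size in bits, which may include padding beyond the bits in use.
template <typename T>
int storage_bits() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}
```

With g++ on x86-64 one would expect `std::numeric_limits<long double>::digits` to be 64 while `storage_bits<long double>()` is 128, which matches the 80-bit explanation; a true quadruple-precision type would report 113 digits.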
Some compilers support the __float128 type, which has quadruple precision. In GCC, long double becomes an alias for __float128 if the -mlong-double-128 command line option is used, and on x86_64 targets __float128 is guaranteed to be the IEEE quadruple precision type (implemented in software).
std::numeric_limits is not specialized for __float128. To get the value of epsilon, the following trick can be used (assuming a little-endian machine):
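// Requires <cstdint>, <cstring> and <iostream>.
__float128 f1 = 1, f2 = 1;     // 1.q: low-order significand bits are ...00000000
std::uint8_t u = 1;
std::memcpy(&f2, &u, 1);       // set the lowest byte: 1.q + eps -> ...00000001
std::cout << double(f2 - f1);  // Output with precision(5): 1.9259e-34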
With GCC you can use libquadmath:
#include <quadmath.h>
...
std::cout << (double)FLT128_EPSILON;
to get the same output.
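If neither the endianness assumption nor libquadmath is desirable, epsilon can also be measured with plain arithmetic. This is only a sketch: `measured_epsilon`, `matches_library` and `float128_epsilon_as_double` are hypothetical names, the `__SIZEOF_FLOAT128__` guard is an assumption about the toolchain (GCC predefines it on targets that provide __float128), and the loop assumes round-to-nearest without excess intermediate precision.

```cpp
#include <limits>  // std::numeric_limits

// Halve a candidate until adding half of it to 1 no longer changes 1.
// (Hypothetical helper name, for illustration only.)
template <typename T>
T measured_epsilon() {
    const T one = 1;
    T eps = 1;
    while (true) {
        const T probe = one + eps / 2;  // stored in a T so the sum is rounded to T
        if (probe == one) break;        // eps / 2 is lost, so eps is the gap
        eps /= 2;
    }
    return eps;
}

// Sanity check against the library for types that numeric_limits knows about.
template <typename T>
bool matches_library() {
    return measured_epsilon<T>() == std::numeric_limits<T>::epsilon();
}

#ifdef __SIZEOF_FLOAT128__  // assumed to signal that __float128 is available
// Epsilon of __float128, converted to double so it can be printed with std::cout.
inline double float128_epsilon_as_double() {
    return static_cast<double>(measured_epsilon<__float128>());
}
#endif
```

Where __float128 is available, `float128_epsilon_as_double()` should come out as 2^(-112), the value the question expected for a true quadruple-precision type.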