Epsilon in quadruple precision (gcc)

Tags: c++, epsilon, quadruple-precision

According to Wikipedia, the layouts of the different precision data types are:

single precision: exponent (e): 8 bits, fraction (f): 23 bits
double precision: e: 11 bits, f: 52 bits
quadruple precision: e: 15 bits, f: 112 bits.

I wrote a small program to output the numerical limits for float, double and long double in C++ (compiled with g++):

#include<iostream>
#include<limits>
#include<string>

template<typename T>
void print(const std::string& name) {
    // sizeof(T) * 8 is the storage size of T in bits
    std::cout << name << " (" << sizeof(T) * 8 << "): "
              << std::numeric_limits<T>::epsilon() << "\t"
              << std::numeric_limits<T>::min() << "\t"
              << std::numeric_limits<T>::max() << std::endl;
}

int main() {
    std::cout.precision(5);
    print<float>("float");
    print<double>("double");
    print<long double>("long double");
    return 0;
}

which outputs (I have run it on multiple machines with the same result):

float (32): 1.1921e-07  1.1755e-38  3.4028e+38
double (64): 2.2204e-16 2.2251e-308 1.7977e+308
long double (128): 1.0842e-19   3.3621e-4932    1.1897e+4932

The upper limits coincide with 2^(2^(e-1)), and for float and double, epsilon coincides with 2^(-f). By that logic, however, epsilon for long double should be 2^(-112), roughly 1.9259e-34.

Does anyone know why it isn't?

asked Nov 28 '19 11:11

okruz

1 Answer

long double is not guaranteed to be implemented as IEEE 754 quadruple precision. The C++ reference reads:

long double - extended precision floating point type. Does not necessarily map to types mandated by IEEE-754. Usually 80-bit x87 floating point type on x86 and x86-64 architectures.

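If long double is implemented as 80-bit x87 extended precision, it has 63 fraction bits, so epsilon is 2^-63 = 1.0842e-19. This is the value you get as the output. The 128 printed by sizeof(T) * 8 is only the storage size: on x86-64 the 80-bit value is padded to 16 bytes, so it says nothing about the precision.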

Some compilers support a __float128 type that has quadruple precision. In GCC, long double becomes an alias for __float128 if the -mlong-double-128 command-line option is used, and on x86_64 targets __float128 is guaranteed to be the IEEE quadruple-precision type (implemented in software).

std::numeric_limits is not specialized for __float128. To get the value of epsilon, the following trick can be used (it assumes a little-endian machine, where the first byte in memory holds the lowest fraction bits):

#include <cstdint>   // std::uint8_t
#include <cstring>   // std::memcpy
#include <iostream>
...
__float128 f1 = 1, f2 = 1;      // 1.q       -> ...00000000
std::uint8_t u = 1;
std::memcpy(&f2, &u, 1);        // 1.q + eps -> ...00000001
std::cout << double(f2 - f1);   // Output: 1.9259e-34

With GCC you can use libquadmath:

#include <quadmath.h>
...
std::cout << (double)FLT128_EPSILON;

to get the same output.

answered Sep 19 '22 22:09

Evg
