long double (GCC specific) and __float128

Tags:

x86

gcc

long-double

extended-precision

quadruple-precision

I'm looking for detailed information on long double and __float128 in GCC/x86 (more out of curiosity than because of an actual problem).

Few people will probably ever need these (I've just, for the first time ever, truly needed a double), but I guess it is still worthwhile (and interesting) to know what you have in your toolbox and what it's about.

In that light, please excuse my somewhat open questions:

1. Could someone explain the implementation rationale and intended usage of these types, also in comparison with each other? For example, are they "embarrassment implementations" because the standard allows for the type, and someone might complain if they're only just the same precision as double, or are they intended as first-class types?
2. Alternatively, does someone have a good, usable web reference to share? A Google search on "long double" site:gcc.gnu.org/onlinedocs didn't give me much that's truly useful.
3. Assuming that the common mantra "if you believe that you need double, you probably don't understand floating point" does not apply, i.e. you really need more precision than just float, and one doesn't care whether 8 or 16 bytes of memory are burnt... is it reasonable to expect that one can as well just jump to long double or __float128 instead of double without a significant performance impact?
4. The "extended precision" feature of Intel CPUs has historically been a source of nasty surprises when values were moved between memory and registers. If actually 96 bits are stored, the long double type should eliminate this issue. On the other hand, I understand that the long double type is mutually exclusive with -mfpmath=sse, as there is no such thing as "extended precision" in SSE. __float128, on the other hand, should work just perfectly fine with SSE math (though in absence of quad precision instructions certainly not on a 1:1 instruction base). Am I right in these assumptions?

(3. and 4. can probably be figured out with some work spent on profiling and disassembling, but maybe someone else had the same thought previously and has already done that work.)

Background (this is the TL;DR part):
I initially stumbled over long double because I was looking up DBL_MAX in <float.h>, and incidentally LDBL_MAX is on the next line. "Oh look, GCC actually has 128 bit doubles, not that I need them, but... cool" was my first thought. Surprise, surprise: sizeof(long double) returns 12... wait, you mean 16?

The C and C++ standards unsurprisingly do not give a very concrete definition of the type. C99 (6.2.5 10) says that the numbers of double are a subset of long double whereas C++03 states (3.9.1 8) that long double has at least as much precision as double (which is the same thing, only worded differently). Basically, the standards leave everything to the implementation, in the same manner as with long, int, and short.

Wikipedia says that GCC uses "80-bit extended precision on x86 processors regardless of the physical storage used".

The GCC documentation states, all on the same page, that the size of the type is 96 bits because of the i386 ABI, but that no more than 80 bits of precision are enabled by any option (huh? what?), and also that Pentium and newer processors want the values aligned as 128-bit numbers. This is the default under 64 bits and can be manually enabled under 32 bits, resulting in 32 bits of zero padding.

Time to run a test:

#include <stdio.h>
#include <cfloat>

int main()
{
#ifdef  USE_FLOAT128
    typedef __float128  long_double_t;
#else
    typedef long double long_double_t;
#endif

    long_double_t ld;

    int* i = (int*) &ld;
    i[0] = i[1] = i[2] = i[3] = 0xdeadbeef;

    for(ld = 0.0000000000000001; ld < LDBL_MAX; ld *= 1.0000001)
        printf("%08x-%08x-%08x-%08x\r", i[0], i[1], i[2], i[3]);

    return 0;
}

The output, when using long double, looks somewhat like this, with the marked digits being constant, and all others eventually changing as the numbers get bigger and bigger:

5636666b-c03ef3e0-00223fd8-deadbeef
                  ^^       ^^^^^^^^

This suggests that it is not an 80 bit number. An 80-bit number has 18 hex digits. I see 22 hex digits changing, which looks much more like a 96 bits number (24 hex digits). It also isn't a 128 bit number since 0xdeadbeef isn't touched, which is consistent with sizeof returning 12.

The output for __float128 looks like it's really just a 128 bit number. All bits eventually flip.

Compiling with -m128bit-long-double does not align long double to 128 bits with a 32-bit zero padding, as indicated by the documentation. It doesn't use __float128 either, but indeed seems to align to 128 bits, padding with the value 0x7ffdd000(?!).

Further, LDBL_MAX seems to work as +inf for both long double and __float128. Adding or subtracting a number like 1.0E100 or 1.0E2000 to/from LDBL_MAX results in the same bit pattern.
Up to now, it was my belief that the foo_MAX constants were to hold the largest representable number that is not +inf (apparently that isn't the case?). I'm also not quite sure how an 80-bit number could conceivably act as +inf for a 128 bit value... maybe I'm just too tired at the end of the day and have done something wrong.

924

asked Nov 22 '12 16:11

Damon

2 Answers

Ad 1.

Those types are designed to work with numbers with a huge dynamic range. The long double is implemented natively in the x87 FPU. I suspect the 128-bit type is implemented in software on modern x86 CPUs, as there is no hardware that performs quad-precision computations.

The funny thing is that it's quite common to do many floating-point operations in a row, with the intermediate results not actually stored in declared variables but kept in FPU registers, taking advantage of the full precision. That's why a comparison like:

double x = sin(0); if (x == sin(0)) printf("Equal!");

is not safe and cannot be guaranteed to work (without additional switches).
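Whether such a comparison can fail depends on how the compiler evaluates intermediate results. Here is a minimal sketch, assuming a C99 compiler that defines FLT_EVAL_METHOD in <float.h>; the function names are made up for illustration and are not part of any library:

```c
#include <float.h>

/* Illustrative helper (hypothetical name).
   FLT_EVAL_METHOD (C99): 0 = evaluate in the declared type,
   1 = float is evaluated as double, 2 = everything is evaluated as
   long double (typical for x87 code), negative = indeterminate. */
int eval_method(void)
{
    return (int)FLT_EVAL_METHOD;
}

/* Illustrative helper (hypothetical name).
   Compares a / b with itself after forcing BOTH results through a
   64-bit store. The volatile objects keep the compiler from holding
   either value in a wider register, so two identically rounded
   doubles are compared. */
int equal_after_store(double a, double b)
{
    volatile double first = a / b;
    volatile double second = a / b;
    return first == second;
}
```

On x86-64, where GCC defaults to SSE math, eval_method() typically reports 0; 32-bit x87 builds typically report 2, which is the situation described above.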

Ad 3.

There's an impact on speed depending on what precision you use. You can change the precision used by the FPU with:

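/* Load a new x87 control word. Bits 8-9 select the precision:
   00 = 24-bit, 10 = 53-bit, 11 = 64-bit mantissa.
   fldcw reads a 16-bit operand, hence unsigned short. */
void
set_fpu (unsigned short mode)
{
  __asm__ volatile ("fldcw %0" : : "m" (mode));
}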

It will be faster for shorter variables, slower for longer ones. 128-bit floats will probably be done in software, so they will be much slower.
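To see which precision the x87 is currently set to, the control word can be read back. This is a sketch only, assuming GCC-style inline assembly on an x86 target; x87_precision_bits is a name invented here:

```c
/* Illustrative helper (hypothetical name).
   Returns the mantissa width selected by the x87 precision-control
   field (bits 8-9 of the control word): 24, 53 or 64.
   Returns 0 for the reserved encoding and -1 on targets where the
   GCC-style x86 inline assembly is not available. */
int x87_precision_bits(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned short cw;

    /* fnstcw stores the 16-bit control word to memory. */
    __asm__ volatile ("fnstcw %0" : "=m" (cw));

    switch ((cw >> 8) & 3) {
    case 0:  return 24;   /* single precision   */
    case 2:  return 53;   /* double precision   */
    case 3:  return 64;   /* extended precision */
    default: return 0;    /* reserved encoding  */
    }
#else
    return -1;
#endif
}
```

On Linux the x87 typically starts out in 64-bit mode, so this would be expected to return 64 there unless something has changed the control word.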

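It's not only about wasted RAM, it's about wasted cache too. With GCC on x86, a long double occupies 12 bytes on 32-bit targets and 16 bytes on 64-bit targets, against 8 bytes for a double, so going from a 64-bit double to the 80-bit type costs 50% to 100% more memory (including cache).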
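The storage cost can be measured rather than estimated. A small sketch (the function names are illustrative only):

```c
#include <float.h>
#include <stddef.h>

/* Illustrative helpers (hypothetical names). */

/* Storage of one array element of each type, in bytes. */
size_t double_bytes(void)      { return sizeof(double); }
size_t long_double_bytes(void) { return sizeof(long double); }

/* Extra memory (and cache) used by long double relative to double,
   in percent. */
unsigned long_double_overhead_percent(void)
{
    return (unsigned)((sizeof(long double) - sizeof(double)) * 100
                      / sizeof(double));
}

/* Mantissa bits actually provided for that storage. */
int long_double_mantissa_bits(void)
{
    return LDBL_MANT_DIG;
}
```

With GCC on Linux, sizeof(long double) is 12 on i386 and 16 on x86-64, while the mantissa stays at 64 bits in both cases; the additional bytes are padding.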

Ad 4.

On the other hand, I understand that the long double type is mutually exclusive with -mfpmath=sse, as there is no such thing as "extended precision" in SSE. __float128, on the other hand, should work just perfectly fine with SSE math (though in absence of quad precision instructions certainly not on a 1:1 instruction base). Am I right under these assumptions?

The FPU and SSE units are totally separate. You can write code that uses the FPU at the same time as SSE. The question is what the compiler will generate if you constrain it to use only SSE. Will it try to use the FPU anyway? I've been doing some programming with SSE, and GCC will generate only scalar (SISD) instructions on its own; you have to help it to use the SIMD versions. __float128 will probably work on every machine, even the 8-bit AVR uC. It's just fiddling with bits after all.

An 80-bit value is actually 20 hex digits in hex representation. Maybe the bits which are not used are left over from some old operation? On my machine, I compiled your code and only 20 hex digits change in long double mode: 66b4e0d2-ec09c1d5-00007ffe-deadbeef
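One way to check how many bytes a long double store really touches, without the int* aliasing from the question, is to pre-fill the object with two different byte patterns and see which bytes survive. This is only a sketch; the names are made up, and the contents of padding bytes are unspecified, so compilers are free to behave differently:

```c
#include <stddef.h>

/* Illustrative helper (hypothetical name).
   Fills a long double object with `fill`, stores a value into it and
   copies the resulting bytes to `out`. The union is volatile so that
   neither the fill nor the store can be optimized away. */
static void store_over_fill(unsigned char fill, unsigned char *out)
{
    volatile union {
        long double value;
        unsigned char bytes[sizeof(long double)];
    } u;
    size_t i;

    for (i = 0; i < sizeof u.bytes; i++)
        u.bytes[i] = fill;
    u.value = 1.0L / 3.0L;
    for (i = 0; i < sizeof u.bytes; i++)
        out[i] = u.bytes[i];
}

/* Illustrative helper (hypothetical name).
   Bytes that are identical after both runs were written by the
   store; bytes that still show the two different fill patterns are
   padding that the store left alone. */
size_t long_double_bytes_written(void)
{
    unsigned char a[sizeof(long double)], b[sizeof(long double)];
    size_t i, written = 0;

    store_over_fill(0xAA, a);
    store_over_fill(0x55, b);
    for (i = 0; i < sizeof a; i++)
        if (a[i] == b[i])
            written++;
    return written;
}
```

On x86 the expected result is 10 bytes, i.e. 80 bits or 20 hex digits, with the remaining sizeof(long double) - 10 bytes being padding. Since padding contents are unspecified, different compilers or optimization levels may leave different values there, which might explain dumps that differ from machine to machine.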

The 128-bit version has all the bits changing. Looking at the objdump output, it appears to use software emulation: there are almost no FPU instructions.
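The precision of both types can also be measured directly. The sketch below assumes a compiler that defines the __SIZEOF_FLOAT128__ macro when __float128 is available (GCC on x86 does); quad_t and the function names are invented for this example:

```c
#include <float.h>

/* Illustrative typedef (hypothetical name). */
#ifdef __SIZEOF_FLOAT128__
typedef __float128 quad_t;   /* GCC extension, where available */
#else
typedef long double quad_t;  /* fallback so the sketch still compiles */
#endif

/* Illustrative helper (hypothetical name).
   Counts mantissa bits by halving eps until 1 + eps rounds back to 1.
   The volatile store forces the sum to be rounded to the type. */
int long_double_digits(void)
{
    volatile long double sum;
    long double eps = 1.0L;
    int digits = 0;

    for (;;) {
        sum = 1.0L + eps;
        if (sum == 1.0L)
            break;
        eps /= 2;
        digits++;
    }
    return digits;
}

/* Same measurement for the quad type. */
int quad_digits(void)
{
    volatile quad_t sum;
    quad_t eps = 1;
    int digits = 0;

    for (;;) {
        sum = 1 + eps;
        if (sum == 1)
            break;
        eps /= 2;
        digits++;
    }
    return digits;
}
```

With GCC on x86-64 these are expected to return 64 and 113 respectively. In the generated assembly, quad_digits typically turns into calls to libgcc soft-float routines such as __addtf3 and __divtf3 rather than FPU instructions, which matches the objdump observation above.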

Further, LDBL_MAX seems to work as +inf for both long double and __float128. Adding or subtracting a number like 1.0E100 or 1.0E2000 to/from LDBL_MAX results in the same bit pattern. Up to now, it was my belief that the foo_MAX constants were to hold the largest representable number that is not +inf (apparently that isn't the case?).

This seems to be strange...

I'm also not quite sure how an 80-bit number could conceivably act as +inf for a 128-bit value... maybe I'm just too tired at the end of the day and have done something wrong.

It's probably being extended. The pattern which is recognized to be +inf in 80-bit is translated to +inf in 128-bit float too.
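The LDBL_MAX observation quoted above is also consistent with ordinary rounding, without LDBL_MAX being infinite. A minimal sketch (function names are illustrative, not from any library):

```c
#include <float.h>

/* Illustrative helper (hypothetical name).
   LDBL_MAX is the largest finite value: it does not exceed itself,
   while doubling it overflows to +inf, which compares greater than
   every finite value. */
int ldbl_max_is_finite(void)
{
    volatile long double max = LDBL_MAX;
    volatile long double doubled = max * 2.0L;
    return max <= LDBL_MAX && doubled > LDBL_MAX;
}

/* Illustrative helper (hypothetical name).
   Adding a comparatively tiny number such as 1e100 cannot change
   LDBL_MAX: the addend is far below half a unit in the last place,
   so the sum rounds back to the same bit pattern. */
int small_addend_is_absorbed(void)
{
    volatile long double max = LDBL_MAX;
    volatile long double sum = max + 1.0e100L;
    return sum == max;
}
```

Both functions return 1 under the default rounding mode. Note that a literal such as 1.0E2000 needs the L suffix to be a long double constant; without it, it is a double constant that is out of range for double.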

106

answered Oct 09 '22 21:10

Caladan

IEEE-754 defined 32-bit and 64-bit floating-point representations for the purpose of efficient data storage, and an 80-bit representation for the purpose of efficient computation. The intention was that given float f1,f2; double d1,d2; a statement like d1=f1+f2+d2; would be executed by converting the arguments to 80-bit floating-point values, adding them, and converting the result back to a 64-bit floating-point type. This would offer three advantages compared with performing operations on other floating-point types directly:

1. While separate code or circuitry would be required for conversions to/from 32-bit types and 64-bit types, it would be necessary to have only one "add" implementation, one "multiply" implementation, one "square root" implementation, etc.
2. Although in rare cases using an 80-bit computational type could yield results that were very slightly less accurate than using other types directly (worst-case rounding error is 513/1024ulp in cases where computations on other types would yield an error of 511/1024ulp), chained computations using 80-bit types would frequently be more accurate--sometimes much more accurate--than computations using other types.
3. On a system without an FPU, separating a double into a separate exponent and mantissa before performing computations, normalizing a mantissa, and converting a separate mantissa and exponent into a double, are somewhat time consuming. If the result of one computation will be used as input to another and discarded, using an unpacked 80-bit type will allow these steps to be omitted.
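The "widen, compute, round once" scheme described above can be sketched in C as follows. This is only an illustration under the assumption that long double is wider than double on the target; the function names are invented:

```c
#include <float.h>

/* Illustrative helper (hypothetical name).
   a + b + c with the intermediate rounded to a 64-bit double. */
double sum3_narrow(double a, double b, double c)
{
    volatile double partial = a + b;   /* forced 64-bit store */
    return partial + c;
}

/* Illustrative helper (hypothetical name).
   Widen first, add at long double precision, round once at the end. */
double sum3_wide(double a, double b, double c)
{
    long double partial = (long double)a + (long double)b;
    return (double)(partial + (long double)c);
}

/* 1 when long double really has more mantissa bits than double. */
int long_double_is_wider(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}
```

For example, with a = 0x1p60, b = 1.0 and c = -0x1p60, sum3_narrow returns 0.0 because 2^60 + 1 does not fit into a 53-bit mantissa, whereas sum3_wide returns 1.0 on targets where long double has a 64-bit (or larger) mantissa.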

In order for this approach to floating-point math to be useful, however, it is imperative that it be possible for code to store intermediate results with the same precision as would be used in computation, such that temp = d1+d2; d4=temp+d3; will yield the same result as d4=d1+d2+d3;. From what I can tell, the purpose of long double was to be that type. Unfortunately, even though K&R designed C so that all floating-point values would be passed to variadic methods the same way, ANSI C broke that. In C as originally designed, given the code float v1,v2; ... printf("%12.6f", v1+v2);, the printf method wouldn't have to worry about whether v1+v2 would yield a float or a double, since the result would get coerced to a known type regardless. Further, even if the type of v1 or v2 changed to double, the printf statement wouldn't have to change.

ANSI C, however, requires that code which calls printf must know which arguments are double and which are long double; a lot--if not a majority--of code which uses long double but was written on platforms where it's synonymous with double fails to use the correct format specifiers for long double values. Rather than having long double be an 80-bit type except when passed as a variadic method argument, in which case it would be coerced to 64 bits, many compilers decided to make long double be synonymous with double and not offer any means of storing the results of intermediate computations. Since using an extended precision type for computation is only good if that type is made available to the programmer, many people came to regard extended precision as evil even though it was only ANSI C's failure to handle variadic arguments sensibly that made it problematic.
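To make the format-specifier point concrete, here is a small sketch using only standard C library calls; formats_agree is a name chosen for this example:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative helper (hypothetical name).
   Formats the same value as float, double and long double.
   A float argument is promoted to double in a variadic call, so it
   shares "%f" with double; long double is passed as itself and needs
   the L length modifier ("%Lf").
   Returns 1 when all three texts are identical. */
int formats_agree(void)
{
    char as_float[32], as_double[32], as_long_double[32];
    float f = 1.5f;
    double d = 1.5;
    long double ld = 1.5L;

    snprintf(as_float, sizeof as_float, "%12.6f", f);
    snprintf(as_double, sizeof as_double, "%12.6f", d);
    snprintf(as_long_double, sizeof as_long_double, "%12.6Lf", ld);

    return strcmp(as_float, as_double) == 0
        && strcmp(as_double, as_long_double) == 0;
}
```

Passing ld with "%f", or d with "%Lf", is undefined behaviour. GCC's -Wformat warning (part of -Wall) typically reports such mismatches when the format string is a literal.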

PS--The intended purpose of long double would have benefited if there had also been a long float which was defined as the type to which float arguments could be most efficiently promoted; on many machines without floating-point units that would probably be a 48-bit type, but the optimal size could range anywhere from 32 bits (on machines with an FPU that does 32-bit math directly) up to 80 (on machines which use the design envisioned by IEEE-754). Too late now, though.

answered Oct 09 '22 22:10

supercat
