I'm working with a client who is using an old version of GCC (3.2.3 to be precise) but wants to upgrade and one reason that's been given as stumbling block to upgrading to a newer version is differences in the size of type <code>float_t</code> which, sure enough is correct: On GCC 3.2.3 <pre class="prettyprint"><code>sizeof(float_t) = 12 sizeof(float) = 4 sizeof(double_t) = 12 sizeof(double) = 8 </code></pre> On GCC 4.1.2 <pre class="prettyprint"><code>sizeof(float_t) = 4 sizeof(float) = 4 sizeof(double_t) = 8 sizeof(double) = 8 </code></pre> but what's the reason for this difference? Why did the size get smaller and when should and shouldn't you use <code>float_t</code> or <code>double_t</code> ?

The "why" is that some compilers will return floating point values in a floating-point register. These registers have only one size. For example, on X86, it is 80 bits wide. The results of a function that returns a floating point value will be placed into this register regardless of whether the type has been declared as float, double, float_t or double_t. If the size of the return value and the size of the floating-point register differ, then at some point an instruction will be required to round down to the desired size. The same kind of conversion is necessary for integers as well, but for subsequent additions and subtractions there is no overhead, because there are instructions to pick which bytes to involve in the operation. The rules for conversion of integers to a smaller size specify that the most significant bits be tossed away, so the result of downsizing can produce a result that is radically different (e.g. (short)(2147450880) --> -32768), but for some reason that seems to be OK with the programming community. In doing a floating-point downsizing, the result is specified to be rounded to the closest representable number. If integers were subject to the same rules, then the above example would truncate thusly (short)(2147450880) -> +32767. Obviously a little more logic is required to perform such an operation that mere truncation of the upper bits. With floating-point, the exponent and the significand change sizes between float, double and long double, so it is more complicated. Additionally, there are issues of conversion between infinity, NaN, normalized numbers, and renormalized numbers that need to be taken into account. Hardware can implement these conversions in the same amount of time as an integer addition, but if the conversion needs to be implemented in software, it may take 20 instructions, which can have a noticeable effect on performance. Since the C programming model assures that the same results be generated regardless of whether the floating-point is implemented in hardware or software, the software is obliged to execute these extra instructions in order to comply with the computational model. The float_t and double_t types were designed to expose the most efficient return value type. The compiler defines a FLT_EVAL_METHOD, which specifies how much precision is to be used in the intermediate computations. With integers, the rule is to do intermediate computations using the highest precision of the operands involved. This would correspond to a FLT_EVAL_METHOD==0. However, the original K&R specified that all intermediate computations be done in double, thus yielding FLT_EVAL_METHOD==1. However, with the introduction of the IEEE floating-point standard, it became commonplace on some platforms, notably the Macintosh PowerPC and Windows X86 to perform intermediate computations in long double -- 80 bits, thus yielding FLT_EVAL_METHOD==2. Regression testing will be affected by the FLT_EVAL_METHOD computational model. Thus, your regression code should take this into account. One way is to test FLT_EVAL_METHOD and have different branches for each model. A similar method would be to test sizeof(float_t), and have different branches. A third method would be to use some kind of epsilon that would be used to check whether the results are close enough. Unfortunately, there are some computations that make a decision based on the results of a computation, resulting in a true or false, which cannot be resolved by using an epsilon. This occurs in computer graphics, for example, to decide whether a point is inside or outside a polygon, which determines whether a particular pixel should be filled. If your regression involves one of these, you cannot use the epsilon method, and must use different branches depending on the computational model. Another way to resolve the decision regression between models is to cast the result explicitly to a particular desired precision. This works most of the time on many compilers, but some compilers think that they are smarter than you, and refuse to do the conversion. This happens in the case where an intermediate result is stored in a register, but is used in a subsequent computation. You can cast away precision as much as you want in the intermediate result, but the compiler will do nothing -- unless you declare the intermediate result as volatile. This then forces the compiler to downsize and store the intermediate result in a variable of the specified size in memory, then to retrieve it when needed for computation. The IEEE floating point standard is exact for elementary operations (+-*/) and square root. I believe that sin(), cos(), exp(), log(), etc. are specified to be within 2 ULP (units in the least significant position) of the closest numerically-representable result. The long double (80 bit) format was designed to allow computation of those other transcendental functions exactly to the closest numerically-represenatble result. This covers a lot of the issues brought up (and implied) in this thread, but does not answer the question of when you should use the float_t and double_t types. Obviously, you need to do so when interfacing to an API that uses these types, especially when passing the address of one of these types. If your prime concern is about performance, then you might want to consider using the float_t and double_t types in your computations and APIs. But it is most probable that the performance increase that you get is neither measurable nor noticeable. However, if you are concerned about regression between different compilers and different machines, you should probably avoid these types as much as possible, and use casting liberally to assure cross-platform compatibility.

What's the point of float_t and when should it be used?

Tags:

I'm working with a client who is using an old version of GCC (3.2.3 to be precise) but wants to upgrade and one reason that's been given as stumbling block to upgrading to a newer version is differences in the size of type float_t which, sure enough is correct:

On GCC 3.2.3

sizeof(float_t) = 12 sizeof(float) = 4 sizeof(double_t) = 12 sizeof(double) = 8

On GCC 4.1.2

sizeof(float_t) = 4 sizeof(float) = 4 sizeof(double_t) = 8 sizeof(double) = 8

but what's the reason for this difference? Why did the size get smaller and when should and shouldn't you use float_t or double_t ?

459

asked Mar 22 '11 10:03

Component 10

2 Answers

The "why" is that some compilers will return floating point values in a floating-point register. These registers have only one size. For example, on X86, it is 80 bits wide. The results of a function that returns a floating point value will be placed into this register regardless of whether the type has been declared as float, double, float_t or double_t. If the size of the return value and the size of the floating-point register differ, then at some point an instruction will be required to round down to the desired size.

The same kind of conversion is necessary for integers as well, but for subsequent additions and subtractions there is no overhead, because there are instructions to pick which bytes to involve in the operation. The rules for conversion of integers to a smaller size specify that the most significant bits be tossed away, so the result of downsizing can produce a result that is radically different (e.g. (short)(2147450880) --> -32768), but for some reason that seems to be OK with the programming community.

In doing a floating-point downsizing, the result is specified to be rounded to the closest representable number. If integers were subject to the same rules, then the above example would truncate thusly (short)(2147450880) -> +32767. Obviously a little more logic is required to perform such an operation that mere truncation of the upper bits. With floating-point, the exponent and the significand change sizes between float, double and long double, so it is more complicated. Additionally, there are issues of conversion between infinity, NaN, normalized numbers, and renormalized numbers that need to be taken into account. Hardware can implement these conversions in the same amount of time as an integer addition, but if the conversion needs to be implemented in software, it may take 20 instructions, which can have a noticeable effect on performance. Since the C programming model assures that the same results be generated regardless of whether the floating-point is implemented in hardware or software, the software is obliged to execute these extra instructions in order to comply with the computational model. The float_t and double_t types were designed to expose the most efficient return value type.

The compiler defines a FLT_EVAL_METHOD, which specifies how much precision is to be used in the intermediate computations. With integers, the rule is to do intermediate computations using the highest precision of the operands involved. This would correspond to a FLT_EVAL_METHOD==0. However, the original K&R specified that all intermediate computations be done in double, thus yielding FLT_EVAL_METHOD==1. However, with the introduction of the IEEE floating-point standard, it became commonplace on some platforms, notably the Macintosh PowerPC and Windows X86 to perform intermediate computations in long double -- 80 bits, thus yielding FLT_EVAL_METHOD==2.

Regression testing will be affected by the FLT_EVAL_METHOD computational model. Thus, your regression code should take this into account. One way is to test FLT_EVAL_METHOD and have different branches for each model. A similar method would be to test sizeof(float_t), and have different branches. A third method would be to use some kind of epsilon that would be used to check whether the results are close enough.

Unfortunately, there are some computations that make a decision based on the results of a computation, resulting in a true or false, which cannot be resolved by using an epsilon. This occurs in computer graphics, for example, to decide whether a point is inside or outside a polygon, which determines whether a particular pixel should be filled. If your regression involves one of these, you cannot use the epsilon method, and must use different branches depending on the computational model.

Another way to resolve the decision regression between models is to cast the result explicitly to a particular desired precision. This works most of the time on many compilers, but some compilers think that they are smarter than you, and refuse to do the conversion. This happens in the case where an intermediate result is stored in a register, but is used in a subsequent computation. You can cast away precision as much as you want in the intermediate result, but the compiler will do nothing -- unless you declare the intermediate result as volatile. This then forces the compiler to downsize and store the intermediate result in a variable of the specified size in memory, then to retrieve it when needed for computation. The IEEE floating point standard is exact for elementary operations (+-*/) and square root. I believe that sin(), cos(), exp(), log(), etc. are specified to be within 2 ULP (units in the least significant position) of the closest numerically-representable result. The long double (80 bit) format was designed to allow computation of those other transcendental functions exactly to the closest numerically-represenatble result.

This covers a lot of the issues brought up (and implied) in this thread, but does not answer the question of when you should use the float_t and double_t types. Obviously, you need to do so when interfacing to an API that uses these types, especially when passing the address of one of these types.

If your prime concern is about performance, then you might want to consider using the float_t and double_t types in your computations and APIs. But it is most probable that the performance increase that you get is neither measurable nor noticeable.

However, if you are concerned about regression between different compilers and different machines, you should probably avoid these types as much as possible, and use casting liberally to assure cross-platform compatibility.

answered Oct 21 '22 18:10

Reality Pixels

The reason for float_t is that for some processors and compilers using a larger type e.g. long double for float could be more efficient and so the float_t allows the compiler to use the larger type instead of float.

thus in the OPs case using float_t the change in size is what the standard allows for. If the original code wanted to use the smaller float sizes it should be using float.

There is some rationale in open-std doc

for example the type definitions float_t and double_t (defined in <math.h>), are intended to allow effective use of architectures with more efficient, wider formats. Annexes

199

answered Oct 21 '22 20:10

mmmmmm

Related questions
                            
                                How to differentiate between 'Enter' and 'Return' keys in Javascript?
                            
                                SqlAlchemy: case statement (case - if - then -else)
                            
                                Ruby Koans: explicit scoping on a class definition part 2
                            
                                Service already exists (when it clearly doesn't)
                            
                                Tell gdb to skip standard files
                            
                                Can we join two tables in postgresql if the tables are in different schemas?
                            
                                How do I add the MinGW bin directory to my system path?
                            
                                ios: best way to display variable-length, multi-line text
                            
                                PHP, MySQL and Time Zones
                            
                                Return in the Finally Block... Why not?
                            
                                generate Class diagram for iphone app
                            
                                MSBuild project file: Copy item to specific location in output directory

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With