What's the point of float_t and when should it be used?

I'm working with a client who uses an old version of GCC (3.2.3, to be precise) but wants to upgrade. One reason given as a stumbling block to upgrading is a difference in the size of the type float_t, which, sure enough, is real:

On GCC 3.2.3

    sizeof(float_t)  = 12
    sizeof(float)    = 4
    sizeof(double_t) = 12
    sizeof(double)   = 8

On GCC 4.1.2

    sizeof(float_t)  = 4
    sizeof(float)    = 4
    sizeof(double_t) = 8
    sizeof(double)   = 8

But what is the reason for this difference? Why did the size get smaller, and when should and shouldn't you use float_t or double_t?

Component 10 asked Mar 22 '11 10:03




2 Answers

The "why" is that some compilers return floating-point values in a floating-point register. These registers have only one size. For example, on x86 the x87 registers are 80 bits wide. The result of a function that returns a floating-point value will be placed into this register regardless of whether the type has been declared as float, double, float_t or double_t. If the size of the return value and the size of the floating-point register differ, then at some point an instruction will be required to narrow the value to the desired size.

The same kind of conversion is necessary for integers as well, but for subsequent additions and subtractions there is no overhead, because there are instructions to pick which bytes to involve in the operation. The rules for conversion of integers to a smaller size specify that the most significant bits be tossed away, so the result of downsizing can produce a result that is radically different (e.g. (short)(2147450880) --> -32768), but for some reason that seems to be OK with the programming community.
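The integer behaviour described above can be sketched in C. This is only an illustration: the helper name truncate_to_16 is made up for this example, and it assumes a 16-bit two's-complement target, which is what (short) gives on common platforms. The C standard itself leaves the result of an out-of-range signed conversion implementation-defined, so the sketch does the bit manipulation explicitly.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 16 bits of a 32-bit value and
 * reinterpret them as a signed two's-complement number. This mirrors what
 * (short)x typically does when short is 16 bits wide. */
static int16_t truncate_to_16(int32_t x)
{
    /* Unsigned arithmetic is fully defined, so mask there. */
    uint16_t low = (uint16_t)((uint32_t)x & 0xFFFFu);

    /* Map 0x8000..0xFFFF onto -32768..-1 without relying on
     * implementation-defined signed conversion. */
    if (low & 0x8000u)
        return (int16_t)((int32_t)low - 65536);
    return (int16_t)low;
}
```

Calling truncate_to_16(2147450880) yields -32768, because 2147450880 is 0x7FFF8000 and only the 0x8000 part survives.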

When a floating-point value is narrowed, the result is specified to be rounded to the closest representable number. If integers were subject to the same rule, the above example would instead saturate: (short)(2147450880) -> +32767. Obviously, a little more logic is required to perform such an operation than mere truncation of the upper bits. With floating-point, both the exponent and the significand change size between float, double and long double, so it is more complicated. Additionally, conversions involving infinity, NaN, normalized numbers, and denormalized numbers need to be taken into account.

Hardware can implement these conversions in the same amount of time as an integer addition, but if the conversion needs to be implemented in software, it may take 20 instructions, which can have a noticeable effect on performance. Since the C programming model assures that the same results be generated regardless of whether floating-point is implemented in hardware or software, the software is obliged to execute these extra instructions in order to comply with the computational model. The float_t and double_t types were designed to expose the most efficient return value type.
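The contrast between discarding bits and rounding to the closest representable value might look like this in C. The names saturate_to_16 and narrow_to_float are invented for the illustration, and the floating-point part assumes the default round-to-nearest mode and a compiler that honours the cast (for instance, SSE arithmetic on x86-64).

```c
#include <stdint.h>

/* Hypothetical integer analogue of "round to the closest representable
 * value": out-of-range inputs clamp to the nearest end of the range. */
static int16_t saturate_to_16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Narrowing a double to float rounds to the nearest float value
 * (assuming the default rounding mode is in effect). */
static float narrow_to_float(double x)
{
    return (float)x;
}
```

With these helpers, saturate_to_16(2147450880) gives 32767 rather than -32768, and narrow_to_float(16777217.0) gives 16777216.0f, since 2^24 + 1 has no exact float representation.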

The compiler defines FLT_EVAL_METHOD, which specifies how much precision is used in intermediate computations. With integers, the rule is to do intermediate computations using the highest precision of the operands involved. This would correspond to FLT_EVAL_METHOD==0. However, the original K&R specified that all intermediate computations be done in double, thus yielding FLT_EVAL_METHOD==1. Later, with the introduction of the IEEE floating-point standard, it became commonplace on some platforms, notably the Macintosh PowerPC and Windows x86, to perform intermediate computations in long double (80 bits), thus yielding FLT_EVAL_METHOD==2.
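A small sketch for inspecting the model a given compiler reports. FLT_EVAL_METHOD comes from <float.h> and float_t/double_t from <math.h> (both C99); the function names here are made up for the example. Under this scheme a 12-byte float_t on 32-bit x86 would be consistent with float_t being long double, i.e. FLT_EVAL_METHOD==2.

```c
#include <float.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: describe an evaluation-method value in words. */
static const char *eval_method_name(int method)
{
    switch (method) {
    case 0:  return "each operation evaluated in its own type";
    case 1:  return "float and double evaluated as double";
    case 2:  return "everything evaluated as long double";
    case -1: return "indeterminable";
    default: return "implementation-defined";
    }
}

/* Sizes of the types <math.h> selected for the current model. */
static size_t float_t_size(void)  { return sizeof(float_t); }
static size_t double_t_size(void) { return sizeof(double_t); }
```

Printing eval_method_name(FLT_EVAL_METHOD) together with the two sizes under both the old and the new compiler should show whether the evaluation model is what changed.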

Regression testing will be affected by the FLT_EVAL_METHOD computational model. Thus, your regression code should take this into account. One way is to test FLT_EVAL_METHOD and have different branches for each model. A similar method would be to test sizeof(float_t), and have different branches. A third method would be to use some kind of epsilon that would be used to check whether the results are close enough.
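Two of these options might be sketched as follows. This is an assumption-laden illustration rather than a recommended test harness: nearly_equal, magnitude and evaluates_in_declared_type are hypothetical names, and the relative-tolerance comparison is just one common way to define "close enough".

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Absolute value without calling into libm. */
static double magnitude(double x)
{
    return x < 0 ? -x : x;
}

/* Hypothetical comparator: true when a and b differ by at most rel_eps
 * relative to the larger of the two magnitudes. */
static bool nearly_equal(double a, double b, double rel_eps)
{
    double diff  = magnitude(a - b);
    double scale = magnitude(a) > magnitude(b) ? magnitude(a) : magnitude(b);
    return diff <= rel_eps * scale;
}

/* True when the compiler reports that expressions are evaluated in the
 * type they are declared with, so a test can pick the matching set of
 * expected values. */
static bool evaluates_in_declared_type(void)
{
    return FLT_EVAL_METHOD == 0;
}
```

A regression test could then select its expected values with evaluates_in_declared_type() and fall back to nearly_equal(actual, expected, 1e-9) where exact equality is not required.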

Unfortunately, there are some computations that make a decision based on the results of a computation, resulting in a true or false, which cannot be resolved by using an epsilon. This occurs in computer graphics, for example, to decide whether a point is inside or outside a polygon, which determines whether a particular pixel should be filled. If your regression involves one of these, you cannot use the epsilon method, and must use different branches depending on the computational model.

Another way to resolve the decision regression between models is to cast the result explicitly to a particular desired precision. This works most of the time on many compilers, but some compilers think that they are smarter than you and refuse to do the conversion. This happens when an intermediate result is stored in a register and then used in a subsequent computation. You can cast away precision as much as you want in the intermediate result, but the compiler will do nothing unless you declare the intermediate result as volatile. This forces the compiler to narrow the intermediate result, store it in a variable of the specified size in memory, and retrieve it when it is needed for computation.

The IEEE floating-point standard is exact for elementary operations (+ - * /) and square root. I believe that sin(), cos(), exp(), log(), etc. are specified to be within 2 ULP (units in the last place) of the closest numerically representable result. The long double (80-bit) format was designed to allow computation of those other transcendental functions exactly to the closest numerically representable result.
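The volatile approach for forcing an intermediate result down to a given precision could be sketched like this. The helper names round_to_float and round_to_double are invented for the example; the sketch assumes the default round-to-nearest mode.

```c
/* Hypothetical helper: storing through a volatile object obliges the
 * compiler to write a float-sized value to memory and read it back,
 * rather than keeping a wider value in a register. */
static float round_to_float(double x)
{
    volatile float stored = (float)x;
    return stored;
}

/* Same idea for narrowing a long double intermediate to double. */
static double round_to_double(long double x)
{
    volatile double stored = (double)x;
    return stored;
}
```

Wrapping an intermediate expression, as in round_to_double(a * b) + c, makes the narrowing explicit at that point, at the cost of a store and a load.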

This covers a lot of the issues brought up (and implied) in this thread, but does not answer the question of when you should use the float_t and double_t types. Obviously, you need to do so when interfacing to an API that uses these types, especially when passing the address of one of these types.

If your prime concern is about performance, then you might want to consider using the float_t and double_t types in your computations and APIs. But it is most probable that the performance increase that you get is neither measurable nor noticeable.
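As a rough illustration of that style, the sketch below keeps a running sum in float_t and narrows only once at the end. The function dot_product is a made-up example, and whether this is any faster than a plain float accumulator depends entirely on the compiler and target.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical example: accumulate in float_t, which <math.h> defines as
 * float, double or long double depending on the evaluation model. */
static float dot_product(const float *a, const float *b, size_t n)
{
    float_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (float_t)a[i] * (float_t)b[i];
    return (float)sum;  /* narrow once, at the API boundary */
}
```

Because the width of float_t can differ between compilers, the low-order bits of the result may differ too, which is exactly the regression concern raised above.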

However, if you are concerned about regression between different compilers and different machines, you should probably avoid these types as much as possible, and use casting liberally to assure cross-platform compatibility.

Reality Pixels answered Oct 21 '22 18:10



The reason for float_t is that, for some processors and compilers, using a larger type (e.g. long double) in place of float can be more efficient, so float_t lets the compiler use the larger type instead of float.

Thus, in the OP's case, the change in the size of float_t is something the standard allows for. If the original code wanted the smaller float size, it should have used float.

There is some rationale in the open-std doc:

"For example, the type definitions float_t and double_t (defined in <math.h>) are intended to allow effective use of architectures with more efficient, wider formats."

mmmmmm answered Oct 21 '22 20:10
