
Multiplying floats and keep/get double precision accuracy

I have a function that takes floats, I'm doing some computation with them, and I'd like to keep as much accuracy as possible in the returned result. I read that when you multiply two floats, you double the number of significant digits.

So when two floats get multiplied, for example float e, f; and I do double g = e * f, when do the bits get truncated?

In my example function below, do I need casting, and if so, where? This is in a tight inner loop: if I put static_cast<double>(x) around each of the variables a, b, c, d wherever it's used, I get a 5-10% slowdown. But I suspect I don't need to cast each variable separately, only in some locations, if at all. Or does returning a double here not give me any gain anyway, so that I might as well just return a float?

double func(float a, float b, float c, float d) {
    return (a - b) * c + (a - c) * b;
}
Ela782 asked Sep 11 '16 13:09




1 Answer

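When you multiply two floats without casting, the product is computed in float precision (i.e. the extra bits are rounded away) and only then converted to double, so the conversion cannot bring the lost bits back. This assumes the usual case where the compiler evaluates float expressions as float (FLT_EVAL_METHOD == 0); the standard also allows intermediate results to be kept in higher precision.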
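To make the "when do the bits get lost" part concrete, here is a minimal sketch of the two variants of a single multiplication. The function names and the test value are made up for illustration, and it assumes FLT_EVAL_METHOD == 0 as on typical x86-64 builds.

```cpp
#include <limits>

// Product evaluated as float, then widened: the low bits are already gone
// by the time the value becomes a double.
// (Assumes float expressions are evaluated as float, FLT_EVAL_METHOD == 0.)
double product_in_float(float e, float f) {
    double g = e * f;
    return g;
}

// One operand widened first: the other is converted implicitly and the
// multiplication itself is carried out in double.
double product_in_double(float e, float f) {
    double g = static_cast<double>(e) * f;
    return g;
}

// Hypothetical test value: the next float after 1.0, i.e. 1 + 2^-23.
inline float just_above_one() {
    return 1.0f + std::numeric_limits<float>::epsilon();
}
```

A float has a 24-bit significand, so the exact product of two floats needs at most 48 bits and fits into the 53-bit significand of a double. That is the sense in which multiplying "doubles the digits": product_in_double keeps all of them, while product_in_float has already rounded them to 24 bits.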

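To calculate the result in double, you need to cast at least one operand of each operation to double first; the other operand is then converted implicitly. A cast only affects the operations it feeds into, so in your function it is enough to cast a in both (a - b) and (a - c): every later multiplication and the final addition then have a double operand and are done in double. Casting only c in the first product, by contrast, would still leave (a - b) computed in float. However, that will probably cause much the same slowdown, because the other values still get converted, just implicitly. The slowdown is likely because converting a number from float to double is not entirely free (different bit size, exponent range and mantissa width).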
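As a sketch of where the casts would go in the function from the question (the names func_float and func_double are mine, and the float variant again assumes FLT_EVAL_METHOD == 0):

```cpp
// The function as posted: every operation is a float operation and only
// the final result is widened to double.
double func_float(float a, float b, float c, float /* d: unused in the question */) {
    return (a - b) * c + (a - c) * b;
}

// Casting `a` once in each difference is enough: each difference becomes a
// double, so both multiplications and the final addition are done in double.
double func_double(float a, float b, float c, float /* d: unused in the question */) {
    return (static_cast<double>(a) - b) * c + (static_cast<double>(a) - c) * b;
}
```

An input chosen to show the difference: with a = 16777216 (2^24), b = -1 and c = 1 the exact result is 2. func_double returns 2, whereas func_float returns 1, because a - b = 2^24 + 1 is not representable as a float and gets rounded to 2^24 before the multiplication.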

If I were doing this and had control over the function definition, I'd pass all the arguments as double. I generally use double everywhere: on modern computers the speed difference between calculating in float and in double is usually negligible for scalar code, and the only real issues tend to be memory throughput and cache performance when operating on large arrays of values.

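By the way, the operation that matters most for precision here isn't the multiplication but the addition/subtraction; that is where the extra precision can make a big difference. Consider adding 1e+6 and 1e-3: a float carries only about 7 significant decimal digits, so in float the 1e-3 is lost completely, while in double it is kept.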
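A small sketch of that addition case, with illustrative function names of my own choosing:

```cpp
// Sum computed and stored as float. Near 1e6 neighbouring floats are
// 0.0625 apart, so adding 1e-3 rounds straight back to the larger operand.
float add_in_float(float big, float tiny) {
    float sum = big + tiny;
    return sum;
}

// One operand widened first: the addition is done in double, which has
// enough significand bits to keep the small term.
double add_in_double(float big, float tiny) {
    return static_cast<double>(big) + tiny;
}
```

Calling add_in_float(1e6f, 1e-3f) gives back exactly 1e6f, while add_in_double(1e6f, 1e-3f) is slightly larger than 1e6.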

EmDroid answered Sep 22 '22 01:09
