I don't want to introduce floating point when an inexact value would be a disaster, so I have a couple of questions about when you can actually use them safely.
Are they exact for integers as long as you don't overflow the number of significant digits? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
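For what it's worth, I know the bit-count check can be done without floating point at all; something like this integer-only fallback is what I'd use otherwise (assuming v is non-negative):

int bits_needed(unsigned long v)
{
    /* Count how many bits are needed to represent v. */
    int bits = 0;
    while (v != 0) {
        v >>= 1;
        bits++;
    }
    return bits;
}

if (bits_needed((unsigned long) v) > 16) ... /* true iff v doesn't fit in 16 bits */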
I guess you can summarize this question as: 1) Can floating-point types hold the exact value of all integers up to the number of significant digits given in float.h? 2) Do all floating-point operators and functions guarantee that the result is the closest to the actual mathematical result?
I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
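One consequence: when the operands and the mathematical result are all integers small enough to fit in the significand, these operations are exact, so the comparisons from the question hold. A minimal, checkable sketch (the assert calls are mine, just to make the claims testable):

#include <assert.h>
#include <math.h>

int main(void)
{
    double d = 2.0;
    /* Exact: operands and results are small integers. */
    assert(d + 3.0 == 5.0);
    assert(d * 3.0 == 6.0);
    /* Exact: 100 is a perfect square, and sqrt is correctly rounded. */
    assert(sqrt(100.0) == 10.0);
    return 0;
}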
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
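If you must test such an exact case anyway, a common defensive pattern is to snap the computed result to the nearest integer and compare against that, rather than using == directly; a sketch of the idea (lround is C99, and the tolerance is an arbitrary judgment call, not a guarantee):

#include <math.h>

double d = 100.0;
double l = log10(d);
long n = lround(l);  /* nearest integer to the computed log */
/* Accept n only if the computed value is very close to it. */
if (fabs(l - (double) n) < 1e-9 && n == 2) ...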
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs, and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
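A few checkable instances of those exactness guarantees (the specific values are mine, chosen for illustration):

#include <assert.h>
#include <math.h>

int main(void)
{
    /* ceil and floor deliver the exact result. */
    assert(ceil(2.3) == 3.0);
    assert(floor(-2.3) == -3.0);
    /* fmod is exact: the remainder is representable whenever the operands are. */
    assert(fmod(10.0, 3.0) == 1.0);
    /* nextafter steps to the adjacent representable value. */
    assert(nextafter(1.0, 2.0) > 1.0);
    /* copysign just moves the sign bit; fmax picks the larger operand exactly. */
    assert(copysign(3.0, -1.0) == -3.0);
    assert(fmax(2.0, 3.0) == 3.0);
    return 0;
}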
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
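One way to smoke-test whether your platform's fma is actually fused is to feed it a product whose rounding error is nonzero: a faked fma returns exactly 0 here, while a real one recovers the rounding error of x*x. This probe is my own illustration, not a standard test, and it assumes doubles are evaluated in double precision (not x87 extended):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.0 + DBL_EPSILON;   /* x*x is not exactly representable */
    double r = fma(x, x, -(x * x)); /* rounding error of x*x, if fused */
    if (r != 0.0)
        printf("fma appears to be fused (error term %g)\n", r);
    else
        printf("fma appears to be faked (x*y rounded, then +z)\n");
    return 0;
}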
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
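You can at least check what a given compiler reports; the FLT_EVAL_METHOD macro comes from float.h (standard since C99):

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 0: evaluate in each type's own precision; 1: evaluate float and
       double operations as double; 2: evaluate everything as long
       double (typical of x87); -1: indeterminable. */
    printf("FLT_EVAL_METHOD = %d\n", (int) FLT_EVAL_METHOD);
    return 0;
}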
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
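You can see exactly where that limit bites for double (2^53 = 9007199254740992); a quick demonstration:

#include <assert.h>

int main(void)
{
    double max_exact = 9007199254740992.0;   /* 2^53 */
    /* Every integer up to 2^53 is exactly representable... */
    assert(max_exact - 1.0 == 9007199254740991.0);
    /* ...but 2^53 + 1 rounds back to 2^53: the gap between
       adjacent doubles here is 2. */
    assert(max_exact + 1.0 == max_exact);
    return 0;
}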
They at least guarantee that the result is immediately on either side of the actual mathematical result; that is, you won't get a result with another valid floating-point value between it and the "actual" result. But beware: repeated operations may accumulate an error that seems to violate this, while it does not (because every intermediate value is subject to the same constraint, not just the inputs and output of a compound expression).
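A classic demonstration of that accumulation: each addition below is individually correctly rounded, yet the sum drifts away from the exact mathematical result, because 0.1 itself is not exactly representable:

#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < 10; i++)
        sum += 0.1;
    printf("%d\n", sum == 1.0);  /* prints 0 on IEEE 754 doubles */
    printf("%.17g\n", sum);      /* e.g. 0.99999999999999989 */
    return 0;
}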