In floating-point arithmetic, if two numbers have the same binary representation, then the result of any operation performed on these numbers should be the same, and equality comparisons using == should work as expected.
For example, if a and b are computed as 1.0/3.0, they will indeed have the same binary representation in a standard floating-point system. Therefore, x and y calculated as follows should be identical, and the assertion should hold.
double a = 1.0/3.0;
double b = 1.0/3.0;
double x = a*a/M_PI;   // M_PI from <math.h>
double y = b*b/M_PI;
assert(x==y);          // assert from <assert.h>
My question is, will the sign of the number affect the accuracy of the results? Will the following always be true?
double a = 1.0/3.0;
double b = -1.0/3.0;
double x = a*a/M_PI;
double y = -(-b*b/M_PI);
assert(x==y);
How about this? Will the assertion hold?
double a = 1.0/3.0;
double b = 1.0/7.0;
double x = a-b;
double y = -(b-a);
assert(x==y);
I mainly work on x86/x64 machines. I assume C, C++, and assembly should all have the same behaviour here, so I tagged both C and C++.
As a demonstration of how the rounding mode can matter, consider this code:
#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main()
{
    double a = 1.0 / 3.0;
    double b = -1.0 / 3.0;
    if (a != -b) printf("unexpectedly unequal #1\n");

    /* switch from the default round-to-nearest to round-toward-negative-infinity */
    fesetround(FE_DOWNWARD);

    a = 1.0 / 3.0;
    b = -1.0 / 3.0;
    if (a != -b) {
        printf("unexpectedly unequal #2:\n");
        printf("a = % .20f\n", a);
        printf("b = % .20f\n", b);
    }
}
When compiled under clang v. 14.0.3 on my Mac (as either C or C++) this code does print "unexpectedly unequal #2",
with the values of a and b displayed as:
a = 0.33333333333333331482
b = -0.33333333333333337035
[In retrospect, I'm impressed this worked the way it did. Either clang is declining to do floating point constant folding at compile time, or it is evaluating the effect of the fesetround call at compile time.]
[Note, too, that the change to the rounding mode has affected the way printf renders the numbers. Under normal rounding, they would have been 0.33333333333333331483 and -0.33333333333333337034.]
Update: this example code does not work (does not print "unexpectedly unequal #2") under gcc, I suspect because gcc is going ahead and folding the constants at compile time. Under gcc v. 13.1.0, at least, it suffices to create global variables double one = 1.0; and double three = 3.0; and then use one and three in the various computations of a and b.
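For concreteness, here is a minimal sketch of that workaround. The variable names one and three come from the description above; the claim that file-scope globals defeat gcc's constant folding is the observation reported for gcc 13.1.0, not a general guarantee.

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

/* Per the note above, gcc 13.1.0 does not constant-fold reads of these
   file-scope globals, so the divisions below happen at run time. */
double one = 1.0;
double three = 3.0;

int main()
{
    fesetround(FE_DOWNWARD);
    double a = one / three;    /* rounds toward -infinity */
    double b = -one / three;   /* also rounds toward -infinity */
    if (a != -b)
        printf("unexpectedly unequal #2\n");
}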
TLDR: In theory the sign can affect results; in practice it doesn't.
The standard for floating-point numbers is IEEE-754. It specifies that the sign is encoded in a separate bit. This means that, unlike with integers, which use two's complement, the range is exactly symmetrical around zero. Neither the significand nor the exponent is affected by the sign. Therefore the sign does not affect precision.
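You can observe that symmetry directly: negating a double flips only the sign bit and leaves the exponent and significand bits untouched. A minimal sketch (bit_pattern is a helper written just for this illustration):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret a double's bytes as a 64-bit integer; memcpy avoids
// strict-aliasing problems.
std::uint64_t bit_pattern(double d)
{
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

int main()
{
    double a = 1.0 / 3.0;
    // XOR isolates the bits where a and -a differ: only the most
    // significant (sign) bit, so this prints 8000000000000000.
    std::printf("%016llx\n",
                (unsigned long long)(bit_pattern(a) ^ bit_pattern(-a)));
}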
However, IEEE-754 allows different rounding modes to decide the lowest bit of the significand (internally, the CPU computes with extra bits, then rounds to the closest representable value). If you change the rounding mode to round toward positive or negative infinity, this will affect results: those two directions treat positive and negative values asymmetrically. But this is rarely (if ever) done, and you have to opt in to that behavior change; it is not the default.
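A short sketch cycling through all four IEEE-754 rounding modes, assuming the platform supports fesetround for each; the volatile qualifiers exist only to keep the compiler from folding the divisions at compile time (see the gcc note in the other answer):

#include <cfenv>
#include <cstdio>

#pragma STDC FENV_ACCESS ON

int main()
{
    volatile double one = 1.0, three = 3.0;  // defeat constant folding

    const int modes[]   = { FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO };
    const char *names[] = { "to nearest", "downward", "upward", "toward zero" };

    for (int i = 0; i < 4; ++i) {
        if (std::fesetround(modes[i]) != 0)
            continue;                        // mode not supported here
        double a = one / three;
        double b = -one / three;
        std::printf("%-12s a == -b? %s\n", names[i], a == -b ? "yes" : "no");
    }
}

On an IEEE-754 system you would expect "no" only for the downward and upward modes; round-to-nearest and round-toward-zero are still symmetric in the sign.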
Even if you write a library, it is probably easier to mandate that it be called with a standard math environment. Otherwise you may also have to worry about trapping math and signalling NaNs. Not every hardware implementation supports changing the rounding mode either; for example, GPUs generally don't, or do so only via explicit per-instruction rounding (CUDA, OpenCL). Not caring about this is also in line with GCC's default behaviour:
Without any explicit options, GCC assumes round to nearest or even and does not care about signalling NaNs.
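If you do write a library and cannot mandate a default environment, one defensive option is to save the caller's rounding mode, force round-to-nearest for your own computation, and restore the mode afterwards. A sketch, where compute_something is a hypothetical entry point rather than any existing API:

#include <cfenv>

#pragma STDC FENV_ACCESS ON

// Hypothetical library function: insists on round-to-nearest internally,
// then hands back whatever rounding mode the caller had set.
double compute_something(double x)
{
    const int saved = std::fegetround();   // caller's mode, negative on failure
    std::fesetround(FE_TONEAREST);
    double result = x * x / 3.0;           // stand-in for the real work
    if (saved >= 0)
        std::fesetround(saved);            // restore the caller's mode
    return result;
}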
Strictly speaking, most languages do not require IEEE-754 for their floating-point type. However, you will have a hard time finding a system in wide use that doesn't follow the standard. See also Why not use a two's complement based floating-point?
The expectation that the range is symmetrical is also baked into a lot of code, including standard libraries. For example, C++ before C++11 had std::numeric_limits<float>::max() for the largest finite positive value and std::numeric_limits<float>::min() for the smallest nonzero, normalized, positive value. If you wanted the most-negative finite value, you had to negate max().
This changed only in C++11 with std::numeric_limits<float>::lowest(), and it did not change for C itself, which uses the analogous float.h macros. It still leaves gaps; for example, there is no direct way to retrieve the largest normalized negative number, i.e. the first one below zero. The standard expects you to simply negate min().
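A short sketch showing those queries side by side; the static_assert on is_iec559 merely confirms that float claims IEEE-754 conformance on the platform:

#include <cstdio>
#include <limits>

int main()
{
    using lim = std::numeric_limits<float>;
    static_assert(lim::is_iec559, "float is not IEEE-754 here");

    std::printf("largest finite positive:          %g\n", lim::max());
    std::printf("smallest positive normalized:     %g\n", lim::min());
    std::printf("most negative finite (C++11):     %g\n", lim::lowest());
    std::printf("most negative finite (pre-C++11): %g\n", -lim::max());
    std::printf("largest negative normalized:      %g\n", -lim::min());
}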