Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What exactly does (1.0e300 + pow(2.0, -30.0) > 1.0) do in STDC?

Tags:

c

ieee-754

atan2

I have come across a function which computes atan(x) (the source is here). Reducing it to the core of my question and slightly reformatting it, they have something like that:

static const double one   = 1.0,
                   huge   = 1.0e300;

double atan(double x)
{
  /* A lot of uninteresting stuff here */

  if (ix < 0x3fdc0000) {              /* |x| < 0.4375 */

    if (ix < 0x3e200000) {            /* |x| < 2^-29 */
      if ((huge + x) > one) return x; /* raise inexact */
    }

    id = -1;
  }

  /* A lot of more uninteresting stuff */
}

I would be very interested in what the line if ((huge + x) ... is supposed to do and how it works.

According to the comments, if the absolute value of x is smaller than 2^-29, the expression or the comparison raises an inexact error.

My first problem is that I currently don't get why this is done that way: If computing the arctan using that function leads to imprecise results if the absolute value of x is too small, why don't they just use something like if (fabs(x) < [some_value_here]) ...? I suspect that this is just because the inexact warning wouldn't be raised that way in their hardware / library, but I'd like to know for sure.

Assuming that I am right, my second problem is that I don't understand why the comparison is needed. I think that the key point here is that a very small number is added to a very large one, so that this addition does not change the large one sufficiently or even not at all. Hence, it is the addition which will raise the inexact warning, not the comparison. So I am asking myself what the comparison should do. Is this just to force the compiler to actually compute (huge + x), which might be optimized out otherwise?

Finally, I would be grateful if somebody could explain the math a little bit. Choosing the value 1.0e300 for huge seems a pretty arbitrary choice. But this is only a bonus question since I admit I haven't done the math part of my homework yet (I am not a total newbie regarding double values and their IEEE754 representation, but understanding the mathematical aspects of this code will take me some time, unless somebody gives a short explanation).

EDIT 1

Just seen by accident:

The float32 version of that function, including the weird line discussed above, is nearly literally still in glibc 2.19! Since glibc is meant to be portable, that code should be as well. It is in the sub-directory sysdeps\ieee754\flt-32, so I suppose this is the software emulation of the float32 functions, where portability is no issue because hardware-dependent oddities won't show up (I think the software emulation raises those exceptions exactly as defined in IEEE754).

like image 934
Binarus Avatar asked Feb 19 '19 18:02

Binarus


1 Answers

The intent of if ((huge + x) > one) return x; is to generate a floating-point inexact exception and then return from the routine.

A floating-point exception is not a trap or processor exception. It just means something unusual has happened in a floating-point operation. What happens then depends on the circumstances of the operation. In particular, the floating-point environment might be set so that an inexact exception merely raises a flag in a special register and continues with the operation, delivering a numerical result. Or it might be set so that an inexact exception causes a trap, and program control is redirected to a trap handler.

This code implementing atan does not know how the floating-point environment is set. Maybe it could fetch the settings, but it does not want to bother with that. Given that it has decided the arctangent function cannot be computed exactly, the easiest way for it to trigger a floating-point inexact exception is merely to perform a simple addition that has an inexact result. This inexact addition will have the same behavior that is desired for an inexact arctangent—it will either just raise the flag or cause a trap, depending on the settings.

As to why comparisons are made with ix < 0x3e200000, that is not clear. For one thing, ix has been adjusted to reflect the absolute value, whereas x has not, so why not use the already prepared ix rather than using another operation to produce fabs(x)? Additionally, an integer compare typically takes less processor resources than a floating-point compare, especially in the processors of the time when this code was written. Or it may be the author simply happened to use one over the author, perhaps writing most of their code using ix to operate on the floating-point encoding and not x to operate on the floating-point value, and they did not want to switch back and forth unnecessarily. It could also be because the code was written before the hexadecimal floating-point notation was available (so we could write x < 0x1p-29f), and compilers were not good about converting decimal numerals to floating-point values, so they did not want to write out the floating-point value in the source code.

This sort of code is problematic and is highly dependent on the C implementation it is written for. In general, there may not be a guarantee from the compiler that (huge + x) > one will be evaluated during program execution. The compiler might evaluate it at compile time. Presumably, though, if this code is written for a particular C implementation, they know the compiler will either evaluate it at compile time or will ensure the same result is achieved, including raising of the floating-point inexact exception.

On the face of it, (huge + x) > one does not seem to do anything useful that huge + x alone does not do, but perhaps the author knew something about the compiler that we do not.

huge does not need to be 1.0e300. Any value so large that the sum of huge and x cannot be exact would suffice.

like image 121
Eric Postpischil Avatar answered Nov 09 '22 21:11

Eric Postpischil