Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest algorithm to identify the smallest and largest x that make the double-precision equation x + a == b true

In the context of static analysis, I am interested in determining the values of x in the then-branch of the conditional below:

double x;
x = …;
if (x + a == b)
{
  …

a and b can be assumed to be double-precision constants (generalizing to arbitrary expressions is the easiest part of the problem), and the compiler can be assumed to follow IEEE 754 strictly (FLT_EVAL_METHOD is 0). The rounding mode at run-time can be assumed to be to nearest-even.

If computing with rationals was cheap, it would be simple: the values for x would be the double-precision numbers contained in the rational interval (b - a - 0.5 * ulp1(b) … b - a + 0.5 * ulp2(b)). The bounds should be included if b is even, excluded if b is odd, and ulp1 and ulp2 are two slightly different definitions of “ULP” that can be taken identical if one does not mind losing a little precision on powers of two.

Unfortunately, computing with rationals can be expensive. Consider that another possibility is to obtain each of the bounds by dichotomy, in 64 double-precision additions (each operation deciding one bit of the result). 128 floating-point additions to obtain the lower and upper bounds may well be faster than any solution based on maths.

I am wondering if there is a way to improve over the “128 floating-point additions” idea. Actually I have my own solution involving changes of rounding mode and nextafter calls, but I wouldn't want to cramp anyone's style and cause them to miss a more elegant solution than the one I currently have. Also I am not sure that changing the rounding mode twice is actually cheaper than 64 floating-point additions.

like image 899
Pascal Cuoq Avatar asked Jun 14 '14 18:06

Pascal Cuoq


People also ask

What is the smallest number in double precision?

The smallest representable number in IEEE Double Precision format is approximately 4.95×10−324 which is many orders of magnitude smaller than 1.1102230246316×10−16.

What is the largest double precision number?

A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.7976931348623158e+308 to -2.2250738585072014e-308, or from 2.2250738585072014e-308 to 1.7976931348623158e+308.

What is the gap between 2 and the next larger double precision number?

How many IEEE double precision numbers are there between an adjacent pair of nonzero IEEE single precision numbers? The binary floating point representation of 2 is 1.02 ×21. Therefore the next larger double precision floating point number is (1 + 2−52) × 21, and the gap is 2−51.

What is mantissa numerical method?

The definition of a mantissa is the part of a number located after a decimal point. An example of mantissa is 234 in the number 1101.234. (mathematics) The part of a common logarithm after the decimal point, the fractional part of a logarithm.


1 Answers

You already gave a nice and elegant solution in your question:

If computing with rationals was cheap, it would be simple: the values for x would be the double-precision numbers contained in the rational interval (b - a - 0.5 * ulp1(b) … b - a + 0.5 * ulp2(b)). The bounds should be included if b is even, excluded if b is odd, and ulp1 and ulp2 are two slightly different definitions of “ULP” that can be taken identical if one does not mind losing a little precision on powers of two.

What follows is a half-reasoned sketch of a partial solution to the problem based on this paragraph. Hopefully I'll get a chance to flesh it out soon. To get a real solution, you'll have to handle subnormals, zeroes, NaNs, and all that other fun stuff. I'm going to assume that a and b are, say, such that 1e-300 < |a| < 1e300 and 1e-300 < |b| < 1e300 so that no craziness occurs at any point.

Absent overflow and underflow, you can get ulp1(b) from b - nextafter(b, -1.0/0.0). You can get ulp2(b) from nextafter(b, 1.0/0.0) - b.

If b/2 <= a <= 2b, then Sterbenz's theorem tells you that b - a is exact. So (b - a) - ulp1 / 2 will be the closest double to the lower bound and (b - a) + ulp2 / 2 will be the closest double to the upper bound. Try these values, and the values immediately before and after, and pick the widest interval that works.

If b > 2a, b - a > b/2. The computed value of b - a is off by at most half an ulp. One ulp1 is at most two ulp, as is one ulp2, so the rational interval you gave is at most two ulp wide. Figure out which of the five closest values to b-a work.

If a > 2b, an ulp of b-a is at least as big as an ulp of b; if anything works, I bet it'll have to be be among the three closest values to b-a. I imagine the case where a and b have different signs works similarly.

I wrote a small pile of C++ code implementing this idea. It didn't fail random fuzz testing (in a few different ranges) before I got bored of waiting. Here it is:

void addeq_range(double a, double b, double &xlo, double &xhi) {
  if (a != a) return; // empty interval
  if (b != b) {
    if (a-a != 0) { xlo = xhi = -a; return; }
    else return; // empty interval
  }
  if (b-b != 0) {
    // TODO: handle me.
  }

  // b is now guaranteed to be finite.
  if (a-a != 0) return; // empty interval

  if (b < 0) {
    addeq_range(-a, -b, xlo, xhi);
    xlo = -xlo;
    xhi = -xhi;
    return;
  }

  // b is now guaranteed to be zero or positive finite and a is finite.
  if (a >= b/2 && a <= 2*b) {
    double upulp = nextafter(b, 1.0/0.0) - b;
    double downulp = b - nextafter(b, -1.0/0.0);
    xlo = (b-a) - downulp/2;
    xhi = (b-a) + upulp/2;
    if (xlo + a == b) {
      xlo = nextafter(xlo, -1.0/0.0);
      if (xlo + a != b) xlo = nextafter(xlo, 1.0/0.0);
    } else xlo = nextafter(xlo, 1.0/0.0);
    if (xhi + a == b) {
      xhi = nextafter(xhi, 1.0/0.0);
      if (xhi + a != b) xhi = nextafter(xhi, -1.0/0.0);
    } else xhi = nextafter(xhi, -1.0/0.0);
  } else {
    double xmid = b-a;
    if (xmid + a < b) {
      xhi = xlo = nextafter(xmid, 1.0/0.0);
      if (xhi + a != b) xhi = xmid;
    } else if (xmid + a == b) {
      xlo = nextafter(xmid, -1.0/0.0);
      xhi = nextafter(xmid, 1.0/0.0);
      if (xlo + a != b) xlo = xmid;
      if (xhi + a != b) xhi = xmid;
    } else {
      xlo = xhi = nextafter(xmid, -1.0/0.0);
      if (xlo + a != b) xlo = xmid;
    }
  }
}
like image 103
tmyklebu Avatar answered Oct 05 '22 09:10

tmyklebu