
next higher/lower IEEE double precision number

I am doing high precision scientific computations. In looking for the best representation of various effects, I keep coming up with reasons to want to get the next higher (or lower) double precision number available. Essentially, what I want to do is add one to the least significant bit in the internal representation of a double.

The difficulty is that the IEEE format is not totally uniform. If one were to use low-level code and actually add one to the least significant bit, the resulting format might not be the next available double. It might, for instance, be a special case number such as PositiveInfinity or NaN. There are also the sub-normal values, which I don't claim to understand, but which seem to have specific bit patterns different from the "normal" pattern.

An "epsilon" value is available, but I have never understood its definition. Since double values are not evenly spaced, no single value can be added to a double to result in the next higher value.

I really don't understand why IEEE hasn't specified a function to get the next higher or lower value. I can't be the only one who needs it.

Is there a way to get the next value (without some sort of loop that tries adding smaller and smaller values)?

Mark T asked Aug 07 '09 16:08




2 Answers

There are functions available for doing exactly that, but they can depend on what language you use. Two examples:

  • if you have access to a decent C99 math library, you can use nextafter (and its float and long double variants, nextafterf and nextafterl); or the nexttoward family (which take a long double as second argument).

  • if you write Fortran, you have the nearest intrinsic available

If you can't access these directly from your language, you can also look at how they're implemented in freely available implementations, such as this one.
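To give a feel for what such an implementation does, here is a minimal sketch in Java. It is not the code of any particular library: the class `NextUpSketch` and the method `nextUpBits` are hypothetical names chosen for illustration, and the sketch only handles the "next higher" direction. It relies on the fact that, for finite non-zero doubles, neighbouring values have neighbouring bit patterns, so the special cases (NaN, infinity, zero) are the only ones that need explicit handling.

```java
public final class NextUpSketch {
    private NextUpSketch() {}

    /** Hypothetical helper: next double toward +infinity via bit manipulation. */
    public static double nextUpBits(double d) {
        // NaN and +infinity have no larger neighbour; return them unchanged.
        if (Double.isNaN(d) || d == Double.POSITIVE_INFINITY) {
            return d;
        }
        // Both +0.0 and -0.0 step to the smallest positive subnormal.
        if (d == 0.0) {
            return Double.MIN_VALUE;
        }
        long bits = Double.doubleToLongBits(d);
        // Positive values: a larger bit pattern is a larger value.
        // Negative values: a smaller bit pattern is a smaller magnitude,
        // which is the direction of +infinity.
        bits += (d > 0.0) ? 1L : -1L;
        return Double.longBitsToDouble(bits);
    }

    public static void main(String[] args) {
        double[] samples = {
            1.0, -1.0, 0.0, Double.MIN_VALUE, Double.MAX_VALUE, Double.NEGATIVE_INFINITY
        };
        for (double x : samples) {
            System.out.println(x + " -> " + nextUpBits(x)
                    + " (Math.nextUp gives " + Math.nextUp(x) + ")");
        }
    }
}
```

Subnormals need no special branch here: incrementing the largest subnormal's bit pattern carries into the exponent field and yields the smallest normal number, and incrementing `Double.MAX_VALUE` yields positive infinity. In real code the library function is still the better choice.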

F'x answered Sep 19 '22 21:09


Most languages have intrinsic or library functions for acquiring the next or previous single-precision (32-bit) and/or double-precision (64-bit) number.

For users of 32-bit and 64-bit floating point arithmetic, a sound understanding of the basic constructs is very useful for avoiding some hazards with them. The IEEE standard applies uniformly, but still leaves a number of details up to implementers. Hence, a platform-universal solution based on bit manipulation of the machine-word representation may be problematic and may depend on issues such as endianness. Whilst understanding all the gory details of how it could or should work at the bit level may demonstrate intellectual prowess, it is still better to use an intrinsic or library solution that is tailored for each platform and has a universal API across supported platforms.

I noticed solutions for C# and C++. Here are some for Java:

Math.nextUp:

public static double nextUp(double d):

  • Returns the floating-point value adjacent to d in the direction of positive infinity. This method is semantically equivalent to nextAfter(d, Double.POSITIVE_INFINITY); however, a nextUp implementation may run faster than its equivalent nextAfter call.

Special Cases:

  • If the argument is NaN, the result is NaN.
  • If the argument is positive infinity, the result is positive infinity.
  • If the argument is zero, the result is Double.MIN_VALUE.

Parameters:

  • d - starting floating-point value

Returns:

  • The adjacent floating-point value closer to positive infinity.
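As a rough illustration of the description above, the following sketch uses Math.nextUp to measure the gap between a double and its upper neighbour, which shows why no single "epsilon" can be added to every value. The class `GapDemo` and the helper `gapAbove` are hypothetical names introduced only for this example; Math.nextUp and Math.ulp are standard library methods.

```java
public final class GapDemo {
    private GapDemo() {}

    /** Hypothetical helper: distance from d to the next larger double. */
    public static double gapAbove(double d) {
        return Math.nextUp(d) - d;
    }

    public static void main(String[] args) {
        // The gap grows with the magnitude of the starting value.
        System.out.println("gap above 1.0  = " + gapAbove(1.0));
        System.out.println("gap above 2.0  = " + gapAbove(2.0));
        System.out.println("gap above 1e16 = " + gapAbove(1e16));

        // For 1.0 the gap matches Math.ulp(1.0).
        System.out.println("matches ulp    = " + (gapAbove(1.0) == Math.ulp(1.0)));

        // Special cases listed in the documentation.
        System.out.println("NaN stays NaN  = " + Double.isNaN(Math.nextUp(Double.NaN)));
        System.out.println("+inf stays     = "
                + (Math.nextUp(Double.POSITIVE_INFINITY) == Double.POSITIVE_INFINITY));
        System.out.println("zero -> MIN    = " + (Math.nextUp(0.0) == Double.MIN_VALUE));
    }
}
```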

public static float nextUp(float f):

  • Returns the floating-point value adjacent to f in the direction of positive infinity. This method is semantically equivalent to nextAfter(f, Float.POSITIVE_INFINITY); however, a nextUp implementation may run faster than its equivalent nextAfter call.

Special Cases:

  • If the argument is NaN, the result is NaN.
  • If the argument is positive infinity, the result is positive infinity.
  • If the argument is zero, the result is Float.MIN_VALUE.

Parameters:

  • f - starting floating-point value

Returns:

  • The adjacent floating-point value closer to positive infinity.

The next two are a bit more complex to use. However, stepping towards zero or towards either positive or negative infinity seems the most likely and useful case. Another use is to check whether an intermediate value exists between two values. One can determine how many values lie between two values with a loop and a counter. Also, it seems these methods, along with the nextUp methods, might be useful for incrementing or decrementing in for loops.
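The loop-and-counter idea can be sketched as follows. The class `CountBetween` and its two helpers are hypothetical names, and the approach is only practical when the two values are close together: between 1.0 and 2.0 alone there are roughly 2^52 doubles, so a loop over that range would not finish in reasonable time.

```java
public final class CountBetween {
    private CountBetween() {}

    /** Hypothetical helper: true if at least one double lies strictly between lo and hi. */
    public static boolean hasValueBetween(double lo, double hi) {
        // Any comparison with NaN is false, so NaN inputs yield false.
        return Math.nextUp(lo) < hi;
    }

    /** Hypothetical helper: counts the doubles strictly between lo and hi. */
    public static long countBetween(double lo, double hi) {
        long count = 0;
        // Step through every representable value above lo until hi is reached.
        for (double x = Math.nextUp(lo); x < hi; x = Math.nextUp(x)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        double lo = 1.0;
        double hi = lo;
        // Build an upper bound exactly 1000 steps above lo.
        for (int i = 0; i < 1000; i++) {
            hi = Math.nextUp(hi);
        }
        System.out.println("values strictly between: " + countBetween(lo, hi));
        System.out.println("neighbours have a value between them: "
                + hasValueBetween(lo, Math.nextUp(lo)));
    }
}
```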

Math.nextAfter:

public static double nextAfter(double start, double direction)

  • Returns the floating-point number adjacent to the first argument in the direction of the second argument. If both arguments compare as equal the second argument is returned.

Special cases:

  • If either argument is a NaN, then NaN is returned.
  • If both arguments are signed zeros, direction is returned unchanged (as implied by the requirement of returning the second argument if the arguments compare as equal).
  • If start is ±Double.MIN_VALUE and direction has a value such that the result should have a smaller magnitude, then a zero with the same sign as start is returned.
  • If start is infinite and direction has a value such that the result should have a smaller magnitude, Double.MAX_VALUE with the same sign as start is returned.
  • If start is equal to ± Double.MAX_VALUE and direction has a value such that the result should have a larger magnitude, an infinity with same sign as start is returned.

Parameters:

  • start - starting floating-point value
  • direction - value indicating which of start's neighbors or start should be returned

Returns:

  • The floating-point number adjacent to start in the direction of direction.
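The special cases above are easiest to see by stepping towards zero, which is one of the uses mentioned earlier. This is a small sketch, with `NextAfterDemo` and `towardZero` as hypothetical names; the checks compare raw bit patterns because `0.0 == -0.0` is true in Java and would hide the sign of a returned zero.

```java
public final class NextAfterDemo {
    private NextAfterDemo() {}

    /** Hypothetical helper: steps one double towards zero. */
    public static double towardZero(double d) {
        return Math.nextAfter(d, 0.0);
    }

    /** Hypothetical helper: exact bit-pattern equality, so -0.0 and 0.0 differ. */
    public static boolean sameBits(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        // Stepping down from the smallest subnormal keeps the sign of start.
        System.out.println(sameBits(towardZero(Double.MIN_VALUE), 0.0));
        System.out.println(sameBits(towardZero(-Double.MIN_VALUE), -0.0));

        // Stepping in from infinity lands on MAX_VALUE with the same sign.
        System.out.println(towardZero(Double.POSITIVE_INFINITY) == Double.MAX_VALUE);
        System.out.println(towardZero(Double.NEGATIVE_INFINITY) == -Double.MAX_VALUE);

        // Equal arguments (including signed zeros): direction is returned.
        System.out.println(sameBits(towardZero(-0.0), 0.0));

        // Stepping out from MAX_VALUE overflows to infinity.
        System.out.println(Math.nextAfter(Double.MAX_VALUE, Double.POSITIVE_INFINITY)
                == Double.POSITIVE_INFINITY);

        // A NaN in either position gives NaN.
        System.out.println(Double.isNaN(Math.nextAfter(1.0, Double.NaN)));
    }
}
```

If the documented special cases hold, every line printed is `true`.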

public static float nextAfter(float start, double direction)

  • Returns the floating-point number adjacent to the first argument in the direction of the second argument. If both arguments compare as equal a value equivalent to the second argument is returned.

Special cases:

  • If either argument is a NaN, then NaN is returned.
  • If both arguments are signed zeros, a value equivalent to direction is returned.
  • If start is ±Float.MIN_VALUE and direction has a value such that the result should have a smaller magnitude, then a zero with the same sign as start is returned.
  • If start is infinite and direction has a value such that the result should have a smaller magnitude, Float.MAX_VALUE with the same sign as start is returned.
  • If start is equal to ± Float.MAX_VALUE and direction has a value such that the result should have a larger magnitude, an infinity with same sign as start is returned.

Parameters:

  • start - starting floating-point value
  • direction - value indicating which of start's neighbors or start should be returned

Returns:

  • The floating-point number adjacent to start in the direction of direction.
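One detail of the float overload is that direction is a double, so a target that lies between two adjacent floats can still indicate which way to step. The sketch below illustrates this under that reading of the documentation; `FloatNextAfterDemo` and `step` are hypothetical names.

```java
public final class FloatNextAfterDemo {
    private FloatNextAfterDemo() {}

    /** Hypothetical helper: thin wrapper that selects the float overload. */
    public static float step(float start, double direction) {
        return Math.nextAfter(start, direction);
    }

    public static void main(String[] args) {
        // The double just above 1.0 is far closer to 1.0 than the next float,
        // yet it is still "greater", so the result is the next float up.
        double barelyAboveOne = Math.nextUp(1.0);
        System.out.println(step(1.0f, barelyAboveOne) == Math.nextUp(1.0f));

        // Float spacing near 1.0 is much coarser than double spacing.
        System.out.println("float gap:  " + (Math.nextUp(1.0f) - 1.0f));
        System.out.println("double gap: " + (Math.nextUp(1.0) - 1.0));

        // Stepping past Float.MAX_VALUE gives infinity, even though the
        // direction value itself is a finite double.
        System.out.println(step(Float.MAX_VALUE, Double.MAX_VALUE)
                == Float.POSITIVE_INFINITY);
    }
}
```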

Jim answered Sep 22 '22 21:09