Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to round double precision numbers to a lower precision given in number of bits

In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Thus large numbers will change in absolute terms much more than small numbers, but they will tend to change the same proportionately. So if I want to round to 10 binary digits, I find the ten most significant bits, and zero out all the lower bits, possibly adding a small number for rounding up.

I prefer "half-way" numbers be rounded up.

If it were an integer type, here would be a possible algorithm:

  1. Find: zero-based index of the most significant binary digit set H.
  2. Compute: B = H - P, 
       where P is the number of significant digits of precision to round
       and B is the binary digit to start rounding, where B = 0 is the ones place, 
       B = 1 is the twos place, etc. 
  3. Add: x = x + 2^B 
       This will force a carry if necessary (we round halfway values up).
  4. Zero out: x = x mod 2^(B+1). 
       This clears the B place and all lower digits.

The problem is finding an efficient way to find the highest bit set. If I were using integers, there are cool bit hacks to find the MSB. I do not want to call Round(Log2(x)) if I can help it. This function will be called many millions of times.

Note: I have read this SO question:

What is a good way to round double-precision values to a (somewhat) lower precision?

It works for C++. I am using C#.

UPDATE:

This is the code (modified from what the answerer supplied) as I am using it:

/// <summary>
/// Round numbers to a specified number of significant binary digits.
/// 
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
/// 
///      8    1000 => 1000   8
///      9    1001 => 1010  10
///     10    1010 => 1010  10
///     11    1011 => 1100  12
///     12    1100 => 1100  12
///     13    1101 => 1110  14
///     14    1110 => 1110  14
///     15    1111 =>10000  16
///     16   10000 =>10000  16
///     
/// This is different from rounding in that we are specifying the place where rounding occurs as the distance to the right
/// in binary digits from the highest bit set, not the distance to the left from the zero bit.
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
    // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
    var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
    ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
    var local_d = d;
    unsafe
    {
        // double -> fixed point (sorta)
        ulong toLong = *(ulong*)(&local_d);
        // mask off your least-sig bits
        var modLong = toLong & significandMask;
        // fixed point -> float (sorta)
        local_d = *(double*)(&modLong);
    }
    return local_d;
}

UPDATE 2: Dekker's Algorithm

I derived this from Dekker's algorithm, thanks to the other respondent. It rounds to the closest value, instead of truncating as the above code does, and it uses only safe code:

private static double[] PowersOfTwoPlusOne;

static NumericalAlgorithms()
{
    PowersOfTwoPlusOne = new double[54];
    for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
    {
        if (i == 0)
            PowersOfTwoPlusOne[i] = 1; // Special case.
        else
        {
            long two_to_i_plus_one = (1L << i) + 1L;
            PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
        }
    }
}

public static double AdjustPrecisionSafely(this double d, int digits)
{
    double t = d * PowersOfTwoPlusOne[53 - digits];
    double adjusted = t - (t - d);
    return adjusted;
}

UPDATE 2: TIMINGS

I ran a test and found that Dekker's algorithm is better than TWICE as fast!

Number of calls in test: 100,000,000
Unsafe Time = 1.922 (sec)
Safe Time = 0.799 (sec)

like image 858
Paul Chernoch Avatar asked Jan 11 '13 19:01

Paul Chernoch


People also ask

How much more precise is a double precision format?

Double precision numbers are accurate up to sixteen decimal places but after calculations have been done there may be some rounding errors to account for. In theory this should affect no more than the last significant digit but in practice it is safer to rely upon fewer decimal places.

What is mid point rounding?

Rounding to nearest even, or banker's rounding Midpoint values are rounded to the nearest even number. For example, both 3.75 and 3.85 round to 3.8, and both -3.75 and -3.85 round to -3.8. This form of rounding is represented by the MidpointRounding.

How to round values in c#?

In C#, Math. Round() is a Math class method which is used to round a value to the nearest integer or to the particular number of fractional digits. This method has another overload with which, you can specify the number of digits beyond the decimal point in the returned value.

How do you calculate a double precision floating point?

Short answer: the max value for a double-precision value (assuming IEEE 754 floating-point) is exactly 2^1024 * (1 - 2^-53). For a single-precision value it's 2^128 * (1 - 2^-24).


1 Answers

Dekker’s algorithm will split a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s-b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2b. If b is known at compile time, e.g., the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;
    *x1 = x - *x0;
}
like image 191
Eric Postpischil Avatar answered Oct 12 '22 13:10

Eric Postpischil