Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Floating point range reduction

I'm implementing some 32-bit float trigonometry in C# using Mono, hopefully utilizing Mono.Simd. I'm only missing solid range reduction currently. I'm rather stuck now, because apparently Mono's SIMD extensions does not include conversions between floats and integers, meaning I have no access to rounding/truncation which would be the usual method. I can however convert bitwise between ints and floats.

Can something like this be done? I can scale the domain up and down if needed, but ideally the range reduction should result in a domain of [0, 2 pi] or [-pi, pi]. I have a hunch that it would be possible to do some IEEE magic with the exponent, if the domain is a power of 2, but I'm really not sure how to.

Edit: Okay, I've tried messing around with this C code and it feels like I'm on the verge of something (it doesn't work but the fractional part is always correct, in decimal / base10 at least...). The core principle seems to be getting the exponent difference between your domain and the input exponent, and composing a new float with a shifted mantissa and an adjusted exponent.. But it won't work for negatives, and I have no idea how to handle non-powers of 2 (or anything fractional - in fact, anything else than 2 doesn't work!).

// here's another more correct attempt:
float fmodulus(float val, int domain)
{
    const int mantissaMask = 0x7FFFFF;
    const int exponentMask = 0x7F800000;

    int ival = *(int*)&val;

    int mantissa = ival & mantissaMask;
    int rawExponent = ival & exponentMask;
    int exponent = (rawExponent >> 23) - (129 - domain);
    // powers over one:
    int p = exponent;

    mantissa <<= p;
    rawExponent = exponent >> p;
    rawExponent += 127;
    rawExponent <<= 23;

    int newVal = rawExponent & exponentMask;
    newVal |= mantissa & mantissaMask;

    float ret = *(float*)&newVal;
    
    return ret;
}

float range_reduce(float value, int range )
{
    const int mantissaMask = 0x7FFFFF;
    const int exponentMask = 0x7F800000;

    int ival = *(int*)&value;
    // grab exponent:
    unsigned exponent = (ival & exponentMask) >> 23;
    // grab mantissa:
    unsigned mantissa = ival & mantissaMask;

    // remove bias, and see how much the exponent is over range/domain
    unsigned char erange = (unsigned char)(exponent - (125 + range));
    // check if sign bit is set - that is, the exponent is under our range
    if (erange & 0x80)
    {
        // don't do anything then.
        erange = 0;
    }

    // shift mantissa (and chop off bits) by the reduced amount
    int inewVal = (mantissa << (erange)) & mantissaMask;
    // add exponent, and subtract the amount we reduced the argument with
    inewVal |= ((exponent - erange) << 23) & exponentMask;

    // reinterpret
    float newValue = *(float*)&inewVal;
    return newValue;
    //return newValue - ((erange) & 0x1 ? 1.0f : 0.0f);
}

int main()
{
    float val = 2.687f;
    int ival = *(int*)&val;
    float correct = fmod(val, 2);
    float own = range_reduce(val, 2);

    getc(stdin);
}

Edit 2:

Okay, I'm really trying to understand this in terms of the ieee binary system. If we write the modulus operation like this:

output = input % 2

[exponent] + [mantissa_bit_n_times_exponent]

3.5     = [2] + [1 + 0.5]                   ->[1] + [0.5]       = 1.5
4.5     = [4] + [0 + 0 + 0.5]               ->[0.5] + [0]       = 0.5
5.5     = [4] + [0 + 1 + 0.5]               ->[1] + [0.5]       = 1.5
2.5     = [2] + [0 + 0.5]                   ->[0.5] + [0]       = 0.5
2.25    = [2] + [0 + 0 + 0.25]              ->[0.25]            = 0.25
2.375   = [2] + [0 + 0 + 0.25 + 0.125]      ->[0.25] + [0.125]  = 0.375
13.5    = [8] + [4 + 0 + 1 + 0.5]           ->[1] + [0.5]       = 1.5
56.5    = [32] + [16 + 8 + 0 + 0 + 0 + 0.5] ->[0.5]             = 0.5

We can see the output in all cases is a new number, with no original exponent and the mantissa shifted an amount (that is based on the exponent and the first non-zero bits of the mantissa after the first exponent-bits of the mantissa is ignored) into the exponent. But I'm not really sure if this is the correct approach, it just works out nicely on paper.

Edit3: I'm stuck on Mono version 2.0.50727.1433

like image 995
Shaggi Avatar asked Nov 01 '22 07:11

Shaggi


2 Answers

You can reduce the problem to taking a float mod 1. To simplify that, you can compute the floor of the float using bit operations, then use a floating point subtraction. The following is (unsafe) C# code for these operations:

// domain is assumed to be positive
// returns value in [0,domain)
public float fmodulus(float val, float domain)
{
    if (val < 0)
    {
        float negative = fmodulus(-val, domain);
        if (domain - negative == domain)
            return 0;
        else
            return domain-negative;
    }

    if (val < domain)
        return val; // this avoids losing accuracy

    return fmodOne(val / domain) * domain;
}

// assumes val >= 1, so val is positive and the exponent is at least 0 
unsafe public float fmodOne(float val)
{
    int iVal = *(int*)&val;
    int uncenteredExponent = iVal >> 23;
    int exponent = uncenteredExponent - 127; // 127 corresponds to 2^0 times the mantissa
    if (exponent >= 23) 
        return 0; // not enough precision to distinguish val from an integer

    int unneededBits = 23 - exponent; // between 0 and 23
    int iFloorVal = (iVal >> unneededBits) << unneededBits; // equivalent to using a mask to zero the bottom bits of the mantissa
    float floorVal = *(float*)&iFloorVal; // convert the bit pattern back to a float

    return val-floorVal;
}

For example, fmodulus(100.1f, 1) is 0.09999847. The bit pattern of 100.1f is

0 10000101 10010000011001100110011

The bit pattern of floorVal (100f) is

0 10000101 10010000000000000000000

A floating point subtraction gives something close to 0.1f:

0 01111011 10011001100110000000000

Actually, I was surprised that the last 8 bits were zeroed out. I thought only the last 6 bits of 0.1f were supposed to be replaced with 0. Perhaps one can do better than relying on the floating point subtraction.

like image 127
Douglas Zare Avatar answered Nov 16 '22 13:11

Douglas Zare


Check your mono version because ConvertToInt and ConvertToIntTruncated have been added 4 years ago and should be present since release 2.10.

like image 30
Jester Avatar answered Nov 16 '22 13:11

Jester