
Packing 32-bit floats into 30 bits (C++)

Here are the goals I'm trying to achieve:

  • I need to pack 32 bit IEEE floats into 30 bits.
  • I want to do this by decreasing the size of mantissa by 2 bits.
  • The operation itself should be as fast as possible.
  • I'm aware that some precision will be lost, and this is acceptable.
  • It would be an advantage if this operation did not ruin special cases like SNaN, QNaN, infinities, etc., but I'm ready to sacrifice this for speed.

I guess this question consists of two parts:

1) Can I simply clear the two least significant bits of the mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:

float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;

2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?

Thanks in advance

Smilediver asked Nov 29 '22 19:11



1 Answer

You actually violate the strict aliasing rules (section 3.10 of the C++ standard) with these reinterpret casts. This will probably blow up in your face when you turn on the compiler optimizations.

C++ standard, section 3.10 paragraph 15 says:

If a program attempts to access the stored value of an object through an lvalue of other than one of the following types, the behavior is undefined:

  • the dynamic type of the object,
  • a cv-qualified version of the dynamic type of the object,
  • a type similar to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
  • a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
  • a char or unsigned char type.

Specifically, 3.10/15 doesn't allow us to access a float object via an lvalue of type int or unsigned int. I actually got bitten by this myself: a program I wrote stopped working after I turned on optimizations. Apparently, GCC didn't expect an lvalue of type float to alias an lvalue of type int, which is a fair assumption under 3.10/15. The optimizer reordered the instructions under the as-if rule, exploiting 3.10/15, and the program stopped working.

Under the following assumptions

  • float really corresponds to a 32bit IEEE-float,
  • sizeof(float)==sizeof(int)
  • unsigned int has no padding bits or trap representations

you should be able to do it like this:

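#include <cstring>  // std::memcpy

/// Returns a 30-bit number: the float's bit pattern with the two
/// least significant mantissa bits dropped.
unsigned int pack_float(float x) {
    unsigned int r;
    std::memcpy(&r, &x, sizeof r);  // copy the bits without violating 3.10/15
    return r >> 2;
}

/// Restores a float from a 30-bit number; the dropped bits come back as zeros.
float unpack_float(unsigned int x) {
    x <<= 2;
    float r;
    std::memcpy(&r, &x, sizeof r);
    return r;
}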
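The three assumptions listed above can be checked at compile time instead of being taken on trust. The following is only a sketch, assuming C++11 or later; the names pack_float_checked and unpack_float_checked are made up for this illustration, and their bodies just repeat the memcpy approach.

```cpp
#include <climits>
#include <cstring>
#include <limits>

// Compile-time checks for the assumptions (requires C++11 or later).
static_assert(std::numeric_limits<float>::is_iec559,
              "float must be an IEEE-754 type");
static_assert(sizeof(float) * CHAR_BIT == 32,
              "float must be 32 bits wide");
static_assert(sizeof(float) == sizeof(unsigned int),
              "float and unsigned int must have the same size");
static_assert(std::numeric_limits<unsigned int>::digits ==
                  sizeof(unsigned int) * CHAR_BIT,
              "unsigned int must have no padding bits");

// Hypothetical names; same logic as the memcpy-based pack/unpack pair.
unsigned int pack_float_checked(float x) {
    unsigned int r;
    std::memcpy(&r, &x, sizeof r);
    return r >> 2;  // drop the two lowest mantissa bits
}

float unpack_float_checked(unsigned int x) {
    x <<= 2;        // the dropped bits come back as zeros
    float r;
    std::memcpy(&r, &x, sizeof r);
    return r;
}
```

If one of the assumptions does not hold on a given platform, the build fails with a readable message instead of silently producing wrong values.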

This avoids the 3.10/15 violation and is typically very fast; GCC, at least, treats memcpy as an intrinsic function. If you don't need the functions to work with NaNs, infinities, or numbers of extremely high magnitude, you can even improve accuracy by replacing "r >> 2" with "(r+1) >> 2", which rounds to a nearest 30-bit value instead of always truncating (the worst-case error drops from 3 to 2 units in the last place of the original float):

unsigned int pack_float(float x) {
    unsigned int r;
    std::memcpy(&r, &x, sizeof r);
    return (r + 1) >> 2;  // round to nearest instead of truncating
}
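To see the difference between the two variants on a concrete bit pattern, here is a small sketch. All helper names (to_bits, from_bits, pack_trunc, pack_round, unpack30) are hypothetical and exist only for this comparison; it assumes the same 32-bit IEEE float and 32-bit unsigned int as above.

```cpp
#include <cstring>

// Hypothetical helpers: view a float's bit pattern as an integer and back.
unsigned int to_bits(float x) {
    unsigned int r;
    std::memcpy(&r, &x, sizeof r);
    return r;
}

float from_bits(unsigned int b) {
    float r;
    std::memcpy(&r, &b, sizeof r);
    return r;
}

// Truncating variant: always rounds the bit pattern down.
unsigned int pack_trunc(float x) { return to_bits(x) >> 2; }

// Rounding variant: low bits 11 round up, low bits 01 and 10 round down.
unsigned int pack_round(float x) { return (to_bits(x) + 1) >> 2; }

// Restores a float from 30 bits; the two dropped bits become zeros.
float unpack30(unsigned int p) { return from_bits(p << 2); }
```

For the float with bit pattern 0x3F800003 (three representable steps above 1.0), pack_trunc followed by unpack30 gives the bit pattern 0x3F800000, which is off by three steps, while pack_round followed by unpack30 gives 0x3F800004, which is off by one.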

This works even if the rounding changes the exponent because of a mantissa overflow, since the IEEE-754 encoding maps consecutive floating-point values of the same sign to consecutive integers (ignoring +/- zero). This mapping actually approximates a logarithm quite well.
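Both properties can be checked directly. The sketch below uses made-up helper names (bits_of, successor_is_next_integer, approx_log2) and assumes a positive, finite, 32-bit IEEE float; approx_log2 additionally assumes a normal (not denormal) value.

```cpp
#include <cmath>
#include <cstring>
#include <limits>

// Hypothetical helper: the float's bit pattern as an unsigned integer.
unsigned int bits_of(float x) {
    unsigned int r;
    std::memcpy(&r, &x, sizeof r);
    return r;
}

// For a positive finite x, the next representable float has the next
// integer as its bit pattern, even when the exponent field changes.
bool successor_is_next_integer(float x) {
    float next = std::nextafter(x, std::numeric_limits<float>::infinity());
    return bits_of(next) == bits_of(x) + 1;
}

// Reading the bit pattern as an integer, scaling by 2^-23 and removing
// the exponent bias of 127 gives a piecewise-linear estimate of log2(x).
double approx_log2(float x) {
    return static_cast<double>(bits_of(x)) / 8388608.0 - 127.0;
}
```

The largest float below 2.0 has the bit pattern 0x3FFFFFFF and 2.0 itself is 0x40000000, so adding 1 carries out of the mantissa into the exponent and still lands on the next float. approx_log2 is exact at powers of two (it returns 0 for 1.0 and 1 for 2.0) and returns 0.5 for 1.5, where the true value is about 0.585.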

sellibitze answered Dec 10 '22 23:12
