 

"Shared Exponent" representation of a floating point vector in OpenCL C

In OpenCL, I want to store a 3D vector using a "shared exponent" representation for compact storage. Typically, a 3D floating point vector is stored as 3 separate float values (or 4 when aligned properly). This requires 12 (16) bytes of storage for single precision; if you don't need that accuracy, you can use "half" precision floats and shrink it down to 6 (8) bytes.

When using half precision and 3 separate values, the memory looks like this (no alignment considered):

  • x coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa
  • y coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa
  • z coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa

I'd like to shrink this down to 4 bytes by using a shared exponent, as OpenGL does in one of its internal texture formats ("RGB9_E5"). This means that the component with the largest absolute value determines the exponent of the whole vector. This exponent is then used implicitly for each component. Tricks such as "normalized" storage with an implicit "1." in front of the mantissa don't work in this case. Such a representation works like this (the actual parameters could be tweaked, so this is just an example):

  • x coordinate: 1 bit sign, 8 bits mantissa
  • y coordinate: 1 bit sign, 8 bits mantissa
  • z coordinate: 1 bit sign, 8 bits mantissa
  • 5 bits shared exponent
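
To make the trade-off of this layout concrete, here is a small sketch in plain C (not OpenCL device code). The helper names `pow2i`, `quantize` and `dequantize` are made up for illustration, and the 8-bit mantissa is taken from the example layout above. It shows that a component is stored as a fraction of 2^e in steps of 2^e / 256, so components much smaller than the largest one lose precision or vanish.

```c
#include <stdint.h>

/* 2^e for a small integer e, computed without math.h. */
static float pow2i(int e) {
    float p = 1.0f;
    for (; e > 0; --e) p *= 2.0f;
    for (; e < 0; ++e) p *= 0.5f;
    return p;
}

/* 8-bit mantissa of |v| under a shared exponent e (assumes |v| < 2^e). */
static uint8_t quantize(float v, int e) {
    float a = v < 0.0f ? -v : v;
    return (uint8_t)(a / pow2i(e) * 256.0f);
}

/* Magnitude represented by mantissa m under the shared exponent e. */
static float dequantize(uint8_t m, int e) {
    return (float)m * pow2i(e) / 256.0f;
}
```

For the vector (3.0, -0.5, 1.25) the largest magnitude is 3.0, so e = 2 and the mantissas come out as 192, 32 and 80, all of which decode exactly. For (100.0, 0.1, 0.0) the exponent is 7, the step size is 0.5, and the 0.1 component is stored as 0.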

I'd like to store this in an OpenCL uint type (32 bits) or something equivalent (e.g. uchar4). The question now is:

How can I convert from and into this representation to and from float3 as fast as possible?

My idea is like this, but I'm sure there is some "bit hacking" trick which uses the bit representation of IEEE floats to circumvent the floating point ALU:

  • Use uchar4 as the representative type. Store the x, y, z mantissas in the x, y, z components of this uchar4. The w component is split up into the 5 less significant bits (w & 0x1F) for the shared exponent and the three more significant bits (w >> 5) & 1, (w >> 6) & 1 and (w >> 7) & 1, which are the signs for x, y and z, respectively.
  • Note that the exponent is "biased" by 16, i.e. a stored value of 16 means that the represented numbers are up to (not including) 1.0, a stored value of 19 means values up to (not including) 8.0 and so on.
  • "Unpacking" this representation into a float3 could be done using this code:

    float3 unpackCompactVector(uchar4 packed) {
        // Low 5 bits of w: shared exponent, stored with a bias of 16.
        float exp = (float)(packed.w & 0x1F) - 16.0f;
        // The 8-bit mantissas are fractions of 256, scaled by 2^exp.
        float factor = exp2(exp) / 256.0f;
        // Bits 5, 6 and 7 of w hold the signs of x, y and z.
        float x = (float)(packed.x) * factor * (packed.w & 0x20 ? -1.0f : 1.0f);
        float y = (float)(packed.y) * factor * (packed.w & 0x40 ? -1.0f : 1.0f);
        float z = (float)(packed.z) * factor * (packed.w & 0x80 ? -1.0f : 1.0f);
        return (float3)(x, y, z);
    }
    
  • "Packing" a float3 into this representation could be done using this code:

    uchar4 packCompactVector(float3 vec) {
        // fabs, not abs: in OpenCL C, abs is defined for integer types only.
        float xAbs = fabs(vec.x);   uchar xSign = vec.x < 0.0f ? 0x20 : 0;
        float yAbs = fabs(vec.y);   uchar ySign = vec.y < 0.0f ? 0x40 : 0;
        float zAbs = fabs(vec.z);   uchar zSign = vec.z < 0.0f ? 0x80 : 0;
        float maxAbs = max(max(xAbs, yAbs), zAbs);
        // Smallest exponent with maxAbs < 2^exp (not defined for a zero vector).
        int exp = (int)floor(log2(maxAbs)) + 1;
        float factor = exp2((float)exp);
        uchar xMant = (uchar)floor(xAbs / factor * 256.0f);
        uchar yMant = (uchar)floor(yAbs / factor * 256.0f);
        uchar zMant = (uchar)floor(zAbs / factor * 256.0f);
        // Bias the exponent by 16 and merge the sign bits; out-of-range exponents wrap.
        uchar w = (uchar)(((exp + 16) & 0x1F) | xSign | ySign | zSign);
        return (uchar4)(xMant, yMant, zMant, w);
    }
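
As one possible way to avoid exp2 when unpacking, the scale factor can be assembled directly as an IEEE 754 bit pattern, and the sign can be applied by setting the float's sign bit. The following is only a sketch in plain C under the layout described above. `Packed` and `Vec3` are hypothetical stand-ins for uchar4 and float3, and the memcpy helpers play the role that as_uint() and as_float() would play in OpenCL C.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t x, y, z, w; } Packed;  /* stand-in for uchar4 */
typedef struct { float x, y, z; } Vec3;         /* stand-in for float3 */

/* Bit casts between float and its 32-bit pattern. */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* mant * factor, negated by setting the sign bit when negative == 1. */
static float component(uint8_t mant, float factor, uint32_t negative) {
    return bits_float(float_bits((float)mant * factor) | (negative << 31));
}

static Vec3 unpack_compact(Packed p) {
    /* factor = 2^(stored - 16) / 256 = 2^(stored - 24).
       As a float this is the biased exponent (stored - 24 + 127) = stored + 103
       with an all-zero mantissa field. */
    float factor = bits_float((uint32_t)((p.w & 0x1F) + 103) << 23);
    Vec3 v;
    v.x = component(p.x, factor, (uint32_t)(p.w >> 5) & 1u);
    v.y = component(p.y, factor, (uint32_t)(p.w >> 6) & 1u);
    v.z = component(p.z, factor, (uint32_t)(p.w >> 7) & 1u);
    return v;
}
```

Since the stored exponent is always in 0..31, the assembled biased exponent stays in 103..134 and is always a valid normal float. A zero mantissa with its sign bit set decodes to -0.0, which compares equal to 0.0.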
    

I've put an equivalent implementation in C++ online on ideone. The test cases show the transition from exp = 3 to exp = 4 (with the bias of 16 this is encoded as 19 and 20, respectively) by encoding numbers around 8.0.

This implementation seems to work at first sight. But:

  • There are some corner cases I didn't cover, in particular over- and underflow (of the exponent).
  • I don't want to use floating point math functions like log2 because they are slow.
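
Both points could be addressed along the following lines. This is a sketch in plain C rather than a tested OpenCL kernel, and it is one possible approach, not a known-optimal one. It reads floor(log2(maxAbs)) + 1 straight from the IEEE 754 exponent field, clamps it to the range the 5-bit field can hold, and saturates the mantissas on overflow. `Packed` is a hypothetical stand-in for uchar4, the memcpy helpers correspond to as_uint() and as_float() in OpenCL C, and inputs are assumed not to be NaN.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for OpenCL's uchar4 so that the sketch compiles as plain C. */
typedef struct { uint8_t x, y, z, w; } Packed;

/* Bit casts between float and its 32-bit pattern. */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Truncate a non-negative value to 8 bits, saturating at 255. */
static uint8_t to_mantissa(float m) {
    return (uint8_t)(m > 255.0f ? 255.0f : m);
}

static Packed pack_compact(float x, float y, float z) {
    uint32_t bx = float_bits(x), by = float_bits(y), bz = float_bits(z);
    /* Clear the sign bits; non-negative floats order like their bit patterns. */
    uint32_t ax = bx & 0x7FFFFFFFu, ay = by & 0x7FFFFFFFu, az = bz & 0x7FFFFFFFu;
    uint32_t amax = ax > ay ? ax : ay;
    if (az > amax) amax = az;

    /* floor(log2(maxAbs)) + 1, read from the exponent field instead of log2. */
    int e = (int)(amax >> 23) - 127 + 1;
    if (e < -16) e = -16;  /* underflow: zero and very small vectors */
    if (e > 15) e = 15;    /* overflow: mantissas saturate at 255 */

    /* scale = 2^(8 - e), assembled directly as a float bit pattern. */
    float scale = bits_float((uint32_t)(127 + 8 - e) << 23);

    Packed p;
    p.x = to_mantissa(bits_float(ax) * scale);
    p.y = to_mantissa(bits_float(ay) * scale);
    p.z = to_mantissa(bits_float(az) * scale);
    /* Biased exponent in bits 0..4, IEEE sign bits moved to bits 5..7. */
    p.w = (uint8_t)((uint32_t)(e + 16) | ((bx >> 31) << 5)
                    | ((by >> 31) << 6) | ((bz >> 31) << 7));
    return p;
}
```

With this sketch, (7.9, -1.0, 0.5) packs to mantissas 252, 32, 16 with a stored exponent of 19 and the y sign bit set, and (8.0, 0, 0) moves to a stored exponent of 20 with mantissa 128. One difference from a `v < 0` test is that -0.0 gets its sign bit stored, which is harmless after decoding.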

Can you suggest a better way to achieve my goal?

Note that I only need OpenCL "device code" for this; I don't need to convert between the representations in the host program. But I added the C tag since a solution is most probably independent of the OpenCL language features (OpenCL is almost C and it also uses IEEE 754 floats, bit manipulation works the same, etc.).

leemes asked Nov 12 '22 03:11



1 Answer

If you used CL/GL interop and stored your data in an OpenGL texture in RGB9_E5 format and if you could create an OpenCL image from that texture, you could leverage the hardware texture unit to do the conversion into a float4 upon reading from the image. It might be worth trying.
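
For reference, this is roughly the conversion the texture unit would perform. The sketch below is plain C and decodes one RGB9_E5 texel on the CPU, based on the layout in the OpenGL shared-exponent texture specification: 9-bit mantissas for red, green and blue in the low bits, a 5-bit exponent with a bias of 15 in the top bits. The `Rgb` struct and the function name are made up for illustration. Note that RGB9_E5 has no sign bits, so it can only hold non-negative components, and whether a given OpenCL implementation accepts this texture format for interop would need to be checked.

```c
#include <stdint.h>

typedef struct { float r, g, b; } Rgb;

/* Decode one RGB9_E5 texel: component = mantissa * 2^(exponent - 15 - 9). */
static Rgb decode_rgb9e5(uint32_t texel) {
    uint32_t e = texel >> 27;  /* shared exponent, 0..31 */
    /* 2^(e - 24), computed exactly without math.h. */
    float scale = (float)(1u << e) / 16777216.0f;
    Rgb out;
    out.r = (float)(texel & 0x1FFu) * scale;          /* bits 0..8   */
    out.g = (float)((texel >> 9) & 0x1FFu) * scale;   /* bits 9..17  */
    out.b = (float)((texel >> 18) & 0x1FFu) * scale;  /* bits 18..26 */
    return out;
}
```

For example, a texel with exponent 16 and mantissas 256, 128 and 64 decodes to (1.0, 0.5, 0.25).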

Dithermaster answered Nov 14 '22 23:11
