A well-known procedure for packing/unpacking a 32-bit float (e.g. a depth value) into a vec4 stored in GL_RGBA8 format is as follows:
vec4 pack(const in float depth)
{
    // Shift depth left by 24, 16, 8 and 0 bits and keep the fractional parts.
    const vec4 bit_shift = vec4(256.0 * 256.0 * 256.0, 256.0 * 256.0, 256.0, 1.0);
    const vec4 bit_mask = vec4(0.0, 1.0 / 256.0, 1.0 / 256.0, 1.0 / 256.0);
    vec4 res = fract(depth * bit_shift);
    // Subtract the bits that belong to the next component,
    // leaving one octet per component, encoded as a fraction.
    res -= res.xxyz * bit_mask;
    return res;
}
float unpack(const in vec4 rgba_depth)
{
    // Shift each octet back into its position and sum the components.
    const vec4 bit_shift = vec4(1.0 / (256.0 * 256.0 * 256.0), 1.0 / (256.0 * 256.0), 1.0 / 256.0, 1.0);
    float depth = dot(rgba_depth, bit_shift);
    return depth;
}
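For context, a minimal usage sketch might look like the following: the packed value is written to an RGBA8 color attachment in one pass and read back from a sampler in another. The names shadowMap and shadowCoord are placeholders of mine, not part of the original code:

// writing pass: store the fragment depth in a GL_RGBA8 target
gl_FragColor = pack(gl_FragCoord.z);

// reading pass: recover the depth from the texture
float depth = unpack(texture2D(shadowMap, shadowCoord.xy));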
The pack routine correctly packs 32 bits into 4 x 8-bit values. However, OpenGL converts each component to an 8-bit unsigned integer (unorm) by multiplying by 255 and truncating. This is a lossy operation; for example, 0.11011011 multiplied by 255 gives 11011010 after truncation, so the least-significant bit is "lost". That makes sense, because in order to preserve the octet we would really need to multiply it by 256, which amounts to an 8-bit left shift.
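To spell that example out in decimal: 0.11011011 in binary is 219/256 ≈ 0.85547; multiplying by 255 gives 218.144..., which truncates to 218 = 11011010 in binary, flipping the lowest bit from 1 to 0.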
If this happens to the most-significant octet, for example, I see no point in storing the remaining three less-significant octets.
Is my reasoning flawed? Am I missing something?
The algorithm in question seems bogus; it doesn't take into account the fact that OpenGL multiplies by 255 and truncates when converting to unorm. (D3D's FLOAT-to-UNORM conversion is basically the same, except that it adds 0.5 before truncating, but it's still lossy.)
What we really need is for OpenGL to multiply by 256 to store each octet as an integer, and to divide by 256 to restore it. That is impossible, but we can achieve the same effect with a trick: first encode the octets as integers (instead of fractions, as pack does) and then divide by 255 ourselves. OpenGL will then multiply by 255, yielding the original octet.
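Per channel, the core of the trick could be sketched like this (the encode/decode names are mine, and the truncation in decode relies on caveat #1 below):

// encode one octet: extract it as an integer 0..255, then pre-divide by 255
// so that OpenGL's multiply-by-255 on store restores the integer exactly
float encode(const in float frac) { return floor(frac * 256.0) / 255.0; }

// decode one octet: undo OpenGL's divide-by-255 on read, truncate back to
// the integer, and return the octet as a fraction again
float decode(const in float stored) { return floor(stored * 255.0) / 256.0; }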
In more detail:
pack() returns a vec4 whose components are fractions representing the four octets that comprise the original number. As an example, assume the original float is a 16-bit fraction, 0.1101101110111101. A two-component version of pack would then output res = [0.10111101, 0.11011011] (the least-significant octet in res[0] and the most-significant octet in res[1], both encoded as fractions).
Instead of fractions, we want integers: [10111101, 11011011], which amounts to multiplying the fractions by 256. Dividing each by 255 then yields two fractions again, but since OpenGL multiplies by 255 to convert to GL_RGBA8, it essentially reverses our division, storing the original integers exactly, without any truncation/rounding. (See caveat #1 below as to why the divide/multiply round trip works.)
Continuing our example on the unpack side, OpenGL first divides each component by 255 and then passes the result to our function: [0.10111101..., 0.11011011...]. We then need to multiply by 255 and truncate to reverse that, and finally shift the octets back into their respective positions to get the original float.
This procedure reproduces the original float number exactly.
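Putting it together, here is a minimal sketch of what corrected pack/unpack routines could look like, assuming depth is in [0, 1) and highp floats. This is my reconstruction of the idea, not code from the original answer:

vec4 pack(const in float depth)
{
    const vec4 bit_shift = vec4(256.0 * 256.0 * 256.0, 256.0 * 256.0, 256.0, 1.0);
    // fract() leaves the octet we want in the top 8 bits of each component;
    // floor(x * 256.0) extracts it as an integer 0..255 and drops the lower bits.
    vec4 octets = floor(fract(depth * bit_shift) * 256.0);
    // Pre-divide by 255 so OpenGL's multiply-by-255 stores the integers exactly.
    return octets / 255.0;
}

float unpack(const in vec4 rgba_depth)
{
    const vec4 bit_shift = vec4(1.0 / (256.0 * 256.0 * 256.0 * 256.0),
                                1.0 / (256.0 * 256.0 * 256.0),
                                1.0 / (256.0 * 256.0),
                                1.0 / 256.0);
    // OpenGL has already divided each stored integer by 255; undo that and
    // truncate to recover the integer octets (see caveat #1 below).
    vec4 octets = floor(rgba_depth * 255.0);
    // Shift each octet back into its position and sum.
    return dot(octets, bit_shift);
}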
This is not a rigorous proof by any means, but: multiplying an octet by 256/255 gives a fraction that is a tiny bit larger than the original (more bits now follow the original LSB). This larger number, when multiplied by 255 by OpenGL, has a better chance of retaining all eight of its most-significant (original) bits, because it gets effectively rounded up instead of down.
To illustrate, take the octet 0.11011011 from the question. As we saw there, if we just multiply it by 255 we lose the LSB (it turns to zero). But if we first multiply by 256/255 we get the repeating fraction 0.11011011..., which is a bit larger than 0.11011011. When OpenGL multiplies it by 255 we get 11011011 after truncation, so all the bits are preserved.
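In decimal: 0.11011011 in binary is 219/256; multiplying by 256/255 gives 219/255 ≈ 0.85882, and OpenGL's multiplication by 255 then yields exactly 219 = 11011011 in binary.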
So the three operations combined (our multiply by 256, our divide by 255, and OpenGL's multiply by 255) are basically equivalent to a single multiplication by 256, i.e. an 8-bit left shift, which is why the trick works.
Caveat #1: I ran a test that divides each integer in [0..255] by 255, then multiplies the result by 255 and finally truncates it (using float math). It turns out that this returns the original numbers in that range.
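A sketch of that round-trip check, expressed in GLSL for consistency (the function name is mine, the original test's form isn't shown, and GPU float behavior may vary by hardware):

// Returns 1.0 if floor((i / 255.0) * 255.0) == i for every integer i in
// [0..255], i.e. if the divide/multiply/truncate round trip is exact.
float roundTripIsExact()
{
    float ok = 1.0;
    for (int i = 0; i < 256; ++i) {
        float stored = float(i) / 255.0;      // our pre-division by 255
        float back = floor(stored * 255.0);   // multiply back and truncate
        if (back != float(i)) ok = 0.0;
    }
    return ok;
}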