Portable way to serialize float as 32-bit integer

Question

I have been struggling with finding a portable way to serialize 32-bit float variables in C and C++ to be sent to and from microcontrollers. I want the format to be well-defined enough so that serialization/de-serialization can be done from other languages as well without too much effort. Related questions are:

Portability of binary serialization of double/float type in C++

Serialize double and float with C

c++ portable conversion of long to double

I know that in most cases a ~~typecast~~ union/memcpy will work just fine because the float representation is the same, but I would prefer to have a bit more control and peace of mind. What I have come up with so far is the following:

void serialize_float32(uint8_t* buffer, float number, int32_t *index) {
    int e = 0;
    float sig = frexpf(number, &e);
    float sig_abs = fabsf(sig);
    uint32_t sig_i = 0;

    if (sig_abs >= 0.5) {
        sig_i = (uint32_t)((sig_abs - 0.5f) * 2.0f * 8388608.0f);
        e += 126;
    }

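    // Cast before shifting so the shifts are done in 32-bit unsigned
    // arithmetic, even on targets where int is only 16 bits wide.
    uint32_t res = ((uint32_t)(e & 0xFF) << 23) | (sig_i & 0x7FFFFF);
    if (sig < 0) {
        res |= (uint32_t)1 << 31;
    }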

    buffer[(*index)++] = (res >> 24) & 0xFF;
    buffer[(*index)++] = (res >> 16) & 0xFF;
    buffer[(*index)++] = (res >> 8) & 0xFF;
    buffer[(*index)++] = res & 0xFF;
}

and

float deserialize_float32(const uint8_t *buffer, int32_t *index) {
    uint32_t res = ((uint32_t) buffer[*index]) << 24 |
                ((uint32_t) buffer[*index + 1]) << 16 |
                ((uint32_t) buffer[*index + 2]) << 8 |
                ((uint32_t) buffer[*index + 3]);
    *index += 4;

    int e = (res >> 23) & 0xFF;
    uint32_t sig_i = res & 0x7FFFFF;
    bool neg = (res & ((uint32_t)1 << 31)) != 0;

    float sig = 0.0;
    if (e != 0 || sig_i != 0) {
        sig = (float)sig_i / (8388608.0 * 2.0) + 0.5;
        e -= 126;
    }

    if (neg) {
        sig = -sig;
    }

    return ldexpf(sig, e);
}

The frexp and ldexp functions seem to be made for this purpose, but in case they aren't available I also tried to implement them manually using more commonly available functions:

float frexpf_slow(float f, int *e) {
    if (f == 0.0) {
        *e = 0;
        return 0.0;
    }

    *e = (int)ceilf(log2f(fabsf(f)));
    float res = f / powf(2.0, (float)*e);

    // Make sure that the magnitude stays below 1 so that no overflow occurs
    // during serialization. This seems to be required after doing some manual
    // testing.

    if (res >= 1.0) {
        res -= 0.5;
        *e += 1;
    }

    if (res <= -1.0) {
        res += 0.5;
        *e += 1;
    }

    return res;
}

and

float ldexpf_slow(float f, int e) {
    return f * powf(2.0, (float)e);
}

One thing I have been considering is whether to use 8388608 (2^23) or 8388607 (2^23 - 1) as the multiplier. The documentation says that frexp returns values that are less than 1 in magnitude, and after some experimentation it seems that 8388608 gives results that are bit-accurate with actual floats; I could not find any corner case where this overflows. That might not be true with a different compiler or system, though. If this can become a problem, a smaller multiplier that slightly reduces the accuracy is fine with me as well. I know that this does not handle Inf or NaN, but for now that is not a requirement.
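The overflow concern can be checked at the boundary. The following is only a sketch, assuming float is IEEE 754 binary32 (so that FLT_EPSILON is 2^-23); the helper names significand_to_field and max_significand_field are hypothetical and not part of the code above. It applies the same mapping as serialize_float32 to the largest float below 1.0, which is the largest magnitude frexpf can return:

```c
#include <float.h>
#include <stdint.h>

/* Same mapping as in serialize_float32; sig_abs is assumed to lie in [0.5, 1). */
static uint32_t significand_to_field(float sig_abs) {
    return (uint32_t)((sig_abs - 0.5f) * 2.0f * 8388608.0f);
}

/* Assumption: float is binary32, so the largest float below 1.0 is
 * 1 - 2^-24, which equals 1.0f - FLT_EPSILON / 2. */
static uint32_t max_significand_field(void) {
    const float largest_below_one = 1.0f - FLT_EPSILON / 2.0f;
    return significand_to_field(largest_below_one);
}
```

Under that assumption the largest input maps to 0x7FFFFF, which still fits in the 23-bit field, and 0.5 maps to 0, so the 2^23 multiplier should not overflow for any value in [0.5, 1).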

So, finally, my question is: Does this look like a reasonable approach, or am I just making a complicated solution that still has portability issues?

2501 · Accepted Answer

Assuming the float is in IEEE 754 format, extracting the mantissa, exponent and sign is completely portable:

uint32_t internal;
float value = 0.0f; /* ...some value */
memcpy( &internal , &value , sizeof( value ) );

const uint32_t sign =     ( internal >> 31u ) & 0x1u;
const uint32_t mantissa = ( internal >> 0u  ) & 0x7FFFFFu;
const uint32_t exponent = ( internal >> 23u ) & 0xFFu;

Invert the procedure to construct the float.
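As a rough sketch of that inverse step, assuming float is IEEE 754 binary32 and using a hypothetical helper name float_from_fields, the three fields can be packed back into a uint32_t and copied into a float:

```c
#include <stdint.h>
#include <string.h>

/* Packs sign (1 bit), biased exponent (8 bits) and mantissa (23 bits) into
 * a 32-bit pattern and reinterprets it as a float via memcpy.
 * Assumption: float is IEEE 754 binary32 with the same byte order as uint32_t. */
static float float_from_fields(uint32_t sign, uint32_t exponent, uint32_t mantissa) {
    const uint32_t internal = ((sign & 0x1u) << 31u)
                            | ((exponent & 0xFFu) << 23u)
                            | (mantissa & 0x7FFFFFu);
    float value;
    memcpy(&value, &internal, sizeof(value));
    return value;
}
```

For example, sign 0, exponent 127 and mantissa 0 should give 1.0f, because the exponent field is biased by 127.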

If you only want to send the entire float, then just copy it to the buffer. This will work even if float is not IEEE 754, but it must be 32 bits wide and the endianness of the integer and floating-point types must be the same:

buffer[0] = ( internal >> 0u  ) & 0xFFu;
buffer[1] = ( internal >> 8u  ) & 0xFFu;
buffer[2] = ( internal >> 16u ) & 0xFFu;
buffer[3] = ( internal >> 24u ) & 0xFFu;
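The receiving side can mirror those four lines. This is a minimal sketch under the same assumptions (32-bit float whose byte order matches that of uint32_t on the receiver); the function name float_from_le_bytes is hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Rebuilds the 32-bit pattern from four bytes stored least significant byte
 * first (the order used above) and copies it into a float. */
static float float_from_le_bytes(const uint8_t *buffer) {
    const uint32_t internal = ((uint32_t)buffer[0] << 0u)
                            | ((uint32_t)buffer[1] << 8u)
                            | ((uint32_t)buffer[2] << 16u)
                            | ((uint32_t)buffer[3] << 24u);
    float value;
    memcpy(&value, &internal, sizeof(value));
    return value;
}
```

Because the bytes are assembled with shifts rather than copied directly, the byte order on the wire is fixed regardless of the integer endianness of either machine.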

chqrlie · Answer

You seem to have a bug in serialize_float32: the last 4 lines should read:

buffer[(*index)++] = (res >> 24) & 0xFF;
buffer[(*index)++] = (res >> 16) & 0xFF;
buffer[(*index)++] = (res >> 8) & 0xFF;
buffer[(*index)++] = res & 0xFF;

Your method might not work correctly for infinities and/or NaNs because of the offset by 126 instead of 128. Note that you can validate it by extensive testing: there are only 4 billion values, trying all possibilities should not take very long.
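Such a test could look roughly like the sketch below. Everything in it is an assumption for illustration: the harness count_roundtrip_mismatches, the typedefs, and the memcpy-based reference pair are hypothetical names, and float is assumed to be IEEE 754 binary32. The function-pointer types match the signatures used in the question, so serialize_float32 and deserialize_float32 can be passed in directly. NaN patterns are skipped because their payloads need not survive being passed around as float values.

```c
#include <stdint.h>
#include <string.h>

/* Function-pointer types matching the signatures used in the question. */
typedef void  (*serialize_fn)(uint8_t *buffer, float number, int32_t *index);
typedef float (*deserialize_fn)(const uint8_t *buffer, int32_t *index);

/* Hypothetical harness: visits every stride-th 32-bit pattern (stride 1 is
 * exhaustive), round-trips it through the codec under test, and counts the
 * patterns that do not come back bit-for-bit identical. */
static uint64_t count_roundtrip_mismatches(serialize_fn ser, deserialize_fn de,
                                           uint32_t stride) {
    uint64_t mismatches = 0;
    uint64_t pattern;

    if (stride == 0) {
        stride = 1;
    }
    for (pattern = 0; pattern <= UINT32_C(0xFFFFFFFF); pattern += stride) {
        const uint32_t bits = (uint32_t)pattern;
        uint32_t back;
        uint8_t buffer[4];
        int32_t index = 0;
        float in, out;

        /* Skip NaNs: exponent field all ones and a non-zero mantissa. */
        if ((bits & UINT32_C(0x7F800000)) == UINT32_C(0x7F800000) &&
            (bits & UINT32_C(0x007FFFFF)) != 0) {
            continue;
        }
        memcpy(&in, &bits, sizeof in);
        ser(buffer, in, &index);
        index = 0;
        out = de(buffer, &index);
        memcpy(&back, &out, sizeof back);
        if (back != bits) {
            mismatches++;
        }
    }
    return mismatches;
}

/* Reference codec for comparison: raw bit copy, most significant byte first. */
static void memcpy_serialize(uint8_t *buffer, float number, int32_t *index) {
    uint32_t bits;
    memcpy(&bits, &number, sizeof bits);
    buffer[(*index)++] = (uint8_t)((bits >> 24) & 0xFF);
    buffer[(*index)++] = (uint8_t)((bits >> 16) & 0xFF);
    buffer[(*index)++] = (uint8_t)((bits >> 8) & 0xFF);
    buffer[(*index)++] = (uint8_t)(bits & 0xFF);
}

static float memcpy_deserialize(const uint8_t *buffer, int32_t *index) {
    const uint32_t bits = (uint32_t)buffer[*index] << 24
                        | (uint32_t)buffer[*index + 1] << 16
                        | (uint32_t)buffer[*index + 2] << 8
                        | (uint32_t)buffer[*index + 3];
    float number;
    *index += 4;
    memcpy(&number, &bits, sizeof number);
    return number;
}
```

Calling count_roundtrip_mismatches(serialize_float32, deserialize_float32, 1) would then report how many patterns the frexpf-based pair fails to reproduce; a larger stride gives a quick spot check before committing to the full run.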

The actual representation in memory of float values may differ on different architectures, but IEEE 754 (or more precisely IEC 60559) is largely prevalent today. You can check whether your particular targets claim compliance by testing whether __STDC_IEC_559__ is defined. Note, however, that even if you can assume IEEE 754, you must handle potentially different endianness between the systems. You cannot assume the endianness of floats to be the same as that of integers on the same platform.
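Since some compilers do not define __STDC_IEC_559__ even when their float is binary32, a complementary check can be sketched with the limits from float.h plus a runtime comparison of a known value. The helper names below are hypothetical, and passing both checks is only evidence of a compatible layout, not proof of full IEC 60559 conformance:

```c
#include <float.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if float has the size, radix, precision and exponent range of
 * IEEE 754 binary32, judging only by the <float.h> limits. */
static int float_is_binary32_shaped(void) {
    return sizeof(float) * CHAR_BIT == 32
        && FLT_RADIX == 2
        && FLT_MANT_DIG == 24
        && FLT_MIN_EXP == -125
        && FLT_MAX_EXP == 128;
}

/* Returns 1 if copying 1.0f into a uint32_t yields 0x3F800000, i.e. float
 * and uint32_t use the same byte order on this platform. */
static int float_shares_uint32_byte_order(void) {
    const float one = 1.0f;
    uint32_t bits = 0;
    if (sizeof one != sizeof bits) {
        return 0;
    }
    memcpy(&bits, &one, sizeof bits);
    return bits == UINT32_C(0x3F800000);
}
```

A target that fails the second check would need its bytes reordered before the shift-based packing produces the expected wire format.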

Note also that the simple cast would be incorrect: uint32_t res = *(uint32_t *)&number; violates the strict aliasing rule. You should either use a union (well-defined in C, but undefined behavior in C++) or use memcpy(&res, &number, sizeof(res));

Tags: c++, c, floating-point, embedded

Benjamin Vedder
