
Why use anything but unions for IEEE 754 floating point format?

Tags:

c

ieee-754

I have been studying ways to convert floating-point values (floats and doubles) to IEEE 754 representation for the purpose of creating routines to efficiently send/receive information across network connections (akin to the Perl pack/unpack functions). I have waded through the methods of creating the IEEE 754 representation via Lockless, technical-recipes.com, Bit Twiddling, Bitwizardry, Haskell.org (C++) and the like, but I do not understand why those methods are any faster, more efficient, or better than just using a union to get the conversion. The union conversions involving integer/float or long/double seem like a far better way to let C take care of the sign, exponent, and mantissa than doing it manually with shifts and rotations.

For example, with bit twiddling, you can manually create the IEEE 754 representation with:

/* 23 bits of float fractional data */
#define I2F_FRAC_BITS   23
#define I2F_MASK ((1 << I2F_FRAC_BITS) - 1)

/* Find the log base 2 of an integer (MSB) */
int
getmsb (uint32_t word)
{
    int r = 0;  /* must be initialized: the fallback loop only increments */
#ifdef BUILD_64
    union { uint32_t u[2]; double d; } t;  // temp
    t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
    t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
    t.d -= 4503599627370496.0;
    r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else    
    while (word >>= 1)
    {
        r++;
    }
#endif  /* BUILD_64 */
    return r;
}

/* rotate right; mask the shift so a shift of 0 (or >= 32) is well-defined */
inline uint32_t
rotr (uint32_t value, int shift)
{
    shift &= sizeof (value) * CHAR_BIT - 1;
    return shift ? (value >> shift) | (value << (sizeof (value) * CHAR_BIT - shift))
                 : value;
}

/* unsigned to IEEE 754 */
uint32_t
u2ieee (uint32_t x)
{
    uint32_t msb, exponent, fraction;


    if (!x) return 0;       /* Zero is special */
    msb = getmsb (x);       /* Get location of the most significant bit */
    fraction = rotr (x, (msb - I2F_FRAC_BITS) & 0x1f) & I2F_MASK;
    exponent = (127 + msb) << I2F_FRAC_BITS;

    return fraction + exponent;
}

/* signed int to IEEE 754 */
uint32_t i2ieee (int32_t x)
{
    if (x < 0)  /* negate in unsigned arithmetic so INT32_MIN is not UB */
        return u2ieee (-(uint32_t) x) | 0x80000000u;
    return u2ieee (x);
}

At that point you can convert it to a hex or binary string, put it in a packet, and reverse the process on the other end. (Note, this is just for the 32-bit case; similar functions are needed for 64-bit numbers.) Why do it this way? Why not just put the float or double in a union, which automatically stores it in IEEE 754 representation, and then simply use the int or long representation? It seems all cases could be handled by the following, which seems much less error-prone:

union uif { uint32_t i; float f; };   /* fixed-width types: 'long' may be 32 bits */
union uid { uint64_t i; double d; };

uint32_t
f2ieee (float f) {
    union uif cvt;
    cvt.f = f;
    return cvt.i;
}

float
ieee32f (uint32_t i) {
    union uif cvt;
    cvt.i = i;
    return cvt.f;
}

uint64_t
d2ieee64 (double d) {
    union uid cvt;
    cvt.d = d;
    return cvt.i;
}

double
ieee64d (uint64_t i) {
    union uid cvt;
    cvt.i = i;
    return cvt.d;
}

All of this has been good learning, but I'm missing the most important piece of all. Why do it one way instead of the other? What benefit is provided by manual conversion when simply reading from a union is much less error prone and on its face seems like it would be more efficient? What say the experts?

David C. Rankin asked May 28 '14 17:05

1 Answer

Your suggested "simpler" code does not do the same thing as the code you propose to replace. Your code is the correct way to convert a machine floating-point quantity (which conceivably might not be in IEEE format) to the same-size unsigned integer with the same representation. The "bit-twiddling" code you don't like is (if I understand it correctly) manually computing the IEEE-format floating point quantity with the same numeric value as a given integer. Both of these operations are useful, but in different contexts. For instance, I'd expect to see your suggested code in the implementation of fpclassify on a CPU that has hardware IEEE floating point but no special instruction to classify values, and the "bit-twiddling" code in the implementation of a software floating-point library for a machine that doesn't have hardware floating point at all.

It is unsafe to use bit-fields to extract fields of a floating-point value, because the C standard says that the order in which bit-fields are packed into a struct is implementation-defined (N1570: 6.7.2.1p11), meaning that compilers can choose any ordering they like. They are supposed to document what they do, but they don't have to pick an ordering that "makes sense", and in particular, if you write a struct with bit-fields corresponding to the sign, exponent, and mantissa fields of an IEEE floating-point value, you cannot rely, cross-platform, on those bit-fields lining up with the fields of an actual IEEE floating-point value. There really have been compilers that, for instance, packed bit-fields in the opposite direction from that expected by the target CPU's floating-point unit.

Now, in terms of the letter of the standard, this problem bites you worse if you use bit-shifts and masks to extract fields, because the value you get out of the conversion from a floating-point value to the same-size unsigned integer that you hope has the same representation is unspecified (N1570: 6.2.6.1p7), which is less nailed down than implementation-defined (but more nailed down than undefined). However, in practice, doing it this way is much more likely to work. (I can think of only one, thoroughly obsolete, context where it wouldn't work: some ARM-based systems in the early 1990s had third-party floating-point coprocessors that were big-endian, opposite to the main CPU's choice for integer values. In contrast, there have been dozens of compilers that used the "wrong" ordering for bit-fields; it has even been known to change upon minor upgrades.)

(Have a look at Ada's "representation clauses" sometime, to see what it really takes to give the programmer the ability to align a record type with an external specification of the arrangement of bits in memory. C doesn't even come close.)

(If all you want is to convert from an integer to a float with the same value, and you're not tasked with implementing the compiler back end, you do it by simple assignment: double x = 1123581321; Going the other way you're probably looking for lrint and its friends.)

zwol answered Oct 21 '22 05:10