Properly (independent of 32bit/64bit) saving a float to binary ofstream

Question

So apparently on my machine, float, double and long double each have different sizes each. There also doesn't seem to be a strict standard enforcing how many bytes each of those types would have to be.

How would one, then, save a floating point value into a binary file, and then have it read properly on a different system if the sizes differ? e.g my machine has 8 byte doubles, whereas joe's have 12 byte doubles.

Without having to export it in text form (e.g "0.3232"), that is. Seems a lot less compact than the binary representation.

James Kanze · Accepted Answer

You have to define a format, and implement that. Typically, most of the network protocols I know use IEEE float and double, output big-endian (but other formats are possible). The advantage of using IEEE formats is that it is what most of the current everyday machines use internally; if you're on one of these machines (and portability of your code to other machines, like mainframes, isn't an issue), you can "convert" to the format simply by type-punning to an unsigned int of the same size, and outputting that. So, for example, you might have:

obstream&
operator<<( obstream& dest, uint64_t value )
{
    dest.put((value >> 56) & 0xFF);
    dest.put((value >> 48) & 0xFF);
    dest.put((value >> 40) & 0xFF);
    dest.put((value >> 32) & 0xFF);
    dest.put((value >> 24) & 0xFF);
    dest.put((value >> 16) & 0xFF);
    dest.put((value >>  8) & 0xFF);
    dest.put((value      ) & 0xFF);
    return dest;
}

obstream&
operator<<( obstream& dest, double value )
{
    return dest << reinterpret_cast<uint64_t const&>( value );
}

If you have to be portable to a machine not supporting IEEE (e.g. any of the modern mainframes), you'll need something a bit more complicated:

obstream&
obstream::operator<<( obstream& dest, double value )
{
    bool                isNeg = value < 0;
    if ( isNeg ) {
        value = - value;
    }
    int                 exp;
    if ( value == 0.0 ) {
        exp = 0;
    } else {
        value = ldexp( frexp( value, &exp ), 53 );
        exp += 1022;
    }
    uint64_t mant = static_cast< uint64_t >( value );
    dest.put( (isNeg ? 0x80 : 0x00) | exp >> 4 );
    dest.put( ((exp << 4) & 0xF0) | ((mant >> 48) & 0x0F) );
    dest.put( mant >> 40 );
    dest.put( mant >> 32 );
    dest.put( mant >> 24 );
    dest.put( mant >> 16 );
    dest.put( mant >>  8 );
    dest.put( mant       );
    return dest;
}

(Note that this doesn't handle NaN's and infinities correctly. Personally, I would ban them from the format, since not all floating point representations support them. But then, there's no floating point format on an IBM mainframe which will support 1E306, either, although you can encode it in the IEEE double format above.)

Reading is, of course, the opposite. Either:

ibstream&
operator>>( ibstream& source, uint64_t& results )
{
    uint64_t value = (source.get() & 0xFF) << 56;
    value |= (source.get() & 0xFF) << 48;
    value |= (source.get() & 0xFF) << 40;
    value |= (source.get() & 0xFF) << 32;
    value |= (source.get() & 0xFF) << 24;
    value |= (source.get() & 0xFF) << 16;
    value |= (source.get() & 0xFF) <<  8;
    value |= (source.get() & 0xFF)      ;
    if ( source )
        results = value;
    return source;
}

ibstream&
operator>>( ibstream& source, double& results)
{
    uint64_t tmp;
    source >> tmp;
    if ( source )
        results = reinterpret_cast<double const&>( tmp );
    return source;
}

or if you can't count on IEEE:

ibstream&
ibstream::operator>>( ibstream& source, double& results )
{
    uint64_t tmp;
    source >> tmp;
    if ( source ) {
        double f = 0.0;
        if ( (tmp & 0x7FFFFFFFFFFFFFFF) != 0 ) {
            f = ldexp( ((tmp & 0x000FFFFFFFFFFFFF) | 0x0010000000000000),
                       static_cast<int>( (tmp & 0x7FF0000000000000) >> 52 )
                                - 1022 - 53 );
        }
        if ( (tmp & 0x8000000000000000) != 0 ) {
            f = -f;
        }
        dest = f;
    }
    return source;
}

(This assumes that the input is not an NaN or an infinity.)

Properly (independent of 32bit/64bit) saving a float to binary ofstream

Tags:

c++

floating-point

64-bit

32-bit

kamziro

1 Answers

James Kanze

Recent Activity

Donate For Us

Properly (independent of 32bit/64bit) saving a float to binary ofstream

Tags:

c++

floating-point

64-bit

32-bit

kamziro

1 Answers

James Kanze

Related questions

Recent Activity

Donate For Us