Representing a float in a single byte

Question

I have a situation which requires a float to be represented in a single char. The range that this 'minifloat' needs to represent is 0 to 10e-7, so we can always assume that the number is positive and the exponent negative, which saves the bits that would otherwise go to the signs.

The representation that I have thought about going with is 3 bits of exponent, and 5 bits mantissa (with 1 implied bit), with the exponent being in base 10, i.e. x = man * 10^exp.

To convert from a float to my minifloat, I plan to use frexp, and use some maths to convert from base 2 to base 10.

Is this a sensible approach? Or are there better ways to achieve this?

Stephen Canon · Accepted Answer

Do you actually need the value to be floating point (i.e. to have roughly constant precision as the value scales)? What are you going to do with these values?

A much simpler (and more efficient) idea would be to interpret the 8 bits as an unsigned fixed-point number with an implicit scale, chosen so that 0 maps to 0 and 255 maps to 1e-7 (one step is 1e-7 / 255). I.e.:

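#include <stdint.h>

// Decode: 255 maps to 1e-7, so each step is 1e-7 / 255.
float toFloat(uint8_t x) {
    return x / 255.0e7;
}

// Encode: clamp to [0, 1e-7], then scale to 0..255.
uint8_t fromFloat(float x) {
    if (!(x > 0)) return 0;   // negatives, zero, and NaN
    if (x > 1e-7) return 255; // saturate at full scale
    return 255.0e7 * x; // this truncates; add 0.5 to round instead
}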
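
For illustration, here is a sketch of the rounding variant that the comment above hints at. The names `kFullScale`, `decodeFixed` and `encodeFixed` are made up for this example, and the full-scale value of 1e-7 is taken from the answer rather than from the question's stated range. With round-to-nearest the worst-case quantization error should be half a step, i.e. 1e-7 / 510.

```cpp
#include <cmath>
#include <cstdint>

// Assumed full-scale value: code 255 decodes to 1e-7.
constexpr double kFullScale = 1e-7;

// Decode: code * (1e-7 / 255).
inline float decodeFixed(std::uint8_t code) {
    return static_cast<float>(code * (kFullScale / 255.0));
}

// Encode with round-to-nearest instead of truncation.
inline std::uint8_t encodeFixed(float x) {
    if (!(x > 0.0f)) return 0;        // negatives, zero, and NaN
    if (x >= kFullScale) return 255;  // saturate at full scale
    // x is in (0, 1e-7) here, so the scaled value is in (0, 255).
    return static_cast<std::uint8_t>(std::lround(x * (255.0 / kFullScale)));
}
```

Every code should survive a decode/encode round trip unchanged, which is a cheap sanity check to run before relying on the format.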

Eric Postpischil · Answer

If it serves your purposes, it is reasonable to use such a format as a storage or transmission format, that is, for recording data in a small space. You should verify that the rounding errors from this format are not too large for your needs, that the range is suitable, et cetera.

This would not be a good format for calculation, because it would be slow on normal hardware.

I do not understand what base conversion you would be doing. If you have an IEEE-754 floating-point number in a float, then the job of converting to or from your 8-bit format is one of rounding the significand (the fraction) when going to the narrower format and of adjusting the exponent bias, plus handling special cases (denormals, overflow, NaNs). This would just involve binary arithmetic, not decimal.
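
A minimal sketch of what such a binary conversion might look like, under assumptions that are not from this thread: an unsigned layout with 3 exponent bits and 5 fraction bits (as in the question), an exponent bias of 31, and exponent field 0 reserved for subnormals so that zero is representable. The names `decodeMini` and `encodeMini` are hypothetical. With this bias the largest code decodes to 63 * 2^-29, about 1.17e-7, so the sketch assumes the required range tops out near 1e-7; a different range would need a different bias.

```cpp
#include <cmath>
#include <cstdint>

// Assumed layout (illustrative only): EEEFFFFF, no sign bit.
//   E == 0: subnormal, value = F * 2^-35
//   E >= 1: normal,    value = (32 + F) * 2^(E - 36) = (1 + F/32) * 2^(E - 31)
// Code 255 decodes to 63 * 2^-29, about 1.17e-7.

inline float decodeMini(std::uint8_t code) {
    const int e = code >> 5;
    const int f = code & 31;
    if (e == 0) return std::ldexp(static_cast<float>(f), -35);
    return std::ldexp(static_cast<float>(32 + f), e - 36);
}

inline std::uint8_t encodeMini(float x) {
    if (!(x > 0.0f)) return 0;             // zero, negatives, and NaN
    if (x >= decodeMini(255)) return 255;  // saturate, including +infinity
    int e2 = 0;
    const float m = std::frexp(x, &e2);    // x = m * 2^e2 with m in [0.5, 1)
    const int unbiased = e2 - 1;           // exponent for a significand in [1, 2)
    if (unbiased < -30) {
        // Subnormal range: round x / 2^-35 to an integer in 0..32.
        // A result of 32 is code 32, the smallest normal value.
        return static_cast<std::uint8_t>(std::lround(std::ldexp(x, 35)));
    }
    // Normal range: round the significand to 5 fraction bits, giving 32..64
    // in units of 1/32. A value of 64 carries into the exponent field
    // through the addition below.
    const long sig = std::lround(std::ldexp(m, 6));
    return static_cast<std::uint8_t>((unbiased + 31) * 32 + (sig - 32));
}
```

Note that `std::lround` rounds ties away from zero, whereas IEEE-754 hardware conversions normally round ties to even; which of the two is appropriate depends on the application. Everything here is binary scaling with `frexp` and `ldexp`, with no decimal arithmetic involved.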

As an aside, note that the proper term for the fraction portion of a floating-point number is “fraction” or “significand” (the term used in the IEEE-754 standard). A “mantissa” is the fractional portion of a logarithm.

aka.nice · Answer

An alternative is to use a static array of 256 floats (or doubles), with values chosen according to your own criteria.

Then the conversion unsigned char -> float/double is trivial: it is a single array lookup.

The conversion float/double -> unsigned char is a bit more involved (find the nearest float in the static array); it would cost about 8 comparisons with a naive binary search, but you may be able to do better depending on how you chose the values in the static array.

Of course, operations would be performed with native float/double.
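
As a rough sketch of the table approach: the quadratic spacing below (denser near zero, topping out at 1e-7) is only a placeholder for whatever criteria you pick, and `codebook`, `decodeTable` and `encodeTable` are made-up names. The search assumes only that the table is strictly increasing.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical table: 256 strictly increasing values over [0, 1e-7],
// spaced quadratically so that small values get finer resolution.
inline const std::array<float, 256>& codebook() {
    static const std::array<float, 256> table = [] {
        std::array<float, 256> t{};
        for (std::size_t i = 0; i < t.size(); ++i) {
            const double u = static_cast<double>(i) / 255.0;
            t[i] = static_cast<float>(1e-7 * u * u);
        }
        return t;
    }();
    return table;
}

// unsigned char -> float: a single lookup.
inline float decodeTable(std::uint8_t code) {
    return codebook()[code];
}

// float -> unsigned char: binary search, then pick the closer neighbour.
inline std::uint8_t encodeTable(float x) {
    const auto& t = codebook();
    if (!(x > t.front())) return 0;   // at or below the first entry, or NaN
    if (x >= t.back()) return 255;    // at or above the last entry
    // First entry that is >= x; the checks above guarantee that it exists
    // and that it is not the first element.
    const auto hi = std::lower_bound(t.begin(), t.end(), x);
    const auto lo = hi - 1;
    const auto nearest = (x - *lo <= *hi - x) ? lo : hi;
    return static_cast<std::uint8_t>(nearest - t.begin());
}
```

`std::lower_bound` performs the roughly 8 comparisons mentioned above; a table with a closed-form inverse (such as this quadratic one) could skip the search entirely by computing the index directly.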

Tags: c++, floating-point

Matt Dunn
