Computing the ranges of floating point data types

Tags: c

Is it possible to compute the ranges of the float, double and long double data types in a portable way, using only ANSI C and without reading float.h? By portable, I mean including those cases where the target machine does not adhere to the IEEE 754 standard.

I'm reading the K&R book, and exercise 2-1 asks me to "compute" them, so I suppose that means avoiding float.h entirely, since it already defines FLT_MIN, FLT_MAX, DBL_MIN and DBL_MAX (reading those values directly would certainly not qualify as "computing").

asked Feb 08 '09 by Ree


2 Answers

It's possible (at least for IEEE 754 float and double values) to compute the greatest floating-point value via (pseudo-code):

~(-1.0) | 0.5

Before we can do our bit-twiddling, we'll have to convert the floating-point values to integers and then back again. This can be done in the following way:

#include <stdint.h>

uint64_t m_one, half;
double max;

/* type punning via pointer casts; this formally violates C's strict
   aliasing rules (memcpy would be the well-defined alternative) */
*(double *)(void *)&m_one = -1.0;           /* bit pattern of -1.0 */
*(double *)(void *)&half = 0.5;             /* bit pattern of 0.5 */
*(uint64_t *)(void *)&max = ~m_one | half;  /* twiddle, store as double */

So how does it work? For that, we have to know how the floating-point values will be encoded.

The highest bit encodes the sign, the next k bits encode the exponent and the lowest bits will hold the fractional part. For powers of 2, the fractional part is 0.

The exponent will be stored with a bias (offset) of 2**(k-1) - 1, which means an exponent of 0 corresponds to a pattern with all but the highest bit set.
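
To make the layout concrete, here is a small sketch (assuming IEEE 754 binary64, i.e. k = 11 and a bias of 1023) that decodes a double into its three fields; note that 1.0 comes out with an exponent field equal to the bias:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void decode(double d)
{
    uint64_t bits, frac;
    unsigned sign, exp;

    memcpy(&bits, &d, sizeof bits);          /* well-defined type punning */
    sign = (unsigned)(bits >> 63);           /* 1 sign bit */
    exp  = (unsigned)((bits >> 52) & 0x7FF); /* 11 exponent bits */
    frac = bits & ((1ULL << 52) - 1);        /* 52 fraction bits */
    printf("%g: sign=%u exponent=%u fraction=%llx\n",
           d, sign, exp, (unsigned long long)frac);
}

int main(void)
{
    decode(1.0);   /* exponent field 1023, i.e. the bias */
    decode(-1.0);  /* same exponent, but with the sign bit set */
    decode(0.5);   /* exponent field 1022, i.e. bias - 1 */
    return 0;
}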

There are two exponent bit patterns with special meaning:

  • if no bit is set, the value will be 0 if the fractional part is zero; otherwise, the value is a subnormal
  • if all bits are set, the value is either infinity or NaN

This means the greatest regular exponent will be encoded by setting all exponent bits except the lowest one, which corresponds to a value of 2**k - 2, or 2**(k-1) - 1 if you subtract the bias.

For double values, k = 11, i.e. the highest regular exponent will be 1023, so the greatest floating-point value is on the order of 2**1023, which is about 1E+308.

The greatest value will have

  • the sign bit set to 0
  • all but the lowest exponent bit set to 1
  • all fractional bits set to 1

Now, it's possible to understand how our magic numbers work:

  • -1.0 has its sign bit set; its exponent is the bias, i.e. all exponent bits but the highest one are set, and its fractional part is 0
  • ~(-1.0) has only the highest exponent bit and all fractional bits set
  • 0.5 has a sign bit and fractional part of 0; its exponent is the bias minus 1, i.e. all but the highest and lowest exponent bits are set

When we combine these two values via bitwise OR, we get exactly the bit pattern we wanted.
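
Putting it all together, here is the whole trick as a self-contained program (again a sketch assuming IEEE 754 binary64; it uses memcpy instead of pointer casts to stay within the aliasing rules) that checks the result against DBL_MAX:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <float.h>

int main(void)
{
    double m_one = -1.0, half = 0.5, max;
    uint64_t a, b, bits;

    memcpy(&a, &m_one, sizeof a);
    memcpy(&b, &half, sizeof b);

    bits = ~a | b;                  /* the magic formula, bit by bit */
    memcpy(&max, &bits, sizeof max);

    printf("computed: %g\n", max);      /* about 1.79769e+308 */
    printf("DBL_MAX:  %g\n", DBL_MAX);  /* should be identical */
    return 0;
}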


The computation works for x86 80-bit extended-precision values (aka long double) as well, but the bit-twiddling must be done byte-wise, as there's no integer type large enough to hold an 80-bit value.
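
A byte-wise sketch (assuming an x86 extended-precision long double; note that sizeof(long double) may include padding bytes beyond the 10 significant ones, which the FPU ignores when loading the value):

#include <stdio.h>
#include <string.h>

int main(void)
{
    long double m_one = -1.0L, half = 0.5L, max;
    unsigned char a[sizeof(long double)], b[sizeof(long double)];
    size_t i;

    memcpy(a, &m_one, sizeof a);
    memcpy(b, &half, sizeof b);

    /* ~(-1.0) | 0.5, one byte at a time; any padding bytes end up
       with garbage, but they don't take part in the encoding */
    for (i = 0; i < sizeof a; i++)
        a[i] = (unsigned char)(~a[i] | b[i]);

    memcpy(&max, a, sizeof max);
    printf("computed max = %Lg\n", max);  /* about 1.18973e+4932 */
    return 0;
}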

The bias isn't actually required to be 2**(k-1) - 1; the computation will work for an arbitrary bias as long as it is odd. The bias must be odd because otherwise the bit patterns for the exponents of 1.0 and 0.5 would differ in places other than the lowest bit.

If the base b (aka radix) of the floating-point type is not 2, you have to use b**(-1) instead of 0.5 = 2**(-1).

If the greatest exponent value is not reserved, use 1.0 instead of 0.5. This will work regardless of base or bias (meaning the bias is no longer restricted to odd values). The difference when using 1.0 is that the lowest exponent bit won't be cleared.


To summarize:

~(-1.0) | 0.5

works as long as the radix is 2, the bias is odd and the highest exponent is reserved.

~(-1.0) | 1.0

works for any radix or bias as long as the highest exponent is not reserved.

answered Nov 15 '22 by Christoph


For 99.99% of all applications, you should assume IEEE 754 and use the constants defined in <float.h>. In the other 0.01%, you'll be working with very specialized hardware, and in that case, you should know what to use based on the hardware.
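
For example, a minimal illustration of the <float.h> constants:

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* smallest positive normalized and largest finite values */
    printf("float:       %g to %g\n", (double)FLT_MIN, (double)FLT_MAX);
    printf("double:      %g to %g\n", DBL_MIN, DBL_MAX);
    printf("long double: %Lg to %Lg\n", LDBL_MIN, LDBL_MAX);
    return 0;
}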

answered Nov 15 '22 by Adam Rosenfield