Computing the ranges of floating point data types

Tags: c

Is it possible to compute the ranges of the float, double and long double data types in a portable way, using only ANSI C and without reading float.h? By portable, I mean including those cases where the target machine does not adhere to the IEEE 754 standard.

I'm reading the K&R book, and exercise 2-1 asks me to "compute" them, so I suppose that means avoiding float.h entirely, since it already defines FLT_MIN, FLT_MAX, DBL_MIN and DBL_MAX (reading those values directly would certainly not qualify as "computing").

asked Feb 08 '09 by Ree


2 Answers

It's possible (at least for IEEE 754 float and double values) to compute the greatest floating-point value via (pseudo-code):

~(-1.0) | 0.5

Before we can do our bit-twiddling, we'll have to convert the floating-point values to integers and then back again. This can be done in the following way:

#include <stdint.h>

uint64_t m_one, half;
double max;

/* type punning via pointer casts; this formally violates C's strict
   aliasing rules (memcpy would be the well-defined alternative) */
*(double *)(void *)&m_one = -1.0;           /* bit pattern of -1.0 */
*(double *)(void *)&half = 0.5;             /* bit pattern of 0.5 */
*(uint64_t *)(void *)&max = ~m_one | half;  /* twiddle, store as double */

So how does it work? For that, we have to know how the floating-point values will be encoded.

The highest bit encodes the sign, the next k bits encode the exponent and the lowest bits will hold the fractional part. For powers of 2, the fractional part is 0.

The exponent will be stored with a bias (offset) of 2**(k-1) - 1, which means an exponent of 0 corresponds to a pattern with all but the highest bit set.
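
To make the layout concrete, here is a small sketch (assuming IEEE 754 binary64, i.e. k = 11 and a bias of 1023) that decodes a double into its three fields; note that 1.0 comes out with an exponent field equal to the bias:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void decode(double d)
{
    uint64_t bits, frac;
    unsigned sign, exp;

    memcpy(&bits, &d, sizeof bits);          /* well-defined type punning */
    sign = (unsigned)(bits >> 63);           /* 1 sign bit */
    exp  = (unsigned)((bits >> 52) & 0x7FF); /* 11 exponent bits */
    frac = bits & ((1ULL << 52) - 1);        /* 52 fraction bits */
    printf("%g: sign=%u exponent=%u fraction=%llx\n",
           d, sign, exp, (unsigned long long)frac);
}

int main(void)
{
    decode(1.0);   /* exponent field 1023, i.e. the bias */
    decode(-1.0);  /* same exponent, but with the sign bit set */
    decode(0.5);   /* exponent field 1022, i.e. bias - 1 */
    return 0;
}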

There are two exponent bit patterns with special meaning:

  • if no bit is set, the value will be 0 if the fractional part is zero; otherwise, the value is a subnormal
  • if all bits are set, the value is either infinity or NaN

This means the greatest regular exponent will be encoded by setting all exponent bits except the lowest one, which corresponds to a value of 2**k - 2, or 2**(k-1) - 1 if you subtract the bias.

For double values, k = 11, i.e. the highest regular exponent will be 1023, so the greatest floating-point value is on the order of 2**1023, which is about 1E+308.

The greatest value will have

  • the sign bit set to 0
  • all but the lowest exponent bit set to 1
  • all fractional bits set to 1

Now, it's possible to understand how our magic numbers work:

  • -1.0 has its sign bit set; its exponent is the bias, i.e. all exponent bits but the highest one are set, and its fractional part is 0
  • ~(-1.0) has only the highest exponent bit and all fractional bits set
  • 0.5 has a sign bit and fractional part of 0; its exponent is the bias minus 1, i.e. all but the highest and lowest exponent bits are set

When we combine these two values via bitwise OR, we get exactly the bit pattern we wanted.
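
Putting it all together, here is the whole trick as a self-contained program (again a sketch assuming IEEE 754 binary64; it uses memcpy instead of pointer casts to stay within the aliasing rules) that checks the result against DBL_MAX:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <float.h>

int main(void)
{
    double m_one = -1.0, half = 0.5, max;
    uint64_t a, b, bits;

    memcpy(&a, &m_one, sizeof a);
    memcpy(&b, &half, sizeof b);

    bits = ~a | b;                  /* the magic formula, bit by bit */
    memcpy(&max, &bits, sizeof max);

    printf("computed: %g\n", max);      /* about 1.79769e+308 */
    printf("DBL_MAX:  %g\n", DBL_MAX);  /* should be identical */
    return 0;
}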


The computation works for x86 80-bit extended-precision values (aka long double) as well, but the bit-twiddling must be done byte-wise, as there's no integer type large enough to hold an 80-bit value.
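
A byte-wise sketch (assuming an x86 extended-precision long double; note that sizeof(long double) may include padding bytes beyond the 10 significant ones, which the FPU ignores when loading the value):

#include <stdio.h>
#include <string.h>

int main(void)
{
    long double m_one = -1.0L, half = 0.5L, max;
    unsigned char a[sizeof(long double)], b[sizeof(long double)];
    size_t i;

    memcpy(a, &m_one, sizeof a);
    memcpy(b, &half, sizeof b);

    /* ~(-1.0) | 0.5, one byte at a time; any padding bytes end up
       with garbage, but they don't take part in the encoding */
    for (i = 0; i < sizeof a; i++)
        a[i] = (unsigned char)(~a[i] | b[i]);

    memcpy(&max, a, sizeof max);
    printf("computed max = %Lg\n", max);  /* about 1.18973e+4932 */
    return 0;
}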

The bias isn't actually required to be 2**(k-1) - 1; the computation will work for an arbitrary bias as long as it is odd. The bias must be odd because otherwise the bit patterns for the exponents of 1.0 and 0.5 would differ in places other than the lowest bit.

If the base b (aka radix) of the floating-point type is not 2, you have to use b**(-1) instead of 0.5 = 2**(-1).

If the greatest exponent value is not reserved, use 1.0 instead of 0.5. This will work regardless of base or bias (meaning the bias is no longer restricted to odd values). The difference when using 1.0 is that the lowest exponent bit won't be cleared.


To summarize:

~(-1.0) | 0.5

works as long as the radix is 2, the bias is odd and the highest exponent is reserved.

~(-1.0) | 1.0

works for any radix or bias as long as the highest exponent is not reserved.

answered Nov 15 '22 by Christoph


For 99.99% of all applications, you should assume IEEE 754 and use the constants defined in <float.h>. In the other 0.01%, you'll be working with very specialized hardware, and in that case, you should know what to use based on the hardware.
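
For example, a minimal illustration of the <float.h> constants:

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* smallest positive normalized and largest finite values */
    printf("float:       %g to %g\n", (double)FLT_MIN, (double)FLT_MAX);
    printf("double:      %g to %g\n", DBL_MIN, DBL_MAX);
    printf("long double: %Lg to %Lg\n", LDBL_MIN, LDBL_MAX);
    return 0;
}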

answered Nov 15 '22 by Adam Rosenfield