 

How is infinity represented in a C double?

I learned from the book Computer Systems: A Programmer's Perspective that the IEEE standard requires the double precision floating number to be represented using the following 64-bit binary format:

  • s: 1 bit for sign
  • exp: 11 bits for exponent
  • frac: 52 bits for fraction

The +infinity is represented as a special value with the following pattern:

  • s = 0
  • all exp bits are 1
  • all fraction bits are 0

And I think the full 64-bit for double should be in the following order:

(s)(exp)(frac)

So I write the following C code to verify it:

//Check the infinity
double x1 = (double)0x7ff0000000000000;  // This should be the +infinity
double x2 = (double)0x7ff0000000000001; //  Note the extra ending 1, x2 should be NaN
printf("\nx1 = %f, x2 = %f sizeof(double) = %d", x1, x2, sizeof(x2));
if (x1 == x2)
    printf("\nx1 == x2");
else
    printf("\nx1 != x2");

But the result is:

x1 = 9218868437227405300.000000, x2 = 9218868437227405300.000000 sizeof(double) = 8
x1 == x2

Why does this print a large ordinary number rather than infinity?

Why is x1 == x2?

(I am using the MinGW GCC compiler.)

ADD 1

I modified the code as below and successfully validated the Infinity and NaN values.

//Check the infinity and NaN
unsigned long long x1 = 0x7ff0000000000000ULL; // +infinity as double
unsigned long long x2 = 0xfff0000000000000ULL; // -infinity as double
unsigned long long x3 = 0x7ff0000000000001ULL; // NaN as double
double y1 = *((double *)(&x1));
double y2 = *((double *)(&x2));
double y3 = *((double *)(&x3));

printf("\nsizeof(long long) = %d", sizeof(x1));
printf("\nx1 = %f, x2 = %f, x3 = %f", x1, x2, x3); // note: %f with integer arguments is undefined behavior
printf("\ny1 = %f, y2 = %f, y3 = %f", y1, y2, y3);

The result is:

sizeof(long long) = 8
x1 = 1.#INF00, x2 = -1.#INF00, x3 = 1.#SNAN0
y1 = 1.#INF00, y2 = -1.#INF00, y3 = 1.#QNAN0

The detailed output looks a bit strange, but I think the point is clear.

PS: The %f conversions in the x1/x2/x3 line appear to print the double interpretation even without the pointer conversion. However, passing an unsigned long long to printf's %f specifier is undefined behavior; it only happened to work here because of how the arguments are passed on this platform.

ADD 2

Out of curiosity, I checked the bit representation of the variables with the following code.

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, int len)
{
    int i;
    for (i = len-1; i>=0; i--)
    {
        printf("%.2x", start[i]);
    }
    printf("\n");
}

And I tried the code below:

//check the infinity and NaN
unsigned long long x1 = 0x7ff0000000000000ULL; // +infinity as double
unsigned long long x2 = 0xfff0000000000000ULL; // -infinity as double
unsigned long long x3 = 0x7ff0000000000001ULL; // NaN as double
double y1 = *((double *)(&x1));
double y2 = *((double *)(&x2));
double y3 = *((double *)(&x3));

unsigned long long x4 = x1 + x2;  // I want to check (+infinity)+(-infinity)
double y4 = y1 + y2; // I want to check (+infinity)+(-infinity)

printf("\nx1: ");
show_bytes((byte_pointer)&x1, sizeof(x1));
printf("\nx2: ");
show_bytes((byte_pointer)&x2, sizeof(x2));
printf("\nx3: ");
show_bytes((byte_pointer)&x3, sizeof(x3));
printf("\nx4: ");
show_bytes((byte_pointer)&x4, sizeof(x4));

printf("\ny1: ");
show_bytes((byte_pointer)&y1, sizeof(y1));
printf("\ny2: ");
show_bytes((byte_pointer)&y2, sizeof(y2));
printf("\ny3: ");
show_bytes((byte_pointer)&y3, sizeof(y3));
printf("\ny4: ");
show_bytes((byte_pointer)&y4, sizeof(y4));

The output is:

x1: 7ff0000000000000

x2: fff0000000000000

x3: 7ff0000000000001

x4: 7fe0000000000000

y1: 7ff0000000000000

y2: fff0000000000000

y3: 7ff8000000000001

y4: fff8000000000000  // <== different from x4

The strange part is that although x1 and x2 have exactly the same bit patterns as y1 and y2, the sum x4 is different from y4.

And

printf("\ny4=%f", y4);

gives this:

y4=-1.#IND00  // What does it mean???

Why are they different? And how is y4 obtained?

asked Nov 01 '14 by smwikipedia



2 Answers

First, 0x7ff0000000000000 is indeed the bit representation of a double infinity. But the cast does not set the bit representation; it converts the logical value of 0x7ff0000000000000, interpreted as a 64-bit integer, to the nearest double. So you need some other way to set the bit pattern.

The straightforward way to set the bit pattern would be

uint64_t bits = 0x7ff0000000000000;
double infinity = *(double*)&bits;

However, this is undefined behavior. The C standard forbids reading a value that has been stored as one fundamental type (uint64_t) as another fundamental type (double). This is known as strict aliasing rules, and allows the compiler to generate better code because it can assume that the order of the read of one type and a write of another type is irrelevant.

The only exception to this rule is the char types: You are explicitly allowed to cast any pointer to a char* and back. So you could try to use this code:

char bits[] = {0x7f, 0xf0, 0, 0, 0, 0, 0, 0};
double infinity = *(double*)bits;

Even though this avoids the obvious aliasing problem (strictly speaking, the char exception only licenses accessing other objects through char pointers, not reading a char array through a double*, and the array may not be suitably aligned; in practice compilers accept it), it is still implementation-defined behavior: the order of the bytes in a double depends on your machine. The given code works on a big-endian machine such as the Power family, but not on x86. (ARM is bi-endian in principle, but runs little-endian on virtually all current systems.) For x86 you need this version:

char bits[] = {0, 0, 0, 0, 0, 0, 0xf0, 0x7f};
double infinity = *(double*)bits;

There is really no way around this implementation defined behavior since there is no guarantee that a machine will store floating point values in the same order as integer values. There are even machines that use byte orders like this: <1, 0, 3, 2> I don't even want to know who came up with this brilliant idea, but it exists and we have to live with it.


To your last question: floating point arithmetic is inherently different from integer arithmetic. The bits have special meanings, and the floating point unit takes that into account. In particular, the special values like infinities, NaNs, and denormalized numbers are treated in a special way. And since +inf + -inf is defined to yield a NaN, your floating point unit emits the bit pattern of a NaN. The integer unit does not know about infinities or NaNs, so it just interprets the bit patterns as huge integers and happily performs an integer addition (which happens to wrap around in this case). The resulting bit pattern is not that of a NaN. It happens to be the bit pattern of a really huge, positive floating point number (2^1023, to be precise), but that bears no meaning.


Actually, there is a way to set the bit patterns of all values except NANs in a portable way: Given three variables containing the bits of the sign, exponent, and mantissa, you can do this:

uint64_t sign = ..., exponent = ..., mantissa = ...;
double result;
assert(!(exponent == 0x7ff && mantissa));    //Can't set the bits of a NAN in this way.
if(exponent) {
    //This code does not work for denormalized numbers. And it won't honor the value of mantissa when the exponent signals NAN or infinity.
    result = mantissa + (1ull << 52);    //Add the implicit bit.
    result /= (1ull << 52);    //This makes sure that the exponent is logically zero (equals the bias), so that the next operation will work as expected.
    result *= pow(2, (double)((signed)exponent - 0x3ff));    //This sets the exponent.
} else {
    //This code works for denormalized numbers.
    result = mantissa;    //No implicit bit.
    result /= (1ull << 51);    //This ensures that the next operation works as expected.
    result *= pow(2, -0x3ff);    //Scale down to the denormalized range.
}
result *= (sign ? -1.0 : 1.0);    //This sets the sign.

This uses the floating point unit itself to move the bits into the right place. Since there is no way to interact with the mantissa bits of a NAN using floating point arithmetic, it is not possible to include the generation of NANs in this code. Well, you could generate a NAN, but you'd have no control on its mantissa bit pattern.

answered Sep 16 '22 by cmaster - reinstate monica


The initialization

double x1=(double)0x7ff0000000000000;

is converting an integer literal to a double. You probably want to share the bitwise representation instead. This is implementation-specific (perhaps unspecified behavior), but you might use a union:

union { double x; long long n; } u;
u.n = 0x7ff0000000000000LL;

then use u.x. I am assuming that long long and double are both 64 bits on your machine; endianness and the floating-point representation also matter.

See also http://floating-point-gui.de/

Notice that not all processors are x86, and not all floating-point implementations are IEEE 754 (even if in 2014 most of them are). Your code probably won't work the same on an ARM processor, e.g. the one in your tablet.

answered Sep 18 '22 by Basile Starynkevitch