 

How is infinity represented in a C double?

I learned from the book Computer Systems: A Programmer's Perspective that the IEEE standard requires the double precision floating number to be represented using the following 64-bit binary format:

  • s: 1 bit for sign
  • exp: 11 bits for exponent
  • frac: 52 bits for fraction

The +infinity is represented as a special value with the following pattern:

  • s = 0
  • all exp bits are 1
  • all fraction bits are 0

And I think the full 64-bit for double should be in the following order:

(s)(exp)(frac)

So I write the following C code to verify it:

//Check the infinity
double x1 = (double)0x7ff0000000000000;  // This should be the +infinity
double x2 = (double)0x7ff0000000000001; //  Note the extra ending 1, x2 should be NaN
printf("\nx1 = %f, x2 = %f sizeof(double) = %d", x1, x2, sizeof(x2));
if (x1 == x2)
    printf("\nx1 == x2");
else
    printf("\nx1 != x2");

But the result is:

x1 = 9218868437227405300.000000, x2 = 9218868437227405300.000000 sizeof(double) = 8
x1 == x2

Why does this print a large ordinary number rather than infinity?

Why is x1 == x2?

(I am using the MinGW GCC compiler.)

ADD 1

I modified the code as below and successfully validated the Infinity and NaN values.

//Check the infinity and NaN
unsigned long long x1 = 0x7ff0000000000000ULL; // +infinity as double
unsigned long long x2 = 0xfff0000000000000ULL; // -infinity as double
unsigned long long x3 = 0x7ff0000000000001ULL; // NaN as double
double y1 = *((double *)(&x1));
double y2 = *((double *)(&x2));
double y3 = *((double *)(&x3));

printf("\nsizeof(long long) = %d", sizeof(x1));
printf("\nx1 = %f, x2 = %f, x3 = %f", x1, x2, x3); // note: %f with integer arguments is undefined behavior
printf("\ny1 = %f, y2 = %f, y3 = %f", y1, y2, y3);

The result is:

sizeof(long long) = 8
x1 = 1.#INF00, x2 = -1.#INF00, x3 = 1.#SNAN0
y1 = 1.#INF00, y2 = -1.#INF00, y3 = 1.#QNAN0

The detailed output looks a bit strange, but I think the point is clear.

PS: The %f conversions in the x1/x2/x3 line appear to print the double interpretation even without the pointer conversion. However, passing an unsigned long long to printf's %f specifier is undefined behavior; it only happened to work here because of how the arguments are passed on this platform.

ADD 2

Out of curiosity, I checked the bit representation of the variables with the following code.

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, int len)
{
    int i;
    for (i = len-1; i>=0; i--)
    {
        printf("%.2x", start[i]);
    }
    printf("\n");
}

And I tried the code below:

//check the infinity and NaN
unsigned long long x1 = 0x7ff0000000000000ULL; // +infinity as double
unsigned long long x2 = 0xfff0000000000000ULL; // -infinity as double
unsigned long long x3 = 0x7ff0000000000001ULL; // NaN as double
double y1 = *((double *)(&x1));
double y2 = *((double *)(&x2));
double y3 = *((double *)(&x3));

unsigned long long x4 = x1 + x2;  // I want to check (+infinity)+(-infinity)
double y4 = y1 + y2; // I want to check (+infinity)+(-infinity)

printf("\nx1: ");
show_bytes((byte_pointer)&x1, sizeof(x1));
printf("\nx2: ");
show_bytes((byte_pointer)&x2, sizeof(x2));
printf("\nx3: ");
show_bytes((byte_pointer)&x3, sizeof(x3));
printf("\nx4: ");
show_bytes((byte_pointer)&x4, sizeof(x4));

printf("\ny1: ");
show_bytes((byte_pointer)&y1, sizeof(y1));
printf("\ny2: ");
show_bytes((byte_pointer)&y2, sizeof(y2));
printf("\ny3: ");
show_bytes((byte_pointer)&y3, sizeof(y3));
printf("\ny4: ");
show_bytes((byte_pointer)&y4, sizeof(y4));

The output is:

x1: 7ff0000000000000

x2: fff0000000000000

x3: 7ff0000000000001

x4: 7fe0000000000000

y1: 7ff0000000000000

y2: fff0000000000000

y3: 7ff8000000000001

y4: fff8000000000000  // <== different from x4

The strange part is that although x1 and x2 have exactly the same bit patterns as y1 and y2, the sum x4 is different from y4.

And

printf("\ny4=%f", y4);

gives this:

y4=-1.#IND00  // What does it mean???

Why are they different? And how is y4 obtained?

asked Nov 01 '14 by smwikipedia



2 Answers

First, 0x7ff0000000000000 is indeed the bit representation of a double infinity. But the cast does not set the bit representation; it converts the logical value of 0x7ff0000000000000, interpreted as a 64-bit integer, to the nearest double. So you need some other way to set the bit pattern.

The straightforward way to set the bit pattern would be

uint64_t bits = 0x7ff0000000000000;
double infinity = *(double*)&bits;

However, this is undefined behavior. The C standard forbids reading a value that has been stored as one fundamental type (uint64_t) as another fundamental type (double). This is known as strict aliasing rules, and allows the compiler to generate better code because it can assume that the order of the read of one type and a write of another type is irrelevant.

The only exception to this rule is the char types: You are explicitly allowed to cast any pointer to a char* and back. So you could try to use this code:

char bits[] = {0x7f, 0xf0, 0, 0, 0, 0, 0, 0};
double infinity = *(double*)bits;

Even though this avoids the obvious aliasing problem (strictly speaking, the char exception only licenses accessing other objects through char pointers, not reading a char array through a double*, and the array may not be suitably aligned; in practice compilers accept it), it is still implementation-defined behavior: the order of the bytes in a double depends on your machine. The given code works on a big-endian machine such as the Power family, but not on x86. (ARM is bi-endian in principle, but runs little-endian on virtually all current systems.) For x86 you need this version:

char bits[] = {0, 0, 0, 0, 0, 0, 0xf0, 0x7f};
double infinity = *(double*)bits;

There is really no way around this implementation defined behavior since there is no guarantee that a machine will store floating point values in the same order as integer values. There are even machines that use byte orders like this: <1, 0, 3, 2> I don't even want to know who came up with this brilliant idea, but it exists and we have to live with it.


To your last question: floating point arithmetic is inherently different from integer arithmetic. The bits have special meanings, and the floating point unit takes that into account. In particular, the special values like infinities, NaNs, and denormalized numbers are treated in a special way. And since +inf + -inf is defined to yield a NaN, your floating point unit emits the bit pattern of a NaN. The integer unit does not know about infinities or NaNs, so it just interprets the bit patterns as huge integers and happily performs an integer addition (which happens to wrap around in this case). The resulting bit pattern is not that of a NaN. It happens to be the bit pattern of a really huge, positive floating point number (2^1023, to be precise), but that bears no meaning.


Actually, there is a way to set the bit patterns of all values except NANs in a portable way: Given three variables containing the bits of the sign, exponent, and mantissa, you can do this:

uint64_t sign = ..., exponent = ..., mantissa = ...;
double result;
assert(!(exponent == 0x7ff && mantissa));    //Can't set the bits of a NAN in this way.
if(exponent) {
    //This code does not work for denormalized numbers. And it won't honor the value of mantissa when the exponent signals NAN or infinity.
    result = mantissa + (1ull << 52);    //Add the implicit bit.
    result /= (1ull << 52);    //This makes sure that the exponent is logically zero (equals the bias), so that the next operation will work as expected.
    result *= pow(2, (double)((signed)exponent - 0x3ff));    //This sets the exponent.
} else {
    //This code works for denormalized numbers.
    result = mantissa;    //No implicit bit.
    result /= (1ull << 51);    //This ensures that the next operation works as expected.
    result *= pow(2, -0x3ff);    //Scale down to the denormalized range.
}
result *= (sign ? -1.0 : 1.0);    //This sets the sign.

This uses the floating point unit itself to move the bits into the right place. Since there is no way to interact with the mantissa bits of a NAN using floating point arithmetic, it is not possible to include the generation of NANs in this code. Well, you could generate a NAN, but you'd have no control on its mantissa bit pattern.

answered Sep 16 '22 by cmaster - reinstate monica


The initialization

double x1=(double)0x7ff0000000000000;

is converting an integer literal to a double. You probably want to share the bitwise representation instead. This is implementation-specific (perhaps unspecified behavior), but you might use a union:

union { double x; long long n; } u;
u.n = 0x7ff0000000000000LL;

then use u.x. I am assuming that long long and double are both 64 bits on your machine; endianness and the floating-point representation also matter.

See also http://floating-point-gui.de/

Notice that not all processors are x86, and not all floating-point implementations are IEEE 754 (even if in 2014 most of them are). Your code probably won't work the same on an ARM processor, e.g. the one in your tablet.

answered Sep 18 '22 by Basile Starynkevitch