Using floats or doubles instead of ints

I know that the default implementation of Lua uses floating point numbers only, thus circumventing the problem of dynamically determining the subtype of a number before choosing which variant of math function to use.

My question is -- if I try to emulate integers as doubles (or floats) in standard C99, is there a reliable (and simple) way to tell what is the maximum value representable precisely?

I mean, if I use 64-bit floats to represent integers, I certainly cannot represent all 64-bit integers (the pigeonhole principle applies here). How can I tell the maximum integer that is representable?

(Trying to list all values is not a solution -- if, for example, I'm using doubles on a 64-bit architecture, I'd have to check 2^{64} numbers.)

Thanks!

Jay asked May 19 '11 14:05


1 Answer

The maximum integer that is exactly representable (along with every smaller non-negative integer) is 2^53 (9007199254740992) for a 64-bit double and 2^24 (16777216) for a 32-bit float. See the significand ("digits") column in the table on the Wikipedia page for IEEE floating point numbers.

Verifying this in Lua is pretty simple:

local maxdouble = 2^53

-- one less than the maximum can be represented precisely
print (string.format("%.0f",maxdouble-1)) --> 9007199254740991
-- the maximum itself can be represented precisely
print (string.format("%.0f",maxdouble))   --> 9007199254740992
-- one more than the maximum gets rounded down
print (string.format("%.0f",maxdouble+1)) --> 9007199254740992 again

If we don't have the IEEE-defined field sizes handy, knowing what we know about the design of floating-point numbers, we can determine these values empirically with a simple loop over the powers of two:

#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#define min(a, b) ((a) < (b) ? (a) : (b))
#define bits(type) (sizeof(type) * 8)
/* Find the smallest pow for which 2^pow + 1 no longer survives a
   round trip through test_t. */
#define testimax(test_t) { \
  uintmax_t in = 1, out = 2; \
  size_t pow = 0, limit = min(bits(test_t), bits(uintmax_t)); \
  while (pow < limit && out == in + 1) { \
    in = in << 1; \
    out = (test_t) in + 1; \
    ++pow; \
  } \
  if (pow == limit) \
    puts(#test_t " is as precise as longest integer type"); \
  else printf(#test_t " conversion imprecise for 2^%zu+1:\n" \
    "   in: %" PRIuMAX "\n  out: %" PRIuMAX "\n\n", pow, in + 1, out); \
}

int main(void)
{
    testimax(float);
    testimax(double);
    return 0;
}

The output of the above code:

float conversion imprecise for 2^24+1:
   in: 16777217
  out: 16777216

double conversion imprecise for 2^53+1:
   in: 9007199254740993
  out: 9007199254740992

Of course, due to the way floating-point precision works, a 64-bit double can represent numbers much larger than 2^64 as the floating exponent grows positive. The Wikipedia page on double-precision floating-point describes:

Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2^53 to 2^54, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2^51 to 2^52, the spacing is 0.5, etc.

The absolute largest value a double can hold is listed further down that page: 0x7fefffffffffffff, which computes to (1 + (1 − 2^−52)) × 2^1023, or roughly 1.7976931348623157e308.

15 revs answered Oct 13 '22 01:10