Using floats or doubles instead of ints

Question

I know that the default implementation of Lua uses floating point numbers only, thus circumventing the problem of dynamically determining the subtype of a number before choosing which variant of math function to use.

My question is -- if I try to emulate integers as doubles (or floats) in standard C99, is there a reliable (and simple) way to tell what is the maximum value representable precisely?

I mean, if I use 64-bit floats to represent integers, I certainly cannot represent all 64-bit integers (the pigeonhole principle applies here). How can I tell the maximum integer that is representable?

(Trying to list all values is not a solution -- if, for example, I'm using doubles in a 64-bit architecture, as I'd have to list 2^{64} numbers)

Thanks!

15 revs · Accepted Answer

The maximum ones-representable integer is 2⁵³ (9007199254740992) for a 64-bit double and 2²⁴ (16777216) for a 32-bit float. See the base digits on the Wikipedia page for IEEE floating point numbers.

Verifying this in Lua is pretty simple:

local maxdouble = 2^53

-- one less than the maximum can be represented precisely
print (string.format("%.0f",maxdouble-1)) --> 9007199254740991
-- the maximum itself can be represented precisely
print (string.format("%.0f",maxdouble))   --> 9007199254740992
-- one more than the maximum gets rounded down
print (string.format("%.0f",maxdouble+1)) --> 9007199254740992 again

If we don't have the IEEE-defined field sizes handy, knowing what we know about the design of floating point numbers, we can determine these values using a simple loop over the possible values:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#define min(a, b) (a < b ? a : b)
#define bits(type) (sizeof(type) * 8)
#define testimax(test_t) { \
  uintmax_t in = 1, out = 2; \
  size_t pow = 0, limit = min(bits(test_t), bits(uintmax_t)); \
  while (pow < limit && out == in + 1) { \
    in = in << 1; \
    out = (test_t) in + 1; \
    ++pow; \
  } \
  if (pow == limit) \
    puts(#test_t " is as precise as longest integer type"); \
  else printf(#test_t " conversion imprecise for 2^%d+1:
" \
    "   in: %llu
  out: %llu

", pow, in + 1, out); \
}

int main(void)
{
    testimax(float);
    testimax(double);
    return 0;
}

The output of the above code:

float conversion imprecise for 2^24+1:
   in: 16777217
  out: 16777216

double conversion imprecise for 2^53+1:
   in: 9007199254740993
  out: 9007199254740992

Of course, due to the way floating-point precision works, a 64-bit double can represent numbers much larger than 2⁶⁴ as the floating exponent grows positive. The Wikipedia page on double-precision floating-point describes:

Between 2⁵²=4,503,599,627,370,496 and 2⁵³=9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2⁵³ to 2⁵⁴, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2⁵¹ to 2⁵², the spacing is 0.5, etc.

The absolute largest value a double can hold is listed further down that page: 0x7fefffffffffffff, which computes to (1 + (1 − 2⁻⁵²)) * 2¹⁰²³, or roughly 1.7976931348623157e308.

Using floats or doubles instead of ints

Tags:

floating-accuracy

lua

Jay

1 Answers

15 revs

Recent Activity

Donate For Us

Using floats or doubles instead of ints

Tags:

floating-accuracy

lua

Jay

1 Answers

15 revs

Related questions

Recent Activity

Donate For Us