I know that the default implementation of Lua uses floating point numbers only, thus circumventing the problem of dynamically determining the subtype of a number before choosing which variant of math function to use.
My question is -- if I try to emulate integers as doubles (or floats) in standard C99, is there a reliable (and simple) way to tell what is the maximum value representable precisely?
I mean, if I use 64-bit floats to represent integers, I certainly cannot represent all 64-bit integers (the pigeonhole principle applies here). How can I tell the maximum integer that is representable?
(Trying to list all values is not a solution -- if, for example, I'm using doubles in a 64-bit architecture, as I'd have to list 2^{64} numbers)
Thanks!
The maximum ones-representable integer is 253 (9007199254740992) for a 64-bit double and 224 (16777216) for a 32-bit float. See the base digits on the Wikipedia page for IEEE floating point numbers.
Verifying this in Lua is pretty simple:
local maxdouble = 2^53
-- one less than the maximum can be represented precisely
print (string.format("%.0f",maxdouble-1)) --> 9007199254740991
-- the maximum itself can be represented precisely
print (string.format("%.0f",maxdouble)) --> 9007199254740992
-- one more than the maximum gets rounded down
print (string.format("%.0f",maxdouble+1)) --> 9007199254740992 again
If we don't have the IEEE-defined field sizes handy, knowing what we know about the design of floating point numbers, we can determine these values using a simple loop over the possible values:
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#define min(a, b) (a < b ? a : b)
#define bits(type) (sizeof(type) * 8)
#define testimax(test_t) { \
uintmax_t in = 1, out = 2; \
size_t pow = 0, limit = min(bits(test_t), bits(uintmax_t)); \
while (pow < limit && out == in + 1) { \
in = in << 1; \
out = (test_t) in + 1; \
++pow; \
} \
if (pow == limit) \
puts(#test_t " is as precise as longest integer type"); \
else printf(#test_t " conversion imprecise for 2^%d+1:\n" \
" in: %llu\n out: %llu\n\n", pow, in + 1, out); \
}
int main(void)
{
testimax(float);
testimax(double);
return 0;
}
The output of the above code:
float conversion imprecise for 2^24+1:
in: 16777217
out: 16777216
double conversion imprecise for 2^53+1:
in: 9007199254740993
out: 9007199254740992
Of course, due to the way floating-point precision works, a 64-bit double can represent numbers much larger than 264 as the floating exponent grows positive. The Wikipedia page on double-precision floating-point describes:
Between 252=4,503,599,627,370,496 and 253=9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 253 to 254, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 251 to 252, the spacing is 0.5, etc.
The absolute largest value a double can hold is listed further down that page: 0x7fefffffffffffff, which computes to (1 + (1 − 2−52)) * 21023, or roughly 1.7976931348623157e308.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With