What is the exact range of (contiguous) integers that can be expressed as a double (resp. float)? I ask because I want to know, for questions such as this one, when a loss of accuracy will occur.
That is, what is:
the least positive integer m such that m+1 cannot be precisely expressed as a double (resp. float)?
the greatest negative integer -n such that -n-1 cannot be precisely expressed as a double (resp. float)? (May be the same as the above.)
This means that every integer between -n and m has an exact floating-point representation. I'm basically looking for the range [-n, m] for both floats and doubles.
Let's limit the scope to the standard IEEE 754 32-bit and 64-bit floating point representations. I know that the float has 24 bits of precision and the double has 53 bits (both with a hidden leading bit), but due to the intricacies of the floating point representation I'm looking for an authoritative answer for this. Please don't wave your hands!
(Ideal answer would prove that all the integers from 0 to m are expressible, and that m+1 is not.)
A double-precision floating-point number is a 64-bit approximation of a real number. In normalized form, the number can be zero or can range from -1.797693134862315E+308 to -2.225073858507201E-308, or from 2.225073858507201E-308 to 1.797693134862315E+308.
Double-precision floating-point format (sometimes called FP64 or float64) is a computer number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.
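double has roughly twice the precision of float: 53 significand bits versus 24. float is a 32-bit IEEE 754 single-precision floating-point number: 1 bit for the sign, 8 bits for the exponent, and 23 explicitly stored bits for the significand (a hidden leading bit brings the precision to 24 bits). float has about 7 decimal digits of precision.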
In computing, double precision is a computer numbering format that occupies two adjacent storage locations in computer memory. A double precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point (in which case it is often referred to as FP64).
Since you're asking about IEEE floating-point types, the language does not matter.
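#include <iostream>

int main() {
    // float: 24-bit significand, so 2^24 + 1 is the first integer that does not fit.
    float f0 = 16777215.0f; // 2^24 - 1
    float f1 = 16777216.0f; // 2^24
    float f2 = 16777217.0f; // 2^24 + 1, rounds to 2^24
    std::cout << (f0 == f1) << '\n'; // distinct values
    std::cout << (f1 == f2) << '\n'; // same stored value

    // double: 53-bit significand, so 2^53 + 1 is the first integer that does not fit.
    double d0 = 9007199254740991.0; // 2^53 - 1
    double d1 = 9007199254740992.0; // 2^53
    double d2 = 9007199254740993.0; // 2^53 + 1, rounds to 2^53
    std::cout << (d0 == d1) << '\n'; // distinct values
    std::cout << (d1 == d2) << '\n'; // same stored value
}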
Output:
0
1
0
1
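So the limit for float is 2^24, and the limit for double is 2^53. With a p-bit significand, every integer below 2^p fits in the significand directly, 2^p itself is a power of two and so is exact, and 2^p + 1 needs p + 1 significant bits and so is not. Negatives are the same since the only difference is the sign bit, which gives the ranges [-2^24, 2^24] for float and [-2^53, 2^53] for double.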