I've read a lot about floats, but it's all unnecessarily involved. I think I've got it pretty much understood, but there's just one thing I'd like to know for sure:
I know that, fractions of the form 1/pow(2,n)
, with n
an integer, can be represented exactly in floating point numbers. This means that if I add 1/32
to itself 32 million times, I would get exactly 1,000,000
.
What about something like 1/(32+16)
? It's one over the sum of two powers of two, does this work? Or is it 1/32+1/16
that works? This is where I'm confused, so if anyone could clarify that for me I would appreciate it.
Floating point numbers are used to represent noninteger fractional numbers and are used in most engineering and technical calculations, for example, 3.256, 2.1, and 0.0036. The most commonly used floating point standard is the IEEE standard.
The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number.
There are three different floating point data types: float, double, and long double. As with integers, C++ does not define the actual size of these types (but it does guarantee minimum sizes). On modern architectures, floating point representation almost always follows IEEE 754 binary format.
There are two floating point primitive types. Data type float is sometimes called "single-precision floating point". Data type double has twice as many bits and is sometimes called "double-precision floating point".
The rule can be summed up as this:
So 1/(32 + 16)
is not representable in binary because it has a factor of 3 in the denominator. But 1/32 + 1/16 = 3/32
is.
That said, there are more restrictions to be representable in a floating-point type. For example, you only have 53 bits of mantissa in an IEEE double
so 1/2 + 1/2^500
is not representable.
So you can do sum of powers-of-two as long as the range of the exponents doesn't span more than 53 powers.
To generalize this to other bases:
A number can be exactly represented in base 10 if the prime factorization of the denominator consists of only 2's and 5's.
A rational number X
can be exactly represented in base N
if the prime factorization of the denominator of X
contains only primes found in the factorization of N
.
A finite number can be represented in the common IEEE 754 double-precision format if and only if it equals M•2e for some integers M and e such that -253 < M < 253 and -1074 ≤ e ≤ 971.
For single precision, -224 < M < 224 and -149 ≤ e ≤ 104.
For double-precision, these are consequences of the facts that the double-precision format uses 52 bits to store a significand (which normally has 53 bits due to an implicit 1) and uses 11 bits to store an exponent. 11 bits encodes numbers from 0 to 2047, but 0 and 2047 are excluded for special purposes, and the encoded number is biased by 1023, so it represents unbiased exponents from -1022 to 1023. However, these unbiased exponents are for significands in the interval [1, 2), and those significands have fractions. To express the significand as an integer, I adjusted the exponent range by 52. Single-precision is similar, with 23 bits to store a 24-bit significand, 8 bits for the exponent, and a bias of 127.
Expressing the representable numbers using an integer times a power of two rather than the more common fractional significand simplifies some number theory and other reasoning about floating-point properties. I used it in this answer because it allows the set of representable values to be expressed concisely.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With