Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What types of numbers are representable in binary floating-point?

Tags:

I've read a lot about floats, but it's all unnecessarily involved. I think I've got it pretty much understood, but there's just one thing I'd like to know for sure:

I know that, fractions of the form 1/pow(2,n), with n an integer, can be represented exactly in floating point numbers. This means that if I add 1/32 to itself 32 million times, I would get exactly 1,000,000.

What about something like 1/(32+16)? It's one over the sum of two powers of two, does this work? Or is it 1/32+1/16 that works? This is where I'm confused, so if anyone could clarify that for me I would appreciate it.

like image 895
Niet the Dark Absol Avatar asked Aug 25 '12 19:08

Niet the Dark Absol


People also ask

What numbers can be represented in floating point?

Floating point numbers are used to represent noninteger fractional numbers and are used in most engineering and technical calculations, for example, 3.256, 2.1, and 0.0036. The most commonly used floating point standard is the IEEE standard.

What is a floating-point number in binary?

The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number.

How many types of floating numbers are there?

There are three different floating point data types: float, double, and long double. As with integers, C++ does not define the actual size of these types (but it does guarantee minimum sizes). On modern architectures, floating point representation almost always follows IEEE 754 binary format.

What are the 2 types of floating point?

There are two floating point primitive types. Data type float is sometimes called "single-precision floating point". Data type double has twice as many bits and is sometimes called "double-precision floating point".


2 Answers

The rule can be summed up as this:

  • A number can be represented exactly in binary if the prime factorization of the denominator contains only 2. (i.e. the denominator is a power-of-two)

So 1/(32 + 16) is not representable in binary because it has a factor of 3 in the denominator. But 1/32 + 1/16 = 3/32 is.

That said, there are more restrictions to be representable in a floating-point type. For example, you only have 53 bits of mantissa in an IEEE double so 1/2 + 1/2^500 is not representable.

So you can do sum of powers-of-two as long as the range of the exponents doesn't span more than 53 powers.


To generalize this to other bases:

  • A number can be exactly represented in base 10 if the prime factorization of the denominator consists of only 2's and 5's.

  • A rational number X can be exactly represented in base N if the prime factorization of the denominator of X contains only primes found in the factorization of N.

like image 185
Mysticial Avatar answered Oct 05 '22 19:10

Mysticial


A finite number can be represented in the common IEEE 754 double-precision format if and only if it equals M•2e for some integers M and e such that -253 < M < 253 and -1074 ≤ e ≤ 971.

For single precision, -224 < M < 224 and -149 ≤ e ≤ 104.

For double-precision, these are consequences of the facts that the double-precision format uses 52 bits to store a significand (which normally has 53 bits due to an implicit 1) and uses 11 bits to store an exponent. 11 bits encodes numbers from 0 to 2047, but 0 and 2047 are excluded for special purposes, and the encoded number is biased by 1023, so it represents unbiased exponents from -1022 to 1023. However, these unbiased exponents are for significands in the interval [1, 2), and those significands have fractions. To express the significand as an integer, I adjusted the exponent range by 52. Single-precision is similar, with 23 bits to store a 24-bit significand, 8 bits for the exponent, and a bias of 127.

Expressing the representable numbers using an integer times a power of two rather than the more common fractional significand simplifies some number theory and other reasoning about floating-point properties. I used it in this answer because it allows the set of representable values to be expressed concisely.

like image 29
Eric Postpischil Avatar answered Oct 05 '22 17:10

Eric Postpischil