I know a little bit about how floating-point numbers are represented, but not enough, I'm afraid. The general question is:

For a given precision (for my purposes, the number of accurate decimal places in base 10), what range of numbers can be represented for 16-, 32- and 64-bit IEEE-754 systems?

Specifically, I'm only interested in the range of 16-bit and 32-bit numbers accurate to +/-0.5 (the ones place) or +/-0.0005 (the thousandths place).


What range of numbers can be represented in a 16-, 32- and 64-bit IEEE-754 systems?

1 Answer

For a given IEEE-754 floating point number X, if

2^E <= abs(X) < 2^(E+1)

then the distance from X to the next larger representable floating-point number (epsilon) is:

epsilon = 2^(E-52)    % For a 64-bit float (double precision)
epsilon = 2^(E-23)    % For a 32-bit float (single precision)
epsilon = 2^(E-10)    % For a 16-bit float (half precision)
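As a rough illustration of the formulas above, here is a minimal Python sketch. The names `MANTISSA_BITS` and `spacing` are made up for this example and are not part of any library. It assumes X is a normal (not subnormal), finite, nonzero number and ignores each format's exponent limits.

```python
import math
import sys

# Explicit fraction (mantissa) bits of each IEEE-754 binary format.
# These are the 10, 23 and 52 that appear in the epsilon formulas.
MANTISSA_BITS = {"half": 10, "single": 23, "double": 52}

def spacing(x, fmt):
    """Gap between adjacent `fmt` floats in the power-of-two interval holding x.

    Hypothetical helper: covers normal numbers only, with no subnormal
    or overflow handling.
    """
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("x must be finite and nonzero")
    # frexp gives abs(x) = m * 2**e with 0.5 <= m < 1, so 2**(e-1) <= abs(x) < 2**e.
    _, e = math.frexp(abs(x))
    E = e - 1
    return math.ldexp(1.0, E - MANTISSA_BITS[fmt])  # 2**(E - p)

print(spacing(1000.0, "half"))    # 1000 lies in [2^9, 2^10), so 2^(9-10) = 0.5
print(spacing(1000.0, "single"))  # 2^(9-23)
print(spacing(1.0, "double") == sys.float_info.epsilon)  # True
```

For doubles, the result can be cross-checked against `math.ulp(x)`, which is available in Python 3.9 and later.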

The above equations allow us to compute the following:

For half precision...

If you want an accuracy of +/-0.5 (a spacing of 2^-1 between adjacent values), the number must be smaller than 2^10. At 2^10 and above, the distance between adjacent floating-point numbers is at least 1.

If you want an accuracy of +/-0.0005 (roughly 2^-11, which is about 0.000488), the number must be smaller than 1. At 1 and above, the distance between adjacent floating-point numbers is at least 2^-10 (about 0.000977).

For single precision...

If you want an accuracy of +/-0.5 (a spacing of 2^-1), the number must be smaller than 2^23. At 2^23 and above, the distance between adjacent floating-point numbers is at least 1.

If you want an accuracy of +/-0.0005 (roughly 2^-11), the number must be smaller than 2^13. At 2^13 and above, the distance between adjacent floating-point numbers is at least 2^-10 (about 0.000977).

For double precision...

If you want an accuracy of +/-0.5 (a spacing of 2^-1), the number must be smaller than 2^52. At 2^52 and above, the distance between adjacent floating-point numbers is at least 1.

If you want an accuracy of +/-0.0005 (roughly 2^-11), the number must be smaller than 2^42. At 2^42 and above, the distance between adjacent floating-point numbers is at least 2^-10 (about 0.000977).
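The limits listed above can be reproduced by inverting the epsilon formula. The following Python sketch is one way to do that, under the same assumptions as the answer (normal numbers, exponent range ignored). The helper names `magnitude_limit` and `round_trips` are invented for this example. The round-trip check relies on the standard `struct` module, whose `e`, `f` and `d` format codes pack half, single and double precision values.

```python
import math
import struct

MANTISSA_BITS = {"half": 10, "single": 23, "double": 52}
STRUCT_CODE = {"half": "<e", "single": "<f", "double": "<d"}

def magnitude_limit(max_spacing, fmt):
    """Power of two below which adjacent `fmt` floats are at most max_spacing apart.

    Hypothetical helper: ignores subnormals and the format's exponent range.
    """
    # Largest power-of-two spacing not above max_spacing is 2**k,
    # with k = floor(log2(max_spacing)), taken exactly from frexp.
    _, e = math.frexp(max_spacing)
    k = e - 1
    # 2**(E - p) <= 2**k  means  E <= k + p,  i.e.  abs(x) < 2**(k + p + 1).
    return math.ldexp(1.0, k + MANTISSA_BITS[fmt] + 1)

def round_trips(x, fmt):
    """True if x is stored exactly when packed into format `fmt`."""
    code = STRUCT_CODE[fmt]
    return struct.unpack(code, struct.pack(code, x))[0] == x

for fmt in ("half", "single", "double"):
    for tol in (0.5, 0.0005):
        limit = magnitude_limit(tol, fmt)
        print(f"{fmt:6s} spacing <= {tol}: abs(x) < 2^{int(math.log2(limit))}")

# Just below the half-precision limit a step of 0.5 is representable;
# just above it, it is not.
print(round_trips(1023.5, "half"), round_trips(1024.5, "half"))  # True False
```

Working with powers of two through `frexp` and `ldexp` keeps every step exact, so the computed limits are not affected by rounding in `math.log2`.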

answered Sep 17 '22 17:09

gnovice


Tags:

floating-point

precision

ieee-754

numerical

Nate Parsons
