How to Calculate Double + Float Precision

Tags:

floating-point

I have been trying to find out how to calculate the float/double precision/range numbers -3.402823e38 .. 3.402823e38 and -1.79769313486232e308 .. 1.79769313486232e308.

For int32 you would compute 2^32 = 4294967296 and halve it, which gives a range of -2147483648 to 2147483647. So how do I figure out the precision numbers for float and double? I think I am searching for the wrong terms, since nothing is coming up anywhere.


asked Jan 06 '11 01:01

Mike Diaz

1 Answer

Well, both types actually look like the following:

[sign] [exponent] [mantissa]

representing a number in the following form:

[sign] 1.[mantissa] × 2^[exponent]

with the size of the exponent and mantissa varying. For float the exponent is eight bits wide, while double has an eleven-bit exponent. Furthermore, the exponent is stored unsigned with a bias, which is 127 for float and 1023 for double. The all-zeros and all-ones exponent patterns are reserved (for zero and subnormals, and for infinity and NaN, respectively), so this results in an exponent range of −126 through 127 for float and −1022 through 1023 for double.

The exponent is a power of two, so calculating 2¹²⁷ gives about 1.7 × 10³⁸, which gets you in the approximate range of the float maximum value. Similarly, 2¹⁰²³ is about 9 × 10³⁰⁷ for double.
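To make the layout concrete, here is a minimal sketch in Python that splits a value into the three fields. The helper names `decode_float32` and `decode_float64` are made up for this illustration; only the standard `struct` module is used.

```python
import struct

def decode_float32(x):
    """Return (sign, biased exponent, mantissa bits) of x stored as a float."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                     # 1 bit
    biased_exponent = (bits >> 23) & 0xFF # 8 bits, bias 127
    mantissa = bits & 0x7FFFFF            # 23 stored bits (implicit leading 1 not stored)
    return sign, biased_exponent, mantissa

def decode_float64(x):
    """Return (sign, biased exponent, mantissa bits) of x stored as a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                       # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF  # 11 bits, bias 1023
    mantissa = bits & ((1 << 52) - 1)       # 52 stored bits
    return sign, biased_exponent, mantissa

# -6.5 is -1.101 (binary) x 2^2
sign, biased, mantissa = decode_float32(-6.5)
print(sign, biased - 127, hex(mantissa))  # 1 2 0x500000
```

Subtracting the bias from the stored field gives the actual exponent, which is why 1.0 is stored with an exponent field of 127 (float) or 1023 (double).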

Obviously those numbers are not exactly those we expect. This is where the mantissa comes into play. The mantissa represents a normalized binary number that always begins with “1.” (that's the normalized part). The rest is simply the digits after the dot. Since the maximum mantissa is then roughly 1.111111111... in binary, which is almost 2, we'll get approximately 3.4 × 10³⁸ as float's maximum value and 1.79 × 10³⁰⁸ as the maximum value for double.
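The arithmetic above can be checked quickly in Python, whose `float` is a double on the usual platforms (that is an assumption of this sketch). The variable names are just for illustration.

```python
import sys

# Largest exponents, as derived above.
float_max_exp = 127
double_max_exp = 1023

# Largest mantissa: 1.111...1 in binary, with 23 (float) or 52 (double) digits after the point.
float_max_mantissa = 2 - 2.0**-23
double_max_mantissa = 2 - 2.0**-52

float_max = float_max_mantissa * 2.0**float_max_exp
double_max = double_max_mantissa * 2.0**double_max_exp

print(2.0**float_max_exp)              # about 1.7e38, the power of two alone
print(float_max)                       # about 3.4e38
print(double_max == sys.float_info.max)  # True
```

All of these products are exactly representable in a double, so no rounding is involved in the check.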

[EDIT 2011-01-06] As Mark points out below (and below the question), the exact formula is the following:

Formula to calculate the exact maximum value for an IEEE-754 floating-point type: 2^(2^(e−1)) ⋅ (1 − 2^(−p))

where e is the number of bits in the exponent and p is the number of bits in the mantissa, including the aforementioned implicit bit (due to normalization). The formula replicates what we have seen above, only now accurately. The first factor, 2^(2^(e−1)), is two raised to the maximum exponent, multiplied by two: for float it is 2¹²⁸ = 2 × 2¹²⁷ (we save that two in the second factor). The second factor is the largest number we can represent below one. I said above that the largest mantissa is almost two. Since we exaggerated the first factor by a factor of two in this formula, we need to account for that and now have a number that is almost one. I hope it's not too confusing.

In any case, for float (with e = 8 and p = 24) we get the exact value 340282346638528859811704183484516925440 or roughly 3.4 × 10³⁸. double then yields (with e = 11 and p = 53) 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368 or roughly 1.80 × 10³⁰⁸.

[/EDIT]
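The exact formula is easy to evaluate with Python's arbitrary-precision integers. This is only a sketch: `ieee_max` is a hypothetical helper name, and the product is expanded to 2^(2^(e−1)) − 2^(2^(e−1) − p) so that everything stays an integer.

```python
import sys

def ieee_max(e, p):
    """Exact value of 2^(2^(e-1)) * (1 - 2^-p) as an integer.

    e: number of exponent bits
    p: number of mantissa bits, including the implicit leading bit
    """
    top = 2 ** (e - 1)               # 128 for float, 1024 for double
    return 2**top - 2 ** (top - p)   # the product, multiplied out

float_max = ieee_max(8, 24)
double_max = ieee_max(11, 53)

print(float_max)                              # 340282346638528859811704183484516925440
print(double_max == int(sys.float_info.max))  # True, assuming Python floats are doubles
```

Note that e = 11 is needed for double; with e = 10 the formula gives a far smaller number.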

Another thing: You're bringing up the term “precision” in your question but you quote the ranges of the types. Precision is quite a different thing and refers to how many significant digits the type can retain. Again, the answer here lies in the mantissa, which is 23 and 52 bits for float and double, respectively. Since the numbers are stored normalized we actually have an implicit bit added to that, which puts us at 24 and 53 bits. Now, the way digits after the decimal (or binary here) point work is the following:

 1.   1     0     1     1
 ↑    ↑     ↑     ↑     ↑
2^0  2^-1  2^-2  2^-3  2^-4
 =    =     =     =     =
 1   0.5   0.25  0.125 0.0625

So the very last digit in the double mantissa represents a value of 2⁻⁵², or roughly 2.2 × 10⁻¹⁶. If the exponent is 0 (that is, for numbers between 1 and 2), this is the smallest step between adjacent values, which places double precision at around 16 decimal digits. Likewise, float's last digit is 2⁻²³ ≈ 1.2 × 10⁻⁷, giving roughly seven digits.
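A small Python check of the precision argument, again assuming Python's `float` is an IEEE-754 double. The digit counts are estimated as p × log10(2), which is a rule of thumb rather than an exact guarantee.

```python
import math
import sys

eps = 2.0**-52  # value of the last mantissa digit when the exponent is 0

print(eps == sys.float_info.epsilon)  # True
print(1.0 + eps > 1.0)                # True: the next double after 1.0
print(1.0 + eps / 2 == 1.0)           # True: half a step rounds back to 1.0

# Approximate number of decimal digits carried by p binary digits.
double_digits = 53 * math.log10(2)    # just under 16
float_digits = 24 * math.log10(2)     # a bit over 7
print(double_digits, float_digits)
```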

answered Nov 14 '22 17:11

Joey
