
How many decimal places does the primitive float and double support? [duplicate]

Tags:

c++

I have read that double stores 15 digits and float stores 7 digits.

My question is, are these numbers the number of decimal places supported or total number of digits in a number?

code511788465541441 asked Jan 20 '15 12:01


4 Answers

If you are on an architecture using IEEE-754 floating point arithmetic (as in most architectures), then the type float corresponds to single precision, and the type double corresponds to double precision, as described in the standard.

Let's run the numbers:

Single precision:

32 bits represent the number, of which 24 bits of precision belong to the mantissa (23 stored bits plus the "hidden 1", which is the MSB and is not stored). The least significant bit (LSB) therefore has a relative value of 2^(-23) with respect to the MSB, and the worst-case relative rounding error is half of that, 2^(-24), which is about 10^(-7.22). In other words, for a fixed exponent, the resolution is about 10^(-7.22) times the power of two selected by the exponent. For a number written in scientific notation (3.141592653589 E 25), only about 7.22 decimal digits are significant. In practice this means roughly 7 significant decimal digits, of which only 6 are guaranteed to survive a round trip from decimal text to float and back (this is the value FLT_DIG reports).

Double precision:

64 bits represent the number, of which 53 bits of precision belong to the mantissa (52 stored bits plus the hidden 1). Following the same reasoning, expressing 2^(-53) as a power of 10 gives 10^(-15.95), which in turn means that 15 significant decimal digits will always be correct.
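The arithmetic above can be checked with std::numeric_limits. This is only a sketch and assumes an IEEE-754 platform; the helper names (significand_bits, approx_decimal_digits, guaranteed_decimal_digits) are made up for illustration and are not part of any library.

```cpp
#include <cmath>
#include <limits>

// Number of binary digits in the significand, counting the hidden leading 1.
// On IEEE-754 platforms this is 24 for float and 53 for double.
template <typename T>
constexpr int significand_bits() {
    return std::numeric_limits<T>::digits;
}

// Approximate number of decimal digits carried by p binary digits:
// p * log10(2). This is where the 7.22 and 15.95 figures come from.
template <typename T>
double approx_decimal_digits() {
    return significand_bits<T>() * std::log10(2.0);
}

// Decimal digits guaranteed to survive a decimal -> T -> decimal round trip.
// This is the conservative figure (same value as FLT_DIG / DBL_DIG).
template <typename T>
constexpr int guaranteed_decimal_digits() {
    return std::numeric_limits<T>::digits10;
}
```

Calling approx_decimal_digits<float>() and approx_decimal_digits<double>() should give values close to 7.22 and 15.95, while guaranteed_decimal_digits reports the more conservative whole-digit counts.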

Samuel Navarro Lou answered Oct 16 '22 13:10


Those are the total number of "significant figures" if you will, counting from left to right, regardless of where the decimal point is. Beyond those numbers of digits, accuracy is not preserved.
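To illustrate "counting from left to right, regardless of where the decimal point is", here is a small sketch that compares a value before and after it is stored in a float. The helpers sci and float_keeps_digits are hypothetical names, and the result assumes IEEE-754 single precision.

```cpp
#include <cstdio>
#include <string>

// Format a value in scientific notation with `sig` significant digits.
std::string sci(double value, int sig) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*e", sig - 1, value);
    return buf;
}

// True if `exact`, after being stored in a float, still shows the same
// first `sig` significant digits as the original double.
bool float_keeps_digits(double exact, int sig) {
    float stored = static_cast<float>(exact);
    return sci(stored, sig) == sci(exact, sig);
}
```

With these particular values, float_keeps_digits(123456789.0, 7), float_keeps_digits(1234.56789, 7) and float_keeps_digits(0.000123456789, 7) all hold, even though the decimal point sits in a different place each time, while float_keeps_digits(123456789.0, 9) does not: the float nearest to 123456789 is 123456792.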

The counts you listed are for the base 10 representation.

John Zwinck answered Oct 16 '22 15:10


There are macros for the number of decimal digits of precision each type supports. The gcc docs explain what they are and also what they mean:

FLT_DIG

This is the number of decimal digits of precision for the float data type. Technically, if p and b are the precision and base (respectively) for the representation, then the decimal precision q is the maximum number of decimal digits such that any floating point number with q base 10 digits can be rounded to a floating point number with p base b digits and back again, without change to the q decimal digits.

The value of this macro is supposed to be at least 6, to satisfy ISO C.

DBL_DIG
LDBL_DIG

These are similar to FLT_DIG, but for the data types double and long double, respectively. The values of these macros are supposed to be at least 10.

On both gcc 4.9.2 and clang 3.5.0, FLT_DIG and DBL_DIG yield 6 and 15, respectively.
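The round-trip definition quoted above can be tried directly. The sketch below assumes IEEE-754 single precision, correctly rounded strtof/snprintf (as in glibc), and a two-digit exponent in the printed form; round_trip is a hypothetical helper, not a standard function.

```cpp
#include <cfloat>
#include <cstdio>
#include <cstdlib>
#include <string>

// Parse decimal text into a float, then print it back in scientific
// notation with `digits` significant decimal digits.
std::string round_trip(const char* text, int digits) {
    float f = std::strtof(text, nullptr);
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*e", digits - 1, f);
    return buf;
}

// Convenience wrappers so the macro values can be inspected at run time.
int float_digits()  { return FLT_DIG; }
int double_digits() { return DBL_DIG; }
```

Under those assumptions, round_trip("8.58997e9", FLT_DIG) gives back "8.58997e+09", but round_trip("8.589973e9", FLT_DIG + 1) gives "8.589974e+09": near 2^33 adjacent floats are 1024 apart while 7-digit decimals are only 1000 apart, so the seventh digit cannot always be preserved. That is why FLT_DIG is 6 rather than 7.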

Barry answered Oct 16 '22 13:10


are these numbers the number of decimal places supported or total number of digits in a number?

They are the significant digits contained in every number (you may not need all of them, but they are still there). The mantissa of a given type always contains the same number of bits, so every number consequently contains the same number of valid "digits" if you think in terms of decimal digits. You cannot store more digits than will fit into the mantissa.

The number of "supported" digits is, however, much larger: for example, float will usually support numbers up to about 38 decimal digits long and double up to about 308 decimal digits long, but most of these digits are not significant (that is, they are "unknown").

Technically, though, this is wrong, since float and double do not have universally well-defined sizes as I presumed above (they are implementation-defined). Also, storage sizes are not necessarily the same as the sizes of intermediate results.

The C++ standard is very reluctant to precisely define any fundamental type, leaving almost everything to the implementation. Floating point types are no exception:
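The difference between range and precision can be sketched as follows, assuming IEEE-754 single and double precision with round-to-nearest. The function names are invented for this example.

```cpp
#include <limits>

// Largest power of ten that is still a finite value of type T
// (38 for IEEE-754 float, 308 for IEEE-754 double).
template <typename T>
constexpr int max_decimal_exponent() {
    return std::numeric_limits<T>::max_exponent10;
}

// Near 1e38 adjacent floats are about 1e31 apart, so adding 1e30
// (a number with 31 decimal digits) does not change the stored value.
bool small_addend_is_lost() {
    float big = 1e38f;
    float sum = big + 1e30f;
    return sum == big;
}
```

So a float can hold a number that is 39 digits long when written out, but only the leading handful of those digits carry information; small_addend_is_lost() returning true shows that everything below roughly the seventh digit is simply not stored.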

3.9.1 / 8
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

Now of course all of this is not particularly helpful in practice.

In practice, floating point is (usually) IEEE 754 compliant, with float having a width of 32 bits and double having a width of 64 bits as stored in memory; registers have higher precision on some notable mainstream architectures.

This is equivalent to 24 bits and 53 bits of mantissa, respectively, or roughly 7 and 15 significant decimal digits.
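Since the standard leaves all of this to the implementation, it may be worth checking the assumptions instead of trusting them. A minimal sketch, with hypothetical helper names, using only standard facilities (std::numeric_limits, CHAR_BIT, FLT_EVAL_METHOD):

```cpp
#include <cfloat>
#include <climits>
#include <limits>

// True if T follows IEC 559 (the ISO name for IEEE 754) on this platform.
template <typename T>
constexpr bool is_ieee754() {
    return std::numeric_limits<T>::is_iec559;
}

// Width of T as stored in memory, in bits.
template <typename T>
constexpr int storage_bits() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// How intermediate results are evaluated:
//   0  in the type of the operands,
//   1  float is evaluated as double,
//   2  everything is evaluated as long double,
//  <0  indeterminate or implementation-defined.
int evaluation_method() {
    return FLT_EVAL_METHOD;
}
```

On a typical x86-64 build one would expect is_ieee754<float>() and is_ieee754<double>() to be true, storage_bits to report 32 and 64, and evaluation_method() to be 0, but none of that is promised by the standard, which is the point of checking.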

Damon answered Oct 16 '22 15:10