I'm currently learning about conversions between data types in C++. I have been taught that, for a really large int, we can (on some computers) suffer a loss of precision when converting to double, but no reason was given for that statement.
Could someone please provide an explanation and an example? Thanks.
The biggest/largest integer that can be stored in a double without losing precision is the same as the largest possible value of a double. That is, DBL_MAX, or approximately 1.8 × 10^308 (if your double is an IEEE 754 64-bit double). It's an integer.
The doubleValue() method of Java's Float class returns the value of the Float object as a double, by applying a widening primitive conversion to the wrapped float.
The %d format specifier expects an int argument, but you're passing a double. Using the wrong format specifier invokes undefined behavior. To print a double, use %f.
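As a minimal sketch of the point about format specifiers, the two helpers below (hypothetical names, not from any library) pair each specifier with an argument of the matching type. The second one converts explicitly before using %d, which assumes the value fits in an int.

```cpp
#include <cstdio>
#include <string>

// Formats a double with %f, the specifier that matches a double argument.
std::string format_double(double value) {
    char buffer[64];
    // Passing `value` to %d instead would be undefined behavior.
    std::snprintf(buffer, sizeof buffer, "%f", value);
    return std::string(buffer);
}

// Formats a double as an integer by converting it explicitly first.
std::string format_truncated(double value) {
    char buffer[64];
    // static_cast<int> truncates toward zero; the value must fit in an int.
    std::snprintf(buffer, sizeof buffer, "%d", static_cast<int>(value));
    return std::string(buffer);
}
```

With the default "C" locale, %f prints six digits after the decimal point.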
int and double are both primitive data types. The main difference between them is that an int typically stores a 32-bit two's complement integer, while a double stores a 64-bit double-precision floating-point value. In brief, a double takes twice as much memory as an int.
Let's say that the floating point number uses N bits of storage.
Now, let us assume that this floating point type can precisely represent every integer that an N-bit integer type can represent. Since the N-bit integer type needs all of its N bits to represent all of its values, the floating point type would need all N of its bits for those integers as well.
A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption was wrong: a floating point type cannot precisely represent every value of an integer type of the same size.
Since there must be non-representable integers in the range of an N-bit integer, converting such an integer to an N-bit floating point type will lose precision whenever the value happens to be one of the non-representable ones.
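The argument above can be observed with a 32-bit integer and a 32-bit float. The sketch below assumes float is an IEEE-754 binary32 type with a 24-bit significand, which is common but not guaranteed by the C++ standard; the function name is made up for illustration.

```cpp
#include <cstdint>

// True if converting `value` to float and back returns the same integer.
// Assumes float is IEEE-754 binary32 (24-bit significand).
bool survives_float_round_trip(std::int32_t value) {
    float as_float = static_cast<float>(value);
    // Convert back through a wider integer type so that a float rounded
    // up to 2^31 still fits in the destination.
    return static_cast<std::int64_t>(as_float) == value;
}
```

Under that assumption, 16777216 (2^24) survives the round trip, while 16777217 does not, because it lies between two adjacent float values.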
Now, since a floating point type can represent a subset of the rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point number can represent all integers up to 2^53 in magnitude. This property follows directly from the length of the mantissa, which is 53 bits (52 stored bits plus one implicit leading bit).
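A small sketch of that boundary, assuming double is an IEEE-754 binary64 type with the default round-to-nearest mode. The constant and function names are illustrative only.

```cpp
#include <cstdint>

// 2^53: above this magnitude, a binary64 double can no longer represent
// every consecutive integer.
constexpr std::int64_t kTwoPow53 = std::int64_t{1} << 53;

// True if `value` is unchanged by a conversion to double and back.
bool survives_double_round_trip(std::int64_t value) {
    double as_double = static_cast<double>(value);
    // A value rounded up to 2^63 does not fit in int64_t, and converting
    // it back would be undefined behavior, so report it as changed.
    if (as_double >= 9223372036854775808.0) {
        return false;
    }
    return static_cast<std::int64_t>(as_double) == value;
}
```

Here kTwoPow53 itself survives, but kTwoPow53 + 1 is rounded to a neighboring even integer and so does not.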
Therefore, a 32-bit integer cannot lose precision when converted to a double on a system that conforms to IEEE-754.
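One way to check this on a given implementation is to compare the number of value bits of the integer type with the number of mantissa bits of double, both of which std::numeric_limits reports as digits. The helper below is a hypothetical sketch and assumes a binary (radix 2) double.

```cpp
#include <cstdint>
#include <limits>

// True if every value of the integer type Int converts to double exactly,
// judged by comparing Int's value bits with double's mantissa bits.
template <typename Int>
constexpr bool converts_exactly_to_double() {
    static_assert(std::numeric_limits<Int>::is_integer,
                  "Int must be an integer type");
    static_assert(std::numeric_limits<double>::radix == 2,
                  "this check assumes a binary floating point format");
    return std::numeric_limits<Int>::digits <=
           std::numeric_limits<double>::digits;
}
```

On a typical platform with a 53-bit mantissa, converts_exactly_to_double<std::int32_t>() is true and converts_exactly_to_double<std::int64_t>() is false.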
More technically, the floating point unit of the x86 architecture actually uses an 80-bit extended floating point format, which is designed to be able to represent all 64-bit integers precisely and which, with some compilers, can be accessed using the long double type.
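Whether long double really maps to the 80-bit format depends on the compiler and target, so it is safer to ask the implementation than to assume. A minimal sketch, with a made-up function name:

```cpp
#include <limits>

// True if long double has enough mantissa bits (at least 64) to hold
// every 64-bit integer exactly on this implementation.
bool long_double_covers_64_bit_integers() {
    return std::numeric_limits<long double>::radix == 2 &&
           std::numeric_limits<long double>::digits >= 64;
}
```

The x87 extended format has a 64-bit mantissa, so this returns true where long double uses it. Where long double is the same format as double, it returns false.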
This may happen if int is 64 bits wide and double is 64 bits wide as well. Floating point numbers are composed of a mantissa (which represents the digits) and an exponent. Because the mantissa of the double in such a case has fewer bits than the int, the double is able to represent fewer digits, and a loss of precision happens.
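To see the effect of the limited mantissa, one can measure the distance between a double and the next representable double above it. The sketch below uses std::nextafter from <cmath> and assumes an IEEE-754 binary64 double; gap_above is an illustrative name.

```cpp
#include <cmath>
#include <limits>

// Distance from `x` to the next representable double above it.
double gap_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

Under that assumption, the gap is 1 at 2^52, 2 at 2^53, and 1024 at 2^62, so most large 64-bit integers fall between two doubles and have to be rounded.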