Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

converting really large int to double, loss of precision on some computer

I'm currently learning inter-type data convertion in cpp. I have been taught that

For a really large int, we can (for some computers) suffer a loss of precision when converting to double.

But no reason was provided for the statement.

Could someone please provide an explanation and an example? Thanks

like image 879
Thor Avatar asked Sep 25 '18 05:09

Thor


People also ask

What is the largest number that can be stored in a double?

The biggest/largest integer that can be stored in a double without losing precision is the same as the largest possible value of a double. That is, DBL_MAX or approximately 1.8 × 10308 (if your double is an IEEE 754 64-bit double). It's an integer.

Can you turn a float into a double?

The doubleValue() method of Java Float class returns a double value corresponding to this Float Object by widening the primitive values or in simple words by directly converting it to double via doubleValue() method .

Can we convert int to double in C?

The %d format specifier expects an int argument, but you're passing a double . Using the wrong format specifier invokes undefined behavior. To print a double , use %f .

Which is bigger integer or double?

The int and double are major primitive data types. The main difference between int and double is that int is used to store 32 bit two's complement integer while double is used to store 64 bit double precision floating point value. In brief, double takes twice memory space than int to store data.


2 Answers

Let's say that the floating point number uses N bits of storage.

Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N bit integer requires all of its N bits to represent all of its values, so would be the requirement for this float.

A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, and we must conclude that the assumption that float can precisely represent all integers as equally sized integer type must be erroneous.

Since there must be non-representable integers in the range of a N bit integer, it is possible that converting such integer to a floating point of N bits will lose precision, if the converted value happens to be one of the non-representable ones.


Now, since a floating point can represent a subset of rational numbers, some of those representable values may indeed be integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point can represent all integers up to 253. This property is directly associated with the length of the mantissa.

Therefore it is not possible to lose precision of a 32 bit integer when converting to a double on a system which conforms to IEEE-754.


More technically, the floating point unit of x86 architecture actually uses a 80-bit extended floating point format, which is designed to be able to represent precisely all of 64 bit integers and can be accessed using the long double type.

like image 80
eerorika Avatar answered Oct 14 '22 07:10

eerorika


This may happen if int is 64 bit and double is 64 bit as well. Floating point numbers are composed of mantissa (represents the digits) and exponent. As mantissa for the double in such a case has less bits than the int, then double is able to represent less digits and a loss of precision happens.

like image 35
Juraj Blaho Avatar answered Oct 14 '22 06:10

Juraj Blaho