I'm currently learning inter-type data conversion in C++. I have been taught that <blockquote> For a really large int, we can (for some computers) suffer a loss of precision when converting to double. </blockquote> But no reason was given for the statement. Could someone please provide an explanation and an example? Thanks


Converting a really large int to double: loss of precision on some computers

2 Answers

Let's say that the floating point number uses N bits of storage.

Now, let us assume that this float can precisely represent all integers that can be represented by an integer type of N bits. Since the N-bit integer needs all N of its bits to represent all of its values, the float would need all N of its bits for those integers as well.

A floating point number should be able to represent fractional numbers. However, since all of the bits are used to represent the integers, there are zero bits left to represent any fractional number. This is a contradiction, so the assumption that the float can precisely represent every value of an equally sized integer type must be wrong.

Since there must be non-representable integers in the range of an N-bit integer, converting such an integer to an N-bit floating point type loses precision whenever the value happens to be one of the non-representable ones.

Now, since a floating point type can represent a subset of the rational numbers, some of those representable values are indeed integers. In particular, the IEEE-754 spec guarantees that a binary double precision floating point number can represent every integer with a magnitude up to 2⁵³. This property follows directly from the length of the mantissa (53 bits of precision, 52 of which are stored explicitly).

Therefore converting a 32-bit integer to a double cannot lose precision on a system that conforms to IEEE-754.

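More technically, the x87 floating point unit of the x86 architecture uses an 80-bit extended floating point format whose 64-bit mantissa is designed to represent every 64-bit integer precisely. With some compilers (GCC and Clang on x86, for example) this format can be accessed through the long double type; others, such as MSVC, make long double the same 64-bit format as double.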

answered Oct 14 '22 07:10

eerorika

This may happen if int is 64 bits wide and double is 64 bits wide as well. Floating point numbers are composed of a mantissa (which represents the digits) and an exponent. Because the mantissa of the double in such a case has fewer bits than the int (an IEEE-754 double has 53 bits of precision), the double can represent fewer digits, and a loss of precision happens.

answered Oct 14 '22 06:10

Juraj Blaho

Tags: c++, types, integer, casting, double

Thor
