I'm reading CS:APP, and regarding casts it says that when casting from int to float, the number cannot overflow, but it may be rounded. That seemed odd to me, as I didn't know what there was to round, so I tried it out. I thought this would only be the case for very large integers (near INT_MAX/INT_MIN), but rounding happens at values around a hundred million as well. (I'm not sure exactly where it happens first.)
Why does this happen? The range of float far exceeds that of int. One might say that floating-point numbers cannot be represented exactly, but when converting from int to double there is no change in value. The advantage of double over float is that it has greater range and precision. But float still has enough range to "encapsulate" integers, and precision shouldn't really matter, as integers have no decimal places (well, all zeros) -- or am I thinking about this wrong?
Here's some output that I got (here is the code: http://pastebin.com/K3E3A6Ni):
FLT_MAX = 340282346638528859811704183484516925440.000000
INT_MAX = 2147483647
(float)INT_MAX = 2147483648.000000
(double)INT_MAX = 2147483647.000000
INT_MIN = -2147483648
(float)INT_MIN = -2147483648.000000
==== other values close to INT_MIN / INT_MAX ====
INT_MAX-1 = 2147483646
(float)INT_MAX-1 = 2147483648.000000
INT_MIN+1 = -2147483647
(float)INT_MIN+1 = -2147483648.000000
INT_MAX-2 = 2147483645
(float)INT_MAX-2 = 2147483648.000000
INT_MAX-10 = 2147483637
(float)INT_MAX-10 = 2147483648.000000
INT_MAX-100 = 2147483547
(float)INT_MAX-100 = 2147483520.000000
INT_MAX-1000 = 2147482647
(float)INT_MAX-1000 = 2147482624.000000
(float)1.234.567.809 = 1234567808.000000
(float)1.234.567.800 = 1234567808.000000
(float)1.000.000.005 = 1000000000.000000
(float)800.000.003 = 800000000.000000
(float)500.000.007 = 500000000.000000
(float)100.000.009 = 100000008.000000
As an aside, conversion in the other direction (floating point to int) does not round at all: the cast simply discards the fractional part, truncating toward zero. For example, given double pi = 3.14159; then int x = (int)pi; gives x the value 3, and it would still be 3 even if the fractional part were 0.99999999. Likewise (int)4.0f gives the integer 4. If you want rounding rather than truncation, use round() from <math.h> before converting.
I'm assuming that by float you mean a 32-bit IEEE-754 binary floating point value, by double you mean a 64-bit IEEE-754 binary floating point value, and by int you mean a 32-bit integer.
Why does this happen? The range of float far exceeds that of int

Yes, but the precision of float is only 7-9 significant decimal digits. To be more specific, the significand is only 24 bits wide... so if you're trying to store 32 bits of information in there, you're going to have problems.
but when converting from int to double there is no change in value

Sure, because a double has a 53-bit significand: plenty of room for a 32-bit integer there!
To think of it another way, the gap between consecutive int values is always 1... whereas the gap between consecutive float values starts very, very small, but increases as the magnitude of the value increases. It gets to "more than 2" well before you hit the limit of int... so you get to the stage where not every int can be exactly represented.
To think of it another way, you can simply use the pigeonhole principle: even ignoring NaN values, there can be at most 2^32 float values, and at least one of those is not the exact value of an int -- take 0.5, for example. There are 2^32 int values, therefore at least one int value doesn't have an exact float representation.
A typical float that is implemented with the 32-bit IEEE-754 representation has only 24 bits for the significand, which allows for about 7 decimal digits of precision. So you'll see rounding as soon as you go past 2^24 ≈ 16.7 million.
(For a double, the significand has 53 bits, and 2^53 ≈ 9×10^15.)