Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference in casting float to int, 32-bit C

Tags:

c

casting

32-bit

I currently working with an old code that needs to run a 32-bit system. During this work I stumbled across an issue that (out of academic interest) I would like to understand the cause of.

It seems that casting from float to int in 32-bit C behaves differently if the cast is done on a variable or on an expression. Consider the program:

#include <stdio.h>
int main() {
   int i,c1,c2;
   float f1,f10;
   for (i=0; i< 21; i++)  {
      f1 = 3+i*0.1;
      f10 = f1*10.0;
      c1 = (int)f10;
      c2 = (int)(f1*10.0);
      printf("%d, %d, %d, %11.9f, %11.9f\n",c1,c2,c1-c2,f10,f1*10.0);
   }
}

Compiled (using gcc) either directly on a 32-bit system or on a 64-bit system using the -m32 modifier the output of the program is:

30, 30, 0, 30.000000000 30.000000000
31, 30, 1, 31.000000000 30.999999046
32, 32, 0, 32.000000000 32.000000477
33, 32, 1, 33.000000000 32.999999523
34, 34, 0, 34.000000000 34.000000954
35, 35, 0, 35.000000000 35.000000000
36, 35, 1, 36.000000000 35.999999046
37, 37, 0, 37.000000000 37.000000477
38, 37, 1, 38.000000000 37.999999523
39, 39, 0, 39.000000000 39.000000954
40, 40, 0, 40.000000000 40.000000000
41, 40, 1, 41.000000000 40.999999046
42, 41, 1, 42.000000000 41.999998093
43, 43, 0, 43.000000000 43.000001907
44, 44, 0, 44.000000000 44.000000954
45, 45, 0, 45.000000000 45.000000000
46, 45, 1, 46.000000000 45.999999046
47, 46, 1, 47.000000000 46.999998093
48, 48, 0, 48.000000000 48.000001907
49, 49, 0, 49.000000000 49.000000954
50, 50, 0, 50.000000000 50.000000000 

Hence, it is clear that a difference exists between casting a variable and an expression. Note, that the issue exists also if float is changed to double and/or int is changed to short or long, also the issue do not manifest if program is compiled as 64-bit.

To clarify, the issue that I'm trying to understand here is not about floating-point arithmetic/rounding, but rather differences in memory handling in 32-bit.

The issue were tested on:

  • Linux version 4.15.0-45-generic (buildd@lgw01-amd64-031) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)), program compiled using: gcc -m32 Cast32int.c

  • Linux version 2.4.20-8 ([email protected]) (gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)), program compiled using: gcc Cast32int.c

Any pointers to help me understand what is going on here are appreciated.

like image 412
cpaitor Avatar asked Feb 26 '19 08:02

cpaitor


People also ask

What does casting a float to an int do?

Since a float is bigger than int, you can convert a float to an int by simply down-casting it e.g. (int) 4.0f will give you integer 4. By the way, you must remember that typecasting just get rid of anything after the decimal point, they don't perform any rounding or flooring operation on the value.

Can any 32-bit integer be represented in a 32-bit float?

Yes. All N bit ints can be represented in a floating point representation that has at least N-1 mantissa bits (because of the implicit leading 1 bit that doesn't need to be stored) and an exponent that can store at least N, i.e. has log(N)+1 bits.

Is float guaranteed to be 32-bit?

Knowing that the sizeof(float) is 32 bits may be sufficient, in general, but insufficient in all cases. IEEE 758 defines the well known 32-bit binary32 which is commonly implemented. But IEEE 758 also defines a 32-bit decimal32, which is primarily use for storage and not computation.

Are floats 32 or 64 bit?

Floats generally come in two flavours: “single” and “double” precision. Single precision floats are 32-bits in length while “doubles” are 64-bits. Due to the finite size of floats, they cannot represent all of the real numbers - there are limitations on both their precision and range.


4 Answers

With MS Visual C 2008 I was able to reproduce this.

Inspecting the assembler, the difference between the two is an intermediate store and fetch of a result with intermediate conversions:

  f10 = f1*10.0;          // double result f10 converted to float and stored
  c1 = (int)f10;          // float result f10 fetched and converted to double
  c2 = (int)(f1*10.0);    // no store/fetch/convert

The assembler generated pushes values onto the FPU stack that get converted to 64 bits and then are multiplied. For c1 the result is then converted back to float and stored and is then retrieved again and placed on the FPU stack (and converted to double again) for a call to __ftol2_sse, a run-time function to convert a double to int.

For c2 the intermediate value is not converted to and from float and passed immediately to the __ftol2_sse function. For this function see also the answer at Convert double to int?.

Assembler:

      f10 = f1*10;
fld         dword ptr [f1] 
fmul        qword ptr [__real@4024000000000000 (496190h)] 
fstp        dword ptr [f10] 

      c2 = (int)(f1*10);
fld         dword ptr [f1] 
fmul        qword ptr [__real@4024000000000000 (496190h)] 
call        __ftol2_sse
mov         dword ptr [c2],eax 

      c1 = (int)f10;
fld         dword ptr [f10] 
call        __ftol2_sse
mov         dword ptr [c1],eax 
like image 125
Paul Ogilvie Avatar answered Oct 16 '22 21:10

Paul Ogilvie


In the “32-bit system,” the difference is caused by the fact that f1*10.0 uses full double precision, while f10 has only float precision because that is its type. f1*10.0 uses double precision because 10.0 is a double constant. When f1*10.0 is assigned to f10, the value changes because it is implicitly converted to float, which has less precision.

If you use the float constant 10.0f instead, the differences vanish.

Consider the first case, when i is 1. Then:

  • In f1 = 3+i*0.1, 0.1 is a double constant, so the arithmetic is performed in double, and the result is 3.100000000000000088817841970012523233890533447265625. Then, to assign this to f1, it is converted to float, which produces 3.099999904632568359375.
  • In f10 = f1*10.0;, 10.0 is a double constant, so the arithmetic is again performed in double, and the result is 30.99999904632568359375. For assignment to f10, this is converted to float, and the result is 31.
  • Later, when f10 and f1*10.0 are printed, we see the values given above, with nine digits after the decimal point, “31.000000000” for f10, and “30.999999046”.

If you print f1*10.0f, with the float constant 10.0f instead of the double constant 10.0, the result will be “31.000000000” rather than “30.999999046”.

(The above uses IEEE-754 basic 32-bit and 64-bit binary floating-point arithmetic.)

In particular, note this: The difference between f1*10.0 and f10 arises when f1*10.0 is converted to float for assignment to f10. While C permits implementations to use extra precision in evaluating expressions, it requires implementations to discard this precision in assignments and casts. Therefore, in a standard-conforming compiler, the assignment to f10 must use float precision. This means, even when the program is compiled for a “64-bit system,” the differences should occur. If they do not, the compiler does not conforming to the C standard.

Furthermore, if float is changed to double, the conversion to float does not happen, and the value will not be changed. In this case, no differences between f1*10.0 and f10 should manifest.

Given that the question reports the differences do not manifest with a “64-bit” compilation and do manifest with double, it is questionable whether the observations have been reported accurately. To clarify this, the exact code should be shown, and the observations should be reproduced by a third party.

like image 41
Eric Postpischil Avatar answered Oct 16 '22 19:10

Eric Postpischil


The C standard is not very strict about how floating point math is to be carried out. The standard allows an implementation to do calculation with higher precision than the involved types.

The result in your case is likely to come from the fact that c1 is calculated as "float-to-int" while c2 is calculated as "double-to-int" (or even higher precision).

Here is another example showing the same behavior.

#define DD 0.11111111

int main()
{
  int i = 27;

  int c1,c2,c3;
  float f1;
  double d1;
  printf("%.60f\n", DD);

  f1 = i * DD;
  d1 = i * DD;
  c1 = (int)f1;
  c2 = (int)(i * DD);
  c3 = (int)d1;

  printf("----------------------\n");
  printf("f1: %.60f\n", f1);
  printf("d1: %.60f\n", d1);
  printf("m : %.60f\n", i * DD);
  printf("%d, %d, %d\n",c1,c2,c3);
}

My output:

0.111111109999999999042863407794357044622302055358886718750000
----------------------
f1: 3.000000000000000000000000000000000000000000000000000000000000
d1: 2.999999970000000182324129127664491534233093261718750000000000
m : 2.999999970000000182324129127664491534233093261718750000000000
3, 2, 2

The trick here is the number of ones in 0.11111111. The accurate result is "2.99999997". As you change the number of ones the accurate result is still in the form "2.99...997" (i.e. the number of 9 increases when the number of 1 increases).

At some point (aka some number of ones) you will reach a point where storing the result in a float rounds the result to "3.0" while the double is still able to hold "2.999999.....". Then a conversion to int will give different results.

Increasing the number of ones further will lead to a point where the double will also round to "3.0" and the conversion to int will consequently yield the same result.

like image 28
Support Ukraine Avatar answered Oct 16 '22 21:10

Support Ukraine


The main reason is that the rounding-control (RC) field of the x87 FPU control register values are inconsistent in the follow two lines. eventually the values of c1 and c2 are different.

0x08048457 <+58>:    fstps  0x44(%esp)
0x0804848b <+110>:   fistpl 0x3c(%esp)

Add gcc compile option -mfpmath=387 -mno-sse, it can be reproduced (Even without -m32, or change the float to a double)
Like this:

gcc -otest test.c -g -mfpmath=387 -mno-sse -m32

Then use gdb to debug, breakpoint at 0x0804845b, and run to i=1

    0x08048457 <+58>:    fstps  0x44(%esp)
    0x0804845b <+62>:    flds   0x44(%esp)

    (gdb) info float
    =>R7: Valid   0x4003f7ffff8000000000 +30.99999904632568359      
      R6: Empty   0x4002a000000000000000
      R5: Empty   0x00000000000000000000
      R4: Empty   0x00000000000000000000
      R3: Empty   0x00000000000000000000
      R2: Empty   0x00000000000000000000
      R1: Empty   0x00000000000000000000
      R0: Empty   0x00000000000000000000

    Status Word:         0x3820                  PE                        
                           TOP: 7
    Control Word:        0x037f   IM DM ZM OM UM PM
                           PC: Extended Precision (64-bits)
                           RC: Round to nearest
    Tag Word:            0x3fff
    Instruction Pointer: 0x00:0x08048455
    Operand Pointer:     0x00:0x00000000
    Opcode:              0x0000

    (gdb) x /xw 0x44+$esp
    0xffffb594:     0x41f80000 ==> 31.0, s=0, M=1.1111 E=4

observe the execution results of fstps,
at this time, the RC value on the control register on the fpu is Round to nearest.
the value on the fpu register is 30.99999904632568359 (80bit).
the value on 0x44(%esp) (variable "f10") is 31.0. (round to nearest)

Then use gdb to debug, breakpoint at 0x0804848b, and run to i=1

    0x0804848b <+110>:   fistpl 0x3c(%esp)

    (gdb) info float
    =>R7: Valid   0x4003f7ffff8000000000 +30.99999904632568359      
      R6: Empty   0x4002a000000000000000
      R5: Empty   0x00000000000000000000
      R4: Empty   0x00000000000000000000
      R3: Empty   0x00000000000000000000
      R2: Empty   0x00000000000000000000
      R1: Empty   0x00000000000000000000
      R0: Empty   0x00000000000000000000

    Status Word:         0x3820                  PE                        
                           TOP: 7
    Control Word:        0x0c7f   IM DM ZM OM UM PM
                           PC: Single Precision (24-bits)
                           RC: Round toward zero
    Tag Word:            0x3fff
    Instruction Pointer: 0x00:0x08048485
    Operand Pointer:     0x00:0x00000000
    Opcode:              0x0000

at this time, the RC value on the control register on the fpu is Round toward zero.
the value on the fpu register is 30.99999904632568359 (80bit). value is the same as above
obviously when the integer is converted, the decimal point is truncated, and the value is 30.

Below is the main decompiled code

    (gdb) disas main
    Dump of assembler code for function main:
       0x0804841d <+0>:     push   %ebp
       0x0804841e <+1>:     mov    %esp,%ebp
       0x08048420 <+3>:     and    $0xfffffff0,%esp
       0x08048423 <+6>:     sub    $0x50,%esp
       0x08048426 <+9>:     movl   $0x0,0x4c(%esp)
       0x0804842e <+17>:    jmp    0x80484de <main+193>
       0x08048433 <+22>:    fildl  0x4c(%esp)
       0x08048437 <+26>:    fldl   0x80485a8
       0x0804843d <+32>:    fmulp  %st,%st(1)
       0x0804843f <+34>:    fldl   0x80485b0
       0x08048445 <+40>:    faddp  %st,%st(1)
       0x08048447 <+42>:    fstps  0x48(%esp)
       0x0804844b <+46>:    flds   0x48(%esp)
       0x0804844f <+50>:    flds   0x80485b8
       0x08048455 <+56>:    fmulp  %st,%st(1)
       0x08048457 <+58>:    fstps  0x44(%esp)        // store to f10
       0x0804845b <+62>:    flds   0x44(%esp)
       0x0804845f <+66>:    fnstcw 0x2a(%esp)
       0x08048463 <+70>:    movzwl 0x2a(%esp),%eax
       0x08048468 <+75>:    mov    $0xc,%ah
       0x0804846a <+77>:    mov    %ax,0x28(%esp)
       0x0804846f <+82>:    fldcw  0x28(%esp)
       0x08048473 <+86>:    fistpl 0x40(%esp)
       0x08048477 <+90>:    fldcw  0x2a(%esp)
       0x0804847b <+94>:    flds   0x48(%esp)
       0x0804847f <+98>:    fldl   0x80485c0
       0x08048485 <+104>:   fmulp  %st,%st(1)
       0x08048487 <+106>:   fldcw  0x28(%esp)
       0x0804848b <+110>:   fistpl 0x3c(%esp)       // f1 * 10 convert int
       0x0804848f <+114>:   fldcw  0x2a(%esp)
       0x08048493 <+118>:   flds   0x48(%esp)
       0x08048497 <+122>:   fldl   0x80485c0
       0x0804849d <+128>:   fmulp  %st,%st(1)
       0x0804849f <+130>:   flds   0x44(%esp)
       0x080484a3 <+134>:   fxch   %st(1)
       0x080484a5 <+136>:   mov    0x3c(%esp),%eax
       0x080484a9 <+140>:   mov    0x40(%esp),%edx
       0x080484ad <+144>:   sub    %eax,%edx
       0x080484af <+146>:   mov    %edx,%eax
       0x080484b1 <+148>:   fstpl  0x18(%esp)
       0x080484b5 <+152>:   fstpl  0x10(%esp)
       0x080484b9 <+156>:   mov    %eax,0xc(%esp)
       0x080484bd <+160>:   mov    0x3c(%esp),%eax
       0x080484c1 <+164>:   mov    %eax,0x8(%esp)
       0x080484c5 <+168>:   mov    0x40(%esp),%eax
       0x080484c9 <+172>:   mov    %eax,0x4(%esp)
       0x080484cd <+176>:   movl   $0x8048588,(%esp)
       0x080484d4 <+183>:   call   0x80482f0 <printf@plt>
       0x080484d9 <+188>:   addl   $0x1,0x4c(%esp)
       0x080484de <+193>:   cmpl   $0x14,0x4c(%esp)
       0x080484e3 <+198>:   jle    0x8048433 <main+22>
       0x080484e9 <+204>:   leave  
       0x080484ea <+205>:   ret
like image 29
ccxxshow Avatar answered Oct 16 '22 20:10

ccxxshow