Why is the result of this explicit cast different from the implicit one?

#include <stdio.h>

double  a;
double  b;
double  c;

long    d;

double    e;

int main() {
    a = 1.0;
    b = 2.0;
    c = .1;

    d = (b - a + c) / c;
    printf("%li\n", d);        //    10

    e = (b - a + c) / c;
    d = (long) e;
    printf("%li\n", d);        //    11
}

If I do d = (long) ((b - a + c) / c); I also get 10. Why does the assignment to a double make a difference?

asked Apr 15 '09 by Dennis Williamson


3 Answers

I suspect the difference is between converting an 80-bit floating point value directly to a long, versus first converting the 80-bit value to a 64-bit double and then converting that to a long.

(The reason 80 bits comes up at all is that it's the precision typically used for the actual arithmetic - it's the width of the x87 floating point registers.)

Suppose the 80-bit result is something like 10.999999999999999 - the conversion from that to a long yields 10. However, the nearest 64-bit floating point value to the 80-bit value is actually 11.0, so the two-stage conversion ends up yielding 11.
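
Here's a minimal C sketch of that theory. It assumes an x86 target, where long double is the 80-bit x87 format; on a platform where long double is just a 64-bit double, both lines will typically print 11.

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;

    /* Keep the quotient in extended precision: on x86, long double
       is the 80-bit x87 format, so nothing gets rounded to 64 bits. */
    long double q = ((long double) b - a + c) / c;
    printf("%li\n", (long) q);      /* truncates 10.99999... to 10 */

    /* Round to a 64-bit double first: the nearest double to the 80-bit
       quotient is exactly 11.0, so the truncation now yields 11. */
    volatile double e = (double) q;
    printf("%li\n", (long) e);      /* 11 */
    return 0;
}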

EDIT: To give this a bit more weight...

Here's a Java program which uses arbitrary-precision arithmetic to do the same calculation. Note that it converts the double value closest to 0.1 into a BigDecimal - that value is 0.1000000000000000055511151231257827021181583404541015625. (In other words, the exact result of the calculation is not 11 anyway.)

import java.math.*;

public class Test
{
    public static void main(String[] args)
    {
        BigDecimal c = new BigDecimal(0.1d);        
        BigDecimal a = new BigDecimal(1d);
        BigDecimal b = new BigDecimal(2d);

        BigDecimal result = b.subtract(a)
                             .add(c)
                             .divide(c, 40, RoundingMode.FLOOR);
        System.out.println(result);
    }
}

Here's the result:

10.9999999999999994448884876874217606030632

In other words, that's the correct value to 40 decimal places (way more precision than either 64-bit or 80-bit floating point can handle).
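
For comparison, here's a small C sketch printing the extended-precision quotient directly (again assuming long double is the 80-bit x87 format); it agrees with the BigDecimal value for roughly the first 19 significant digits:

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;

    /* The same expression, kept in 80-bit extended precision. */
    long double q = ((long double) b - a + c) / c;

    /* Prints roughly 10.9999999999999994449 - just below 11. */
    printf("%.19Lf\n", q);
    return 0;
}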

Now, let's consider what this number looks like in binary. I don't have any tools to do the conversion easily, but again we can use Java to help. The number is just below 11, so its integer part is 1010 in binary, and a normalised significand looks like 1.010111... x 2^3: the leading 1 is the first bit of the significand, and the next three bits finish off the integer part. That leaves 60 bits of mantissa for the fraction in extended precision (which has a 64-bit significand) and 49 bits in double precision (which has a 53-bit significand).

So, what's the closest representable number below 11 in each precision? Again, let's use Java:

import java.math.*;

public class Test
{
    public static void main(String[] args)
    {
        BigDecimal half = new BigDecimal("0.5");        
        BigDecimal eleven = new BigDecimal(11);

        System.out.println(eleven.subtract(half.pow(60)));
        System.out.println(eleven.subtract(half.pow(49)));
    }
}

Results:

10.999999999999999999132638262011596452794037759304046630859375
10.999999999999996447286321199499070644378662109375

So, the three numbers we've got are:

Correct value: 10.999999999999999444888487687421760603063...
11-2^(-60): 10.999999999999999999132638262011596452794037759304046630859375
11-2^(-49): 10.9999999999999982236431605997495353221893310546875

Now work out which representable value is closest to the correct one in each precision. In extended precision, it's a value just below 11; in double precision, the correct value is nearer to 11.0 than to 11-2^(-49), so it rounds up to exactly 11.0. Convert each of those to a long (the conversion truncates), and you end up with 10 and 11 respectively.
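
As a quick cross-check in C: nextafter from math.h returns the adjacent representable double, so the largest double below 11.0 (and the 2^(-49) gap to it) can be printed directly. A minimal sketch; compile with -lm:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Largest double below 11.0, i.e. 11 - 2^(-49). */
    printf("%.49f\n", nextafter(11.0, 0.0));

    /* The gap to 11.0 is one ulp: about 1.77636e-15, which is 2^(-49). */
    printf("%g\n", 11.0 - nextafter(11.0, 0.0));
    return 0;
}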

Hopefully this is enough evidence to convince the doubters ;)

answered by Jon Skeet


I get 10 and 11 on my 32-bit x86 Linux system running gcc 4.3.2, too.

The relevant C/asm is here:

    d = (b - a + c) / c;

        fldl    b               # push b onto the x87 stack
        fldl    a
        fsubrp  %st, %st(1)     # b - a
        fldl    c
        faddp   %st, %st(1)     # (b - a) + c
        fldl    c
        fdivrp  %st, %st(1)     # / c -- result held in an 80-bit register
        fnstcw  -6(%ebp)        # save the FPU control word
        movzwl  -6(%ebp), %eax
        movb    $12, %ah        # set rounding control to truncate
        movw    %ax, -8(%ebp)
        fldcw   -8(%ebp)
        fistpl  -12(%ebp)       # 80-bit value converted straight to integer
        fldcw   -6(%ebp)        # restore the control word
        movl    -12(%ebp), %eax
        movl    %eax, d

    printf("%li\n", d);        //    10

        movl    d, %eax
        movl    %eax, 4(%esp)
        movl    $.LC3, (%esp)
        call    printf

    e = (b - a + c) / c;

        fldl    b
        fldl    a
        fsubrp  %st, %st(1)
        fldl    c
        faddp   %st, %st(1)
        fldl    c
        fdivrp  %st, %st(1)
        fstpl   e               # the store rounds the 80-bit result to a 64-bit double

    d = (long) e;

        fldl    e               # reload the already-rounded 64-bit value
        fnstcw  -6(%ebp)
        movzwl  -6(%ebp), %eax
        movb    $12, %ah
        movw    %ax, -8(%ebp)
        fldcw   -8(%ebp)
        fistpl  -12(%ebp)       # truncates 11.0 to 11
        fldcw   -6(%ebp)
        movl    -12(%ebp), %eax
        movl    %eax, d
The answer is left as an exercise for the interested reader.
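
One hint for that reader: C99 exposes how a platform evaluates floating point intermediates as FLT_EVAL_METHOD in <float.h>, so a minimal check like this shows whether you're on x87-style extended precision at all:

#include <float.h>
#include <stdio.h>

int main(void) {
    /* 0 means intermediates use the declared type (typical with SSE);
       2 means they are evaluated in long double (typical with x87). */
    printf("FLT_EVAL_METHOD = %d\n", (int) FLT_EVAL_METHOD);
    return 0;
}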

answered by user47559


codepad.org (gcc 4.1.2) reverses the results of your example, while on my local system (gcc 4.3.2) I get 11 in both cases. That suggests to me that it's a floating point precision issue. Alternatively, it could theoretically be that (b - a + c) is truncated to an integer, in which case it would evaluate to (2 - 1 + 0) / .1 = 10, whereas in floating point (2.0 - 1.0 + 0.1) / .1 = 1.1 / .1 = 11. That would be weird, though.
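
One quick way to rule the truncation theory in or out is to print the intermediate sum with plenty of digits; if integer truncation were happening it would show 1.0 rather than something close to 1.1. A minimal sketch:

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;

    /* Prints a value very close to 1.1, not 1.0, so (b - a + c) is
       not being truncated - the 10 vs 11 difference comes from
       floating point precision instead. */
    printf("%.20f\n", b - a + c);
    return 0;
}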

answered by Simon Broadhead