Why is the result of this explicit cast different from the implicit one?
#include <stdio.h>

double a;
double b;
double c;
long d;
double e;

int main() {
    a = 1.0;
    b = 2.0;
    c = .1;

    d = (b - a + c) / c;
    printf("%li\n", d); // 10

    e = (b - a + c) / c;
    d = (long) e;
    printf("%li\n", d); // 11
}
If I do d = (long) ((b - a + c) / c); I also get 10. Why does the assignment to a double make a difference?
In implicit typecasting, a value of a smaller data type is converted to a larger one. For example, a byte is implicitly converted to short, int, long, float, or double. Converting a lower data type to a higher data type like this is referred to as widening.
There is no special syntax for implicit conversions, and they are the safest kind of conversion: no data is lost when, for example, converting from smaller to larger integral types, or from derived classes to base classes.
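As a minimal sketch in C (the question's language), here is an implicit widening assignment next to an explicit narrowing cast; the values are just for illustration:

#include <stdio.h>

int main(void) {
    int i = 42;
    double d = i;       /* implicit widening: int -> double, no cast needed */
    long l = (long) d;  /* explicit narrowing cast back; any fraction would be truncated */
    printf("%f %ld\n", d, l); /* 42.000000 42 */
}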
I suspect the difference is a conversion from an 80-bit floating point value to a long vs a conversion from an 80-bit floating point value to a 64-bit one and then a conversion to a long.
(The reason for 80 bits coming up at all is that that's a typical precision used for actual arithmetic, and the width of floating point registers.)
Suppose the 80-bit result is something like 10.999999999999999 - the conversion from that to a long yields 10. However, the nearest 64-bit floating point value to the 80-bit value is actually 11.0, so the two-stage conversion ends up yielding 11.
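Here is a minimal C sketch of the two conversion paths, using the question's own variables. This assumes an x87 build that keeps intermediates in extended precision; with SSE math both lines come out as 11:

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;

    /* Path 1: the (possibly 80-bit) intermediate is truncated to long directly. */
    long direct = (long) ((b - a + c) / c);

    /* Path 2: the intermediate is first rounded to a 64-bit double
       (volatile forces the store to memory), then truncated to long. */
    volatile double e = (b - a + c) / c;
    long via_double = (long) e;

    printf("%ld %ld\n", direct, via_double); /* 10 11 on x87, 11 11 with SSE */
}

The volatile is just there to stop the compiler keeping e in a register, which mirrors what the assignment to a double does in the question.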
EDIT: To give this a bit more weight...
Here's a Java program which uses arbitrary-precision arithmetic to do the same calculation. Note that it converts the double value closest to 0.1 into a BigDecimal - that value is 0.1000000000000000055511151231257827021181583404541015625. (In other words, the exact result of the calculation is not 11 anyway.)
import java.math.*;

public class Test
{
    public static void main(String[] args)
    {
        BigDecimal c = new BigDecimal(0.1d);
        BigDecimal a = new BigDecimal(1d);
        BigDecimal b = new BigDecimal(2d);

        BigDecimal result = b.subtract(a)
                             .add(c)
                             .divide(c, 40, RoundingMode.FLOOR);
        System.out.println(result);
    }
}
Here's the result:
10.9999999999999994448884876874217606030632
In other words, that's correct to about 40 decimal digits (way more than either 64-bit or 80-bit floating point can handle).
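For a cross-check from the C side, here's a sketch that runs the same calculation in x87 extended precision via long double, assuming long double is the 80-bit type (as with gcc on x86):

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;
    /* Promote an operand so the whole calculation stays in long double. */
    long double q = ((long double) b - a + c) / c;
    printf("%.21Lg\n", q); /* roughly 10.9999999999999994449, just below 11 */
}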
Now, let's consider what this number looks like in binary. I don't have any tools to easily do the conversion, but again we can use Java to help. Assuming a normalised number, the "10" integer part takes up four bits (the same as for eleven = 1011). That leaves 60 bits of mantissa after the binary point for extended precision (80 bits, with a 64-bit significand) and 49 bits for double precision (64 bits, with a 53-bit significand once the implicit leading bit is counted).
So, what's the closest number to 11 in each precision? Again, let's use Java:
import java.math.*;

public class Test
{
    public static void main(String[] args)
    {
        BigDecimal half = new BigDecimal("0.5");
        BigDecimal eleven = new BigDecimal(11);
        System.out.println(eleven.subtract(half.pow(60)));
        System.out.println(eleven.subtract(half.pow(49)));
    }
}
Results:
10.999999999999999999132638262011596452794037759304046630859375
10.9999999999999982236431605997495353221893310546875
So, the three numbers we've got are:
Correct value: 10.999999999999999444888487687421760603063...
11-2^(-60): 10.999999999999999999132638262011596452794037759304046630859375
11-2^(-49): 10.9999999999999982236431605997495353221893310546875
Now work out the closest representable value to the correct one in each precision: in extended precision it's less than 11, while in double precision it's exactly 11.0. Converting each of those to a long truncates, so you end up with 10 and 11 respectively.
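You can see that truncation step in isolation in C with nextafter; a small sketch, where the printed digits assume IEEE 754 doubles:

#include <math.h>
#include <stdio.h>

int main(void) {
    double below = nextafter(11.0, 0.0); /* closest double below 11, i.e. 11 - 2^-49 */
    printf("%.17g -> %ld\n", below, (long) below); /* 10.999999999999998 -> 10 */
    printf("%.17g -> %ld\n", 11.0, (long) 11.0);   /* 11 -> 11 */
}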
Hopefully this is enough evidence to convince the doubters ;)
I get 10 and 11 on my 32-bit x86 Linux system running gcc 4.3.2, too.
The relevant C/asm is here:
d = (b - a + c) / c;
        fldl    b
        fldl    a
        fsubrp  %st, %st(1)     # b - a
        fldl    c
        faddp   %st, %st(1)     # b - a + c
        fldl    c
        fdivrp  %st, %st(1)     # (b - a + c) / c, still in the 80-bit st(0)
        fnstcw  -6(%ebp)        # save the FPU control word
        movzwl  -6(%ebp), %eax
        movb    $12, %ah        # set rounding control to truncate
        movw    %ax, -8(%ebp)
        fldcw   -8(%ebp)
        fistpl  -12(%ebp)       # convert st(0) to a 32-bit integer directly
        fldcw   -6(%ebp)        # restore the control word
        movl    -12(%ebp), %eax
        movl    %eax, d

printf("%li\n", d); // 10
        movl    d, %eax
        movl    %eax, 4(%esp)
        movl    $.LC3, (%esp)
        call    printf

e = (b - a + c) / c;
        fldl    b
        fldl    a
        fsubrp  %st, %st(1)
        fldl    c
        faddp   %st, %st(1)
        fldl    c
        fdivrp  %st, %st(1)
        fstpl   e               # st(0) is rounded to a 64-bit double here

d = (long) e;
        fldl    e               # reload the already-rounded double
        fnstcw  -6(%ebp)
        movzwl  -6(%ebp), %eax
        movb    $12, %ah
        movw    %ax, -8(%ebp)
        fldcw   -8(%ebp)
        fistpl  -12(%ebp)       # convert the 64-bit value to a 32-bit integer
        fldcw   -6(%ebp)
        movl    -12(%ebp), %eax
        movl    %eax, d
The answer is left as an exercise for the interested reader.
codepad.org (gcc 4.1.2) reverses the results of your example, while on my local system (gcc 4.3.2) I get 11 in both cases. This suggests to me that it is a floating point issue. Alternatively, it could theoretically be truncating (b - a + c), which in an integer context would evaluate to (2 - 1 + 0) / .1 = 10, whereas in a floating-point context (2.0 - 1.0 + 0.1) / .1 = 1.1 / .1 = 11. That would be weird, though.
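The truncation theory is easy to rule out, though: all three operands are doubles, so the usual arithmetic conversions keep the whole expression in floating point. A quick sketch (exact digits assume IEEE 754 doubles):

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;
    printf("%.17g\n", b - a + c);       /* 1.1000000000000001, not 1 */
    printf("%.17g\n", (b - a + c) / c); /* 11 once the quotient is rounded to double */
}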