Why is the result of this explicit cast different from the implicit one?
#include <stdio.h>

double a;
double b;
double c;
long d;
double e;

int main() {
    a = 1.0;
    b = 2.0;
    c = .1;

    d = (b - a + c) / c;
    printf("%li\n", d); // 10

    e = (b - a + c) / c;
    d = (long) e;
    printf("%li\n", d); // 11
}
If I do d = (long) ((b - a + c) / c); I also get 10. Why does the assignment to a double make a difference?
In implicit typecasting, a value of a smaller data type is converted to a larger one. For example, a byte is implicitly converted to short, int, long, float, or double. Converting a lower data type to a higher data type like this is referred to as widening.
There is no special syntax for implicit conversions, and they are the safest kind of conversion: no data is lost when, for example, converting from smaller to larger integral types, or from derived classes to base classes.
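As a minimal sketch in C (the question's language), here is an implicit widening assignment next to an explicit narrowing cast; the values are just for illustration:

#include <stdio.h>

int main(void) {
    int i = 42;
    double d = i;       /* implicit widening: int -> double, no cast needed */
    long l = (long) d;  /* explicit narrowing cast back; any fraction would be truncated */
    printf("%f %ld\n", d, l); /* 42.000000 42 */
}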
I suspect the difference is a conversion from an 80-bit floating point value to a long vs a conversion from an 80-bit floating point value to a 64-bit one and then a conversion to a long.
(The reason for 80 bits coming up at all is that that's a typical precision used for actual arithmetic, and the width of floating point registers.)
Suppose the 80-bit result is something like 10.999999999999999 - the conversion from that to a long yields 10. However, the nearest 64-bit floating point value to the 80-bit value is actually 11.0, so the two-stage conversion ends up yielding 11.
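Here is a minimal C sketch of the two conversion paths, using the question's own variables. This assumes an x87 build that keeps intermediates in extended precision; with SSE math both lines come out as 11:

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;

    /* Path 1: the (possibly 80-bit) intermediate is truncated to long directly. */
    long direct = (long) ((b - a + c) / c);

    /* Path 2: the intermediate is first rounded to a 64-bit double
       (volatile forces the store to memory), then truncated to long. */
    volatile double e = (b - a + c) / c;
    long via_double = (long) e;

    printf("%ld %ld\n", direct, via_double); /* 10 11 on x87, 11 11 with SSE */
}

The volatile is just there to stop the compiler keeping e in a register, which mirrors what the assignment to a double does in the question.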
EDIT: To give this a bit more weight...
Here's a Java program which uses arbitrary-precision arithmetic to do the same calculation. Note that it converts the double value closest to 0.1 into a BigDecimal - that value is 0.1000000000000000055511151231257827021181583404541015625. (In other words, the exact result of the calculation is not 11 anyway.)
import java.math.*;

public class Test
{
    public static void main(String[] args)
    {
        BigDecimal c = new BigDecimal(0.1d);
        BigDecimal a = new BigDecimal(1d);
        BigDecimal b = new BigDecimal(2d);

        BigDecimal result = b.subtract(a)
                             .add(c)
                             .divide(c, 40, RoundingMode.FLOOR);
        System.out.println(result);
    }
}
Here's the result:
10.9999999999999994448884876874217606030632
In other words, that's correct to about 40 decimal digits (way more than either 64-bit or 80-bit floating point can handle).
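For a cross-check from the C side, here's a sketch that runs the same calculation in x87 extended precision via long double, assuming long double is the 80-bit type (as with gcc on x86):

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;
    /* Promote an operand so the whole calculation stays in long double. */
    long double q = ((long double) b - a + c) / c;
    printf("%.21Lg\n", q); /* roughly 10.9999999999999994449, just below 11 */
}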
Now, let's consider what this number looks like in binary. I don't have any tools to easily do the conversion, but again we can use Java to help. Assuming a normalised number, the "10" integer part takes up four bits (the same as for eleven = 1011). That leaves 60 bits of mantissa after the binary point for extended precision (80 bits, with a 64-bit significand) and 49 bits for double precision (64 bits, with a 53-bit significand once the implicit leading bit is counted).
So, what's the closest number to 11 in each precision? Again, let's use Java:
import java.math.*;

public class Test
{
    public static void main(String[] args)
    {
        BigDecimal half = new BigDecimal("0.5");
        BigDecimal eleven = new BigDecimal(11);
        System.out.println(eleven.subtract(half.pow(60)));
        System.out.println(eleven.subtract(half.pow(49)));
    }
}
Results:
10.999999999999999999132638262011596452794037759304046630859375
10.9999999999999982236431605997495353221893310546875
So, the three numbers we've got are:
Correct value: 10.999999999999999444888487687421760603063...
11-2^(-60): 10.999999999999999999132638262011596452794037759304046630859375
11-2^(-49): 10.9999999999999982236431605997495353221893310546875
Now work out the closest representable value to the correct one in each precision: in extended precision it's less than 11, while in double precision it's exactly 11.0. Converting each of those to a long truncates, so you end up with 10 and 11 respectively.
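You can see that truncation step in isolation in C with nextafter; a small sketch, where the printed digits assume IEEE 754 doubles:

#include <math.h>
#include <stdio.h>

int main(void) {
    double below = nextafter(11.0, 0.0); /* closest double below 11, i.e. 11 - 2^-49 */
    printf("%.17g -> %ld\n", below, (long) below); /* 10.999999999999998 -> 10 */
    printf("%.17g -> %ld\n", 11.0, (long) 11.0);   /* 11 -> 11 */
}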
Hopefully this is enough evidence to convince the doubters ;)
I get 10 and 11 on my 32-bit x86 Linux system running gcc 4.3.2, too.
The relevant C/asm is here:
d = (b - a + c) / c;
        fldl    b
        fldl    a
        fsubrp  %st, %st(1)     # b - a
        fldl    c
        faddp   %st, %st(1)     # b - a + c
        fldl    c
        fdivrp  %st, %st(1)     # (b - a + c) / c, still in the 80-bit st(0)
        fnstcw  -6(%ebp)        # save the FPU control word
        movzwl  -6(%ebp), %eax
        movb    $12, %ah        # set rounding control to truncate
        movw    %ax, -8(%ebp)
        fldcw   -8(%ebp)
        fistpl  -12(%ebp)       # convert st(0) to a 32-bit integer directly
        fldcw   -6(%ebp)        # restore the control word
        movl    -12(%ebp), %eax
        movl    %eax, d

printf("%li\n", d); // 10
        movl    d, %eax
        movl    %eax, 4(%esp)
        movl    $.LC3, (%esp)
        call    printf

e = (b - a + c) / c;
        fldl    b
        fldl    a
        fsubrp  %st, %st(1)
        fldl    c
        faddp   %st, %st(1)
        fldl    c
        fdivrp  %st, %st(1)
        fstpl   e               # st(0) is rounded to a 64-bit double here

d = (long) e;
        fldl    e               # reload the already-rounded double
        fnstcw  -6(%ebp)
        movzwl  -6(%ebp), %eax
        movb    $12, %ah
        movw    %ax, -8(%ebp)
        fldcw   -8(%ebp)
        fistpl  -12(%ebp)       # convert the 64-bit value to a 32-bit integer
        fldcw   -6(%ebp)
        movl    -12(%ebp), %eax
        movl    %eax, d
The answer is left as an exercise for the interested reader.
codepad.org (gcc 4.1.2) reverses the results of your example, while on my local system (gcc 4.3.2) I get 11 in both cases. This suggests to me that it is a floating point issue. Alternatively, it could theoretically be truncating (b - a + c), which in an integer context would evaluate to (2 - 1 + 0) / .1 = 10, whereas in a floating-point context (2.0 - 1.0 + 0.1) / .1 = 1.1 / .1 = 11. That would be weird, though.
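The truncation theory is easy to rule out, though: all three operands are doubles, so the usual arithmetic conversions keep the whole expression in floating point. A quick sketch (exact digits assume IEEE 754 doubles):

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 2.0, c = 0.1;
    printf("%.17g\n", b - a + c);       /* 1.1000000000000001, not 1 */
    printf("%.17g\n", (b - a + c) / c); /* 11 once the quotient is rounded to double */
}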