I stumbled upon a difference in the way floating point arithmetic is done between MS VS 2010 builds for x86 and x64 (both executed on the same 64-bit machine).
This is a reduced code sample:
float a = 50.0f;
float b = 65.0f;
float c = 1.3f;
float d = a*c;
bool bLarger1 = d<b;
bool bLarger2 = (a*c)<b;
The boolean bLarger1 is always false (d is set to 65.0 in both builds). Variable bLarger2 is false for x64 but true for x86!
I am well aware of floating point arithmetic and the rounding effects taking place. I also know that 32-bit builds sometimes use different instructions for floating point operations than 64-bit builds. But in this case I am missing some information.
Why is there a discrepancy between bLarger1 and bLarger2 in the first place? Why is it only present in the 32-bit build?
Floating Point Numbers
Floats generally come in two flavours: “single” and “double” precision. Single precision floats are 32 bits in length while “doubles” are 64 bits. Due to the finite size of floats, they cannot represent all of the real numbers: there are limitations on both their precision and range.
Float is a data type used to represent floating point numbers. It is a 32-bit IEEE 754 single precision floating point number (1 bit for the sign, 8 bits for the exponent, 23 bits for the fraction). It has at least 6 significant decimal digits of precision.
The size of an 'int pointer' can be 64 bits on 64-bit machines, since the memory address size is 64 bits. That means your 'argument' isn't valid. A float is then still a float too: usually we say it is 32 bits, but an implementation is free to deviate from that.
No, an IEEE 754 double-precision floating point number is always 64 bits. Similarly, a single-precision float is always 32 bits.
The issue hinges on this expression:
bool bLarger2 = (a*c)<b;
I looked at the code generated under VS2008, not having VS2010 to hand. For 64 bit the code is:
000000013FD51100 movss xmm1,dword ptr [a]
000000013FD51106 mulss xmm1,dword ptr [c]
000000013FD5110C movss xmm0,dword ptr [b]
000000013FD51112 comiss xmm0,xmm1
For 32 bit the code is:
00FC14DC fld dword ptr [a]
00FC14DF fmul dword ptr [c]
00FC14E2 fld dword ptr [b]
00FC14E5 fcompp
So under 32 bit the calculation is performed in the x87 unit, and under 64 bit it is performed by the SSE unit.
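And the difference here is that the x87 operations are all performed to higher than single precision. By default the calculations are performed to double precision. On the other hand the SSE unit operations are pure single precision calculations. The exact product of a and c is slightly below 65. At double precision it stays below 65, so the comparison with b is true. Rounded to single precision it becomes exactly 65.0, which is also why d, once stored into a float, compares equal to b in both builds.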
You can persuade the x87 unit in a 32 bit build to perform all calculations to single precision accuracy like this (_controlfp is declared in <float.h>):
_controlfp(_PC_24, _MCW_PC);
When you add that to your 32 bit program you will find that the booleans are both set to false.
There is a fundamental difference in the way that the x87 and SSE floating point units work. The x87 unit uses the same instructions for both single and double precision types. Data is loaded into registers in the x87 FPU stack, and those registers are always 10 byte (80 bit) Intel extended precision. You can control the precision using the floating point control word. But the instructions that the compiler writes are ignorant of that state.
On the other hand, the SSE unit uses different instructions for operations on single and double precision. Which means that the compiler can emit code that is in full control of the precision of the calculation.
So, the x87 unit is the bad guy here. You can try to persuade your compiler to emit SSE instructions even for 32 bit targets (with MSVC, the /arch:SSE2 option asks for this). And certainly when I compiled your code under VS2013 I found that both 32 and 64 bit targets emitted SSE instructions.