Difference in floating-point arithmetic between x86 and x64

Tags: c++, c, floating-point, visual-studio-2010, 64-bit

I stumbled upon a difference in the way floating-point arithmetic is done between MS VS 2010 builds for x86 and x64 (both executed on the same 64 bit machine).

This is a reduced code sample:

float a = 50.0f;
float b = 65.0f;
float c =  1.3f;
float d = a*c;
bool bLarger1 = d<b;
bool bLarger2 = (a*c)<b;

The boolean bLarger1 is always false (d is set to 65.0 in both builds). Variable bLarger2 is false for x64 but true for x86!

I am well aware of floating-point arithmetic and the rounding effects taking place. I also know that 32 bit builds sometimes use different instructions for floating-point operations than 64 bit builds. But in this case I am missing some information.

Why is there a discrepancy between bLarger1 and bLarger2 in the first place? Why is it only present in the 32 bit build?

[Image: Left: x86, Right: x64]


asked Mar 28 '14 10:03

Oliver Zendel

1 Answer

The issue hinges on this expression:

bool bLarger2 = (a*c)<b;

I looked at the code generated under VS2008, not having VS2010 to hand. For 64 bit the code is:

000000013FD51100  movss       xmm1,dword ptr [a] 
000000013FD51106  mulss       xmm1,dword ptr [c] 
000000013FD5110C  movss       xmm0,dword ptr [b] 
000000013FD51112  comiss      xmm0,xmm1

For 32 bit the code is:

00FC14DC  fld         dword ptr [a] 
00FC14DF  fmul        dword ptr [c] 
00FC14E2  fld         dword ptr [b] 
00FC14E5  fcompp

So under 32 bit the calculation is performed in the x87 unit, and under 64 bit it is performed by the SSE unit.

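The difference is that the x87 operations are all performed to higher than single precision. With the default precision control under MSVC, the calculations are performed to double precision (a 53-bit significand), so a*c is not rounded to float before it is compared with b. When the product is stored in d it is rounded to single precision, which is why bLarger1 is unaffected. The SSE operations, on the other hand, are pure single precision calculations.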
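To make the arithmetic concrete, here is a minimal sketch that imitates the two code paths in portable C++. The helper names wide_product, narrow_product and print_products are my own, not from the question, and the double computation only stands in for what the x87 unit holds at 53-bit precision; it is not the x87 code path itself.

```cpp
#include <cstdio>

// Hypothetical helper: the product kept at double precision, standing in
// for the value the x87 unit holds on its register stack (53-bit setting).
double wide_product(float a, float c) {
    return static_cast<double>(a) * static_cast<double>(c);
}

// Hypothetical helper: the same product rounded to single precision,
// which is what ends up in a float variable such as d.
float narrow_product(float a, float c) {
    return static_cast<float>(wide_product(a, c));
}

// Print both values and both comparisons against b.
void print_products(float a, float c, float b) {
    std::printf("wide   = %.17g\n", wide_product(a, c));
    std::printf("narrow = %.17g\n", static_cast<double>(narrow_product(a, c)));
    std::printf("wide < b: %d, narrow < b: %d\n",
                static_cast<int>(wide_product(a, c) < b),
                static_cast<int>(narrow_product(a, c) < b));
}
```

Calling print_products(50.0f, 1.3f, 65.0f) shows the point: 1.3f is slightly below 1.3, so the unrounded product is slightly below 65, while the product rounded to float is exactly 65.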

You can persuade the x87 unit to perform all calculations to single precision accuracy like this:

_controlfp(_PC_24, _MCW_PC);

When you add that to your 32 bit program you will find that the booleans are both set to false.
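A slightly fuller sketch of that call, in case it helps. I have used _controlfp_s (the checked variant of _controlfp) and guarded it so that it only runs in 32 bit MSVC builds, on the assumption that changing the precision control is not supported on x64. The helper names use_single_precision_x87 and product_is_less are made up for illustration.

```cpp
#include <float.h>   // _controlfp_s, _PC_24, _MCW_PC when building with MSVC

// Hypothetical helper: ask the x87 unit to round every result to a
// 24-bit significand. Returns true only if the setting was changed.
bool use_single_precision_x87() {
#if defined(_MSC_VER) && defined(_M_IX86)
    unsigned int current = 0;
    return _controlfp_s(&current, _PC_24, _MCW_PC) == 0;
#else
    // Not a 32 bit MSVC build: nothing to change in this sketch.
    return false;
#endif
}

// The comparison from the question, with the product left unnamed.
bool product_is_less(float a, float c, float b) {
    return (a * c) < b;
}
```

Call use_single_precision_x87() once at startup, before the comparison runs. Bear in mind that the control word is per-thread state, and that it affects double calculations on the x87 unit too.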

There is a fundamental difference in the way that the x87 and SSE floating point units work. The x87 unit uses the same instructions for both single and double precision types. Data is loaded into registers in the x87 FPU stack, and those registers always hold values in the 10 byte (80-bit) Intel extended format. You can control the precision using the floating point control word, but the instructions that the compiler writes are ignorant of that state.

The SSE unit, on the other hand, uses different instructions for operations on single and double precision, which means that the compiler can emit code that is in full control of the precision of the calculation.

So, the x87 unit is the bad guy here. You can try to persuade your compiler to emit SSE instructions even for 32 bit targets (with MSVC, the /arch:SSE2 option does this). And certainly when I compiled your code under VS2013 I found that both 32 and 64 bit targets emitted SSE instructions.
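If you cannot change the compiler options, another option is to do what d already does in the question: round the product to float before comparing. A minimal sketch follows; the helper name product_less_than is hypothetical, and the volatile is there on the assumption that forcing a store to a 32-bit float is the most reliable way to drop the extra precision across compilers.

```cpp
// Hypothetical helper: compare a*c with b only after the product has been
// rounded to single precision, as happens when it is stored in d.
bool product_less_than(float a, float c, float b) {
    // volatile forces the product through a 32-bit float in memory,
    // discarding any extra precision held in an x87 register.
    volatile float product = a * c;
    return product < b;
}
```

With the values from the question, product_less_than(50.0f, 1.3f, 65.0f) should agree with bLarger1 in both builds, at the cost of an extra store and load.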

answered Oct 18 '22 14:10

David Heffernan
