I have the following code and am expecting the intrinsic version of the exp() function to be used. Unfortunately, it is not used in an x64 build, making it slower than a comparable Win32 (i.e., 32-bit) build:
#include "stdafx.h"
#include <cmath>
#include <intrin.h>
#include <iostream>
int main()
{
    const int NUM_ITERATIONS=10000000;
    double expNum=0.00001;
    double result=0.0;
    for (double i=0;i<NUM_ITERATIONS;++i)
    {
        result+=exp(expNum); // <-- The code of interest is here
        expNum+=0.00001;
    }
    // To prevent the above from getting optimized out...
    std::cout << result << '\n';
}
I am using the following switches for my build:
/Zi /nologo /W3 /WX-
/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG"
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm-
/EHsc /GS /Gy /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope
/Yu"StdAfx.h" /Fp"x64\Release\exp.pch" /FAcs /Fa"x64\Release\"
/Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue
As you can see, I do have /Oi, /Ox and /fp:fast, as required per the MSDN article on intrinsics. Yet, despite my efforts, a call to the standard library is made, making exp() slower in the x64 build.
Here is the generated assembly:
for (double i=0;i<NUM_ITERATIONS;++i)
000000013F911030 movsd xmm10,mmword ptr [__real@3ff0000000000000 (13F912248h)]
000000013F911039 movapd xmm8,xmm6
000000013F91103E movapd xmm7,xmm9
000000013F911043 movaps xmmword ptr [rsp+20h],xmm11
000000013F911049 movsd xmm11,mmword ptr [__real@416312d000000000 (13F912240h)]
{
result+=exp(expNum);
000000013F911052 movapd xmm0,xmm7
000000013F911056 call exp (13F911A98h) // ***** exp lib call is here *****
000000013F91105B addsd xmm8,xmm10
expNum+=0.00001;
000000013F911060 addsd xmm7,xmm9
000000013F911065 comisd xmm8,xmm11
000000013F91106A addsd xmm6,xmm0
000000013F91106E jb main+52h (13F911052h)
}
As you can see in the assembly above, there is a call out to the exp() function. Now, let's look at the code generated for the same for loop in a 32-bit build:
for (double i=0;i<NUM_ITERATIONS;++i)
00101031 xorps xmm1,xmm1
00101034 rdtsc
00101036 push ebx
00101037 push esi
00101038 movsd mmword ptr [esp+1Ch],xmm0
0010103E movsd xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]
00101046 push edi
00101047 mov ebx,eax
00101049 mov dword ptr [esp+3Ch],edx
0010104D movsd mmword ptr [esp+28h],xmm0
00101053 movsd mmword ptr [esp+30h],xmm1
00101059 lea esp,[esp]
{
result+=exp(expNum);
00101060 call __libm_sse2_exp (101EC0h) // <--- Quite different from 64-bit
00101065 addsd xmm0,mmword ptr [esp+20h]
0010106B movsd xmm1,mmword ptr [esp+30h]
00101071 addsd xmm1,mmword ptr [__real@3ff0000000000000 (102180h)]
00101079 movsd xmm2,mmword ptr [__real@416312d000000000 (102178h)]
00101081 comisd xmm2,xmm1
00101085 movsd mmword ptr [esp+20h],xmm0
expNum+=0.00001;
0010108B movsd xmm0,mmword ptr [esp+28h]
00101091 addsd xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]
00101099 movsd mmword ptr [esp+28h],xmm0
0010109F movsd mmword ptr [esp+30h],xmm1
001010A5 ja wmain+40h (101060h)
}
Much more code there, yet it's faster. A timing test I did on a 3.3 GHz Nehalem-EP host produced the following results:
32-bit:
For loop body average exec time: 34.849229 cycles / 10.560373 ns
64-bit:
For loop body average exec time: 45.845323 cycles / 13.892522 ns
Very odd behavior, indeed. Why is it happening?
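For reference, a rough way to reproduce the comparison without the rdtsc plumbing is a std::chrono harness like the sketch below; it needs a C++11-capable compiler and measures wall-clock time rather than cycles, so the absolute numbers won't match the ones above:

#include <chrono>
#include <cmath>
#include <iostream>

int main()
{
    const int NUM_ITERATIONS=10000000;
    double expNum=0.00001;
    double result=0.0;

    const auto start = std::chrono::steady_clock::now();
    for (double i=0;i<NUM_ITERATIONS;++i)
    {
        result+=std::exp(expNum);
        expNum+=0.00001;
    }
    const auto stop = std::chrono::steady_clock::now();

    std::cout << result << '\n';   // keep the loop from being optimized away
    std::cout << "average per iteration: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count()
                     / static_cast<double>(NUM_ITERATIONS)
              << " ns\n";
}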
Update:
I have created a Microsoft Connect bug report. Feel free to upvote it to get an authoritative answer from Microsoft itself on the use of floating-point intrinsics, especially in x64 code.
On x64, floating-point arithmetic is performed using SSE, which has no built-in operation for exp(), so a call to the standard library is inevitable unless you write your own inline, manually-vectorized __m128d exp(__m128d) (see Fastest Implementation of Exponential Function Using SSE).
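In case it's useful, here is a minimal sketch of what such a hand-rolled version might look like. exp_pd_sketch is a made-up name, the range reduction plus fifth-order polynomial is one common textbook approach rather than what the CRT actually does, and it ignores overflow, underflow and NaN, so treat it purely as an illustration:

#include <emmintrin.h>   // SSE2 intrinsics
#include <cmath>
#include <cstdio>

// Hypothetical sketch of a hand-vectorized exp() for two doubles at once.
// Range reduction: exp(x) = 2^n * exp(r), n = round(x/ln2), r = x - n*ln2,
// a short Taylor polynomial for exp(r), and 2^n built by writing (n + 1023)
// into the exponent field of an IEEE-754 double. Accuracy is only a few
// parts in a million and there is no special-case handling.
static inline __m128d exp_pd_sketch(__m128d x)
{
    const __m128d log2e = _mm_set1_pd(1.4426950408889634074);
    const __m128d ln2   = _mm_set1_pd(0.69314718055994530942);

    // n = round(x / ln2); _mm_cvtpd_epi32 rounds to nearest by default
    __m128i n_int = _mm_cvtpd_epi32(_mm_mul_pd(x, log2e));
    __m128d n     = _mm_cvtepi32_pd(n_int);

    // r = x - n*ln2, so |r| <= ln2/2
    __m128d r = _mm_sub_pd(x, _mm_mul_pd(n, ln2));

    // exp(r) ~= 1 + r + r^2/2! + ... + r^5/5!   (Horner form)
    __m128d p = _mm_set1_pd(1.0 / 120.0);
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0 / 24.0));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0 / 6.0));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(0.5));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0));

    // Build 2^n: place (n + 1023) << 52 into each 64-bit lane, reinterpret as double
    __m128i biased = _mm_slli_epi32(_mm_add_epi32(n_int, _mm_set1_epi32(1023)), 20);
    __m128i two_n  = _mm_unpacklo_epi32(_mm_setzero_si128(), biased);
    return _mm_mul_pd(p, _mm_castsi128_pd(two_n));
}

int main()
{
    __m128d in  = _mm_set_pd(1.0, 0.5);   // lanes: {0.5, 1.0}
    __m128d out = exp_pd_sketch(in);

    double v[2];
    _mm_storeu_pd(v, out);
    std::printf("sketch:   %.9f %.9f\n", v[0], v[1]);
    std::printf("std::exp: %.9f %.9f\n", std::exp(0.5), std::exp(1.0));
}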
I imagine that the MSDN article you are referring to was written with 32-bit code that uses x87 (8087-style) floating point in mind.
I think the only reason that Microsoft provides an intrinsic version of 32-bit SSE2 exp() is the standard calling conventions. The 32-bit calling conventions require the operand to be pushed on the main stack, and the result to be returned in the top register of the FPU stack. If you have SSE2 code generation enabled, then the return value is likely to be popped from the FPU stack into memory, then loaded from that location into an SSE2 register for whatever maths you want to do on the result. Clearly, it is faster to pass the operand in an SSE2 register and return the result in an SSE2 register. This is what __libm_sse2_exp() does. In 64-bit code, the standard calling convention passes the operand and returns the result in SSE2 registers anyway, so there is no advantage in having an intrinsic version.
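One quick way to see the two calling sequences for yourself is to compile a trivial wrapper with assembly listings enabled (/FA, or /FAcs as in the build above) for each target and compare the .asm output. call_exp below is just a hypothetical probe function, and the exact listing will vary with compiler version and flags:

// exp_abi_probe.cpp
// Build once as Win32 with /O2 /fp:fast /arch:SSE2 /FAcs and once as x64 with
// /O2 /fp:fast /FAcs, then compare the generated .asm listings.
#include <cmath>

double call_exp(double x)   // hypothetical probe function
{
    // The 32-bit /arch:SSE2 build should show a call to __libm_sse2_exp with
    // the argument and result kept in xmm0; a 32-bit build without /arch:SSE2
    // goes through the x87-returning CRT exp; the x64 build emits a plain
    // "call exp".
    return std::exp(x);
}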
The reason for the performance difference between the 32-bit SSE2 and 64-bit implementations of exp() is that Microsoft uses different algorithms in the two implementations. I've no idea why they do this, and they produce different results (differing by 1 ulp) for some operands.
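If you want to observe those 1-ulp discrepancies yourself, one simple (hypothetical) probe is to dump the raw bit patterns of a handful of exp() results from each build and diff the two outputs; results whose bit patterns differ by 1 are exactly 1 ulp apart:

#include <cmath>
#include <cstdio>
#include <cstring>

int main()
{
    // Build as both Win32 (/arch:SSE2 /fp:fast) and x64, run each, and diff
    // the output files to spot operands where the two exp() results differ.
    for (int i = 1; i <= 50; ++i)
    {
        double x = i * 0.1;
        double y = std::exp(x);
        unsigned long long bits;
        std::memcpy(&bits, &y, sizeof bits);
        std::printf("exp(%.1f) = %.17g  bits=0x%016llx\n", x, y, bits);
    }
}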