This is a question of curiosity more than anything else. I was looking at this code disassembly (C#, 64 bit, Release mode, VS 2012 RC):
double a = 10d * Math.Log(20d, 2d);
000000c8 movsd xmm1,mmword ptr [00000138h]
000000d0 movsd xmm0,mmword ptr [00000140h]
000000d8 call 000000005EDC7F50
000000dd movsd mmword ptr [rsp+58h],xmm0
000000e3 movsd xmm0,mmword ptr [rsp+58h]
000000e9 mulsd xmm0,mmword ptr [00000148h]
000000f1 movsd mmword ptr [rsp+30h],xmm0
a = Math.Pow(a, 6d);
000000f7 movsd xmm1,mmword ptr [00000150h]
000000ff movsd xmm0,mmword ptr [rsp+30h]
00000105 call 000000005F758220
0000010a movsd mmword ptr [rsp+60h],xmm0
00000110 movsd xmm0,mmword ptr [rsp+60h]
00000116 movsd mmword ptr [rsp+30h],xmm0
... and found it odd that the compiler isn't using x87 instructions for the logarithms here (Pow uses logarithms internally). Of course, I have no idea what code sits at the call targets, but I know that SSE has no log instruction, which makes this choice all the more odd. Further, nothing is parallelized here, so why SIMD and not simple x87?
On a lesser note, I also found it odd that the x87 FYL2X instruction isn't being used, since it is designed specifically for the case shown in the first line of code.
Can anyone shed any light on this?
There are two separate points here: first, why the compiler uses SSE registers rather than the x87 floating point stack to pass function arguments, and second, why it doesn't just use the single x87 instruction that can compute a logarithm.
Not using the logarithm instruction is the easier of the two to explain. The x87 logarithm instruction is defined to compute its result to 80-bit extended precision, whereas you are working with double, which is only 64 bits wide. Computing a logarithm to 64 bits rather than 80 bits of precision is much faster, and the speed increase more than makes up for having to do the work in software rather than in silicon.
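To make that concrete, here is a minimal C# sketch (an illustration, not the actual CLR implementation) of the kind of software routine the call above could jump into: an arbitrary-base logarithm computed entirely in 64-bit double precision via the change-of-base identity, with no 80-bit x87 stack involved at any point.

using System;

class LogSketch
{
    // log2(x) = ln(x) / ln(2); every intermediate value is a 64-bit double.
    static double Log2(double x)
    {
        return Math.Log(x) / Math.Log(2.0);
    }

    static void Main()
    {
        // Matches 10d * Math.Log(20d, 2d) from the question.
        Console.WriteLine(10d * Log2(20d));
    }
}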
The use of SSE registers is harder to explain in a way that's satisfactory. The simple answer is that the x64 calling convention requires the first four floating point arguments to a function to be passed in xmm0 through xmm3.
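As an illustration (the method and its name are made up; the register assignments are what the Microsoft x64 convention specifies for a static method, not something captured from a disassembly), each of the following arguments travels in its own xmm register, which is exactly why the listing above loads xmm0 and xmm1 before the call:

using System;

class CallConvSketch
{
    // Under the Microsoft x64 calling convention for a static method:
    //   a -> xmm0, b -> xmm1, c -> xmm2, d -> xmm3
    static double Combine(double a, double b, double c, double d)
    {
        return a * b + c * d;
    }

    static void Main()
    {
        Console.WriteLine(Combine(1.5, 2.0, 3.0, 4.0));
    }
}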
The next question, of course, is why the calling convention tells you to do this rather than use the floating point stack. The answer is that native x64 code rarely uses the x87 FPU at all, relying on SSE instead. This is because multiplication and division are faster in SSE (the 80-bit versus 64-bit issue again), and because SSE registers are easier to manipulate: on the FPU you can only access the top of the stack, and rotating the FPU stack is often among the slowest operations on a modern processor (some even have an extra pipeline stage solely for this purpose).
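To see why the flat register file matters, compare how the same scalar multiply looks on each unit (a rough sketch, not output captured from a real JIT dump): the x87 version has to push onto and pop off its one-dimensional stack, while the SSE version names its destination register directly.

using System;

class MulSketch
{
    // x87 (roughly):                      SSE (roughly):
    //   fld   qword ptr [a]                 movsd xmm0, qword ptr [a]
    //   fmul  qword ptr [b]                 mulsd xmm0, qword ptr [b]
    //   fstp  qword ptr [result]            movsd qword ptr [result], xmm0
    // Only st(0), the top of the stack, is directly usable on the x87 side;
    // any of xmm0..xmm15 can be the destination on the SSE side.
    static double Multiply(double a, double b)
    {
        return a * b;
    }

    static void Main()
    {
        Console.WriteLine(Multiply(3.0, 7.0));
    }
}

In short, the JIT is simply following the platform's calling convention, and that convention was built around SSE because SSE is faster for ordinary double arithmetic.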