I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin
.
First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns
has been set; if so, calls __sin_pentium4
, otherwise calls __sin_default
.
__sin_pentium4
(in "sin_pentium4.asm") starts by transferring the argument from the x87 fpu to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the fpu.
__sin_default
(in "sin.asm") keeps the variable on the x87 stack and simply calls fsin
.
So in both cases, the operand is pushed on the x87 stack and returned on it as well, making it transparent to the caller, but if ___use_sse2_mathfcns
is defined, the operation is actually performed in SSE2 rather than x87.
This behavior is very interesting to me because the x87 transcendental functions are notorious for having slightly different behaviors depending on the implementation, whereas a given piece of SSE2 code should always give reproducible results.
Is there a way to determine for certain, either at compile or run-time, that the SSE2 code path will be used? I am not proficient writing assembly, so if this involves writing any assembly, a code example would be appreciated.
I found the answer through careful investigation of math.h. This is controlled by a method called _set_SSE2_enable
. This is a public symbol documented here:
Enables or disables the use of Streaming SIMD Extensions 2 (SSE2) instructions in CRT math routines. (This function is not available on x64 architectures because SSE2 is enabled by default.)
This causes the aforementionned ___use_sse2_mathfcns flag to be set to the provided value, effectively enabling or disabling use of the _pentium4 SSE2 routines.
The documentation mentions this affects only certain transcendental functions, but looking at the disassembly, this seems to affect everyone of them.
Edit: stepping into every function reveals that they're all available in SSE2 except for the following:
Sqrt is the biggest offender, but it's trivial to implement in SSE2 using intrinsics. For the others, there's no simple solution except perhaps using a third-party library, but I can probably do without.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With