I am writing a feed forward net in VC++ using AVX intrinsics. I am invoking this code via PInvoke in C#. My performance when calling a function that calculates a large loop including the function exp() is ~1000ms for a loopsize of 160M. As soon as I call any function that uses AVX intrinsics, and then subsequently use exp(), my performance drops to about ~8000ms for the same operation. Note that the function calculating the exp() is standard C, and the call that uses the AVX intrinsics can be completely unrelated in terms of data being processed. Some kind of flag is getting tripped somewhere at runtime.
In other words,
A(); // 1000ms calculates 160M exp() 
B(); // completely unrelated but contains AVX
A(); // 8000ms
or, curiously,
C(); // contains 128 bit SSE SIMD expressions
A(); // 1000ms
I am lost as to what possible mechanism is going on here, or how to pursue a sol'n. I'm on an Intel 2500K cpu\Win 7. Express versions of VS.
Thanks.
If you use any AVX256 instruction, the "AVX upper state" becomes "dirty", which results in a large stall if you subsequently use SSE instructions (including scalar floating-point performed in the xmm registers). This is documented in the Intel Optimization Manual, which you can download for free (and is a must-read if you're doing this sort of work):
AVX instruction always modifies the upper bits of YMM registers and SSE instructions do not modify the upper bits. From a hardware perspective, the upper bits of the YMM register collection can be considered to be in one of three states:
• Clean: All upper bits of YMM are zero. This is the state when processor starts from RESET.
• Modified and saved to XSAVE region The content of the upper bits of YMM registers matches saved data in XSAVE region. This happens when after XSAVE/XRSTOR executes.
• Modified and Unsaved: The execution of one AVX instruction (either 256-bit or 128-bit) modifies the upper bits of the destination YMM.
The AVX/SSE transition penalty applies whenever the processor states is “Modified and Unsaved“. Using VZEROUPPER move the processor states to “Clean“ and avoid the transition penalty.
Your routine B( ) dirties the YMM state, so the SSE code in A( ) stalls.  Insert a VZEROUPPER instruction between B and A to avoid the problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With