
Avoiding AVX-SSE (VEX) Transition Penalties

Our 64-bit application has lots of code (inter alia, in standard libraries) that uses the xmm0-xmm7 registers in SSE mode.

I would like to implement a fast memory copy using ymm registers. I cannot modify all the code that uses xmm registers to add the VEX prefix, and I also think this is not practical, since it would increase the size of the code and could make it run slower because the CPU has to decode larger instructions.

I just want to use two ymm registers (and possibly zmm - the affordable processors supporting zmm are promised to be available this year) for fast memory copy.

The question is: how can I use the ymm registers but avoid the transition penalties?

Will the penalty occur if I use only the ymm8-ymm15 registers (not ymm0-ymm7)? SSE originally had eight 128-bit registers (xmm0-xmm7), but in 64-bit mode xmm8-xmm15 are also available to non-VEX-prefixed instructions. However, I have reviewed our 64-bit application and it only uses xmm0-xmm7, since it also has a 32-bit version with almost the same code. Does the penalty only occur when the CPU actually tries to use an xmm register that was previously used as ymm and has any of its upper 128 bits non-zero?

Isn't it better to just zeroize the ymm registers that I have used after the fast memory copy? For example, if I have used a ymm register once to copy 32 bytes of memory, what is the fastest way to zeroize it? Is "vpxor ymm15, ymm15, ymm15" fast enough? (AFAIK, vpxor can be executed on any of the 3 ALU execution ports, p0/p1/p5, while vxorpd can only be executed on p5.) Wouldn't the time spent zeroizing it outweigh the gain of using it to copy just 32 bytes of memory?

Maxim Masiutin asked May 09 '17 21:05




2 Answers

The optimal solution is probably to recompile all the code with VEX prefixes. The VEX coded instructions are mostly the same size as the non-VEX versions of the same instructions because the non-VEX instructions carry a legacy of a lot of prefixes and escape codes (due to a long history of short-sighted patches in the instruction coding scheme). The VEX prefix combines all the old prefixes and escape codes into a single prefix of two or three bytes (four bytes for AVX512).

A VEX/non-VEX transition works in different ways on different processors (see "Why is this SSE code 6 times slower without VZEROUPPER on Skylake?"):

Older Intel processors: The VZEROUPPER instruction is needed for a clean transition between different internal states in the processor.

Intel Skylake or later processors: VZEROUPPER is needed to avoid a false dependence of a non-VEX instruction on the upper part of the register.

Current AMD processors: A 256-bit register is treated as two 128-bit registers. VZEROUPPER is not needed, except for compatibility with Intel processors. The cost of VZEROUPPER is approximately 6 clock cycles.

The advantage of using VEX prefixes on all your instructions is that you avoid these transition costs on all processors. Your legacy code can probably benefit from some 256-bit operations here and there in the hot innermost loop.

The disadvantage of VEX prefixes is that the code is incompatible with old processors, so you might need to preserve your old version for running on old processors.
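As a rough sketch of keeping both versions in one binary, assuming GCC or Clang on x86: the `target("avx")` function attribute asks the compiler to build one copy of a routine for AVX (so its vector instructions are VEX-coded), while the baseline copy stays usable on old processors, and `__builtin_cpu_supports` picks between them at run time. The names `scale_baseline`, `scale_avx`, `pick_scale`, and `scale` are made up for illustration and do not come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline build: on x86-64 the compiler typically uses legacy
   (non-VEX) SSE encodings here, so it runs on processors without AVX. */
static void scale_baseline(float *x, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_AVX_BUILD 1   /* hypothetical macro, for this sketch only */

/* Same source, compiled for AVX: the compiler is free to use
   VEX-coded instructions and 256-bit registers in this copy. */
__attribute__((target("avx")))
static void scale_avx(float *x, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}
#endif

typedef void (*scale_fn)(float *, size_t, float);

/* Choose the implementation once, based on what the CPU supports. */
static scale_fn pick_scale(void)
{
#ifdef HAVE_AVX_BUILD
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return scale_avx;
#endif
    return scale_baseline;
}

/* Public entry point; the lazy initialisation below is not thread-safe
   as written and is kept simple on purpose. */
void scale(float *x, size_t n, float k)
{
    static scale_fn impl;
    assert(x != NULL || n == 0);
    if (!impl)
        impl = pick_scale();
    impl(x, n, k);
}
```

Whether the AVX copy is actually faster depends on the loop and the processor; the point is only that the legacy build remains available as a fallback.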

A Fog answered Oct 18 '22 21:10


To avoid the penalties on all architectures, you just need to issue vzeroall or vzeroupper after the part of your code that uses VEX-encoded instructions, before returning to the rest of the code that uses non-VEX instructions.

Issuing these instructions is considered good practice for all AVX-using routines anyway, and is cheap, except perhaps on Knights Landing, but I doubt you are using that architecture. Even if you are, the performance characteristics are quite different from the desktop/Xeon family, so you'll probably want a separate compile there anyway.

These are the only instructions that move from the dirty upper state to the clean upper state. You can't simply zero out the specific registers that you've used, as the chip isn't tracking the dirty state on a register-by-register basis.

The cost of these vzero* instructions is a few cycles, so if whatever you are doing in AVX is worth it, it will generally be worth paying this small cost.
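A minimal sketch of that pattern in C, assuming GCC or Clang on x86 and the intrinsics from `<immintrin.h>`: the AVX routine copies 32 bytes at a time through a ymm register and issues VZEROUPPER (via `_mm256_zeroupper()`) as its last step, before control returns to non-VEX code. The names `copy_avx`, `copy_bytes`, and `HAVE_AVX_PATH` are hypothetical, and this is not a tuned memcpy replacement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1   /* hypothetical macro, for this sketch only */

/* Copy n bytes using unaligned 256-bit loads and stores. */
__attribute__((target("avx")))
static void copy_avx(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
    for (; i < n; i++)        /* tail shorter than 32 bytes */
        dst[i] = src[i];
    /* Return the upper halves of all ymm registers to the clean state
       before going back to callers that use non-VEX SSE code. */
    _mm256_zeroupper();
}
#endif

/* Entry point: uses the AVX path only when the CPU reports AVX support,
   otherwise falls back to the library memcpy. Regions must not overlap. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#ifdef HAVE_AVX_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) {
        copy_avx(dst, src, n);
        return dst;
    }
#endif
    return memcpy(dst, src, n);
}
```

Compilers such as GCC typically insert VZEROUPPER on their own at the end of functions built for AVX, so the explicit call matters most in hand-written assembly; it is spelled out here to make the transition point visible.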

BeeOnRope answered Oct 18 '22 22:10