 

Why do SSE instructions preserve the upper 128-bit of the YMM registers?

It seems to be a recurring problem that many Intel processors (up until Skylake, unless I'm wrong) exhibit poor performance when mixing AVX-256 instructions with SSE instructions.

According to Intel's documentation, this is caused by SSE instructions being defined to preserve the upper 128 bits of the YMM registers. To save power by not using the upper 128 bits of the AVX datapaths, the CPU stores those bits away when executing SSE code and reloads them when entering AVX code; these stores and loads are expensive.
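To illustrate the mechanism (an assembly sketch, not part of the original question): on these pre-Skylake cores the transition penalty can be avoided by executing `vzeroupper` before leaving AVX code, so the upper halves are known to be zero and the CPU has no state to save:

```asm
; AVX code leaves the upper halves of the YMM registers "dirty"
vaddps  ymm0, ymm1, ymm2    ; 256-bit AVX operation

vzeroupper                  ; zero bits 255:128 of all YMM registers;
                            ; the CPU now has no upper state to save

; legacy SSE code can now run without the save/restore penalty
addps   xmm3, xmm4          ; non-VEX-encoded SSE instruction
```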

However, I can find no obvious reason or explanation why SSE instructions needed to preserve those upper 128 bits. The corresponding 128-bit VEX instructions (the use of which avoids the performance penalty) work by always clearing the upper 128 bits of the YMM registers instead of preserving them. It seems to me that, when Intel defined the AVX architecture, including the extension of the XMM registers to YMM registers, they could have simply defined that the SSE instructions, too, would clear the upper 128 bits. Obviously, since the YMM registers were new, there could have been no legacy code that would have depended on SSE instructions preserving those bits, and it also appears to me that Intel could have easily seen this coming.
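For concreteness, a minimal assembly sketch of the two encodings' behavior (assuming `ymm0` holds live 256-bit data):

```asm
; Legacy SSE encoding: writes xmm0, PRESERVES ymm0[255:128]
addps   xmm0, xmm1

; VEX-128 encoding of the same operation: writes xmm0,
; ZEROES ymm0[255:128]
vaddps  xmm0, xmm0, xmm1
```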

So, what is the reason why Intel defined the SSE instructions to preserve the upper 128 bits of the YMM registers? Is it ever useful?

Asked Jan 24 '17 by Dolda2000


1 Answer

To keep the relevant material in the answer itself, I've extracted the key paragraphs from the link Michael provided in the comments; all credit goes to him. The link points to a very similar question Agner Fog asked on Intel's forum.

[Fog, in response to Intel's answer] If I understand you right, you decided that it is necessary to have two versions of all 128-bit instructions in order to avoid destroying the upper part of the YMM registers in case an interrupt calls a device driver using legacy XMM instructions.

Intel was concerned that, had legacy SSE instructions been defined to zero the upper part of the YMM registers, interrupt service routines (ISRs) in existing drivers would suddenly have clobbered the new YMM state.
Without support for saving the new YMM context, this would have made the use of AVX impossible under any circumstances.

However, Fog was not completely satisfied: he pointed out that simply recompiling a driver with an AVX-aware compiler (so that VEX instructions were used) would cause exactly the same problem.

Intel replied that their goal was to avoid forcing existing software to be rewritten.

There is no way we could compel the industry to rewrite/fix all of their existing drivers (for example to use XSAVE) and no way to guarantee they would have done so successfully. Consider for example the pain the industry is still going through on the transition from 32 to 64-bit operating systems! The feedback we have from OS vendors also precluded adding overhead to the ISR servicing to add the state management overhead on every interrupt. We didn't want to inflict either of these costs on portions of the industry that don't even typically use wide vectors.

By having two versions of the instructions, support for AVX in drivers can be achieved like it has been for FPU/SSE:

The example given is similar to the current scenario where a ring-0 driver (ISR) vendor attempts to use floating-point state, or accidentally links it in some library, in OSs that do not automatically manage that context at Ring-0. This is a well known source of bugs and I can suggest only the following:

  • On those OSs, driver developers are discouraged from using floating-point or AVX

  • Driver developers should be encouraged to disable hardware features during driver validation (i.e. AVX state can be disabled by drivers in Ring-0 through XSETBV())
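As a sketch of that second suggestion (ring-0 only; the XCR0 bit layout is per the Intel SDM, but the surrounding driver scaffolding is omitted):

```asm
; Ring-0 only: clear the AVX state bit in XCR0 so any accidental
; use of AVX instructions in the driver raises #UD during validation
xor     ecx, ecx            ; ECX = 0 selects register XCR0
xgetbv                      ; EDX:EAX = current XCR0
and     eax, ~(1 << 2)      ; clear XCR0 bit 2 (AVX / YMM state)
xsetbv                      ; write XCR0 back
```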

Answered Sep 18 '22 by Margaret Bloom