 

Why do SSE instructions preserve the upper 128-bit of the YMM registers?

It seems to be a recurring problem that many Intel processors (up until Skylake, unless I'm wrong) exhibit poor performance when mixing AVX-256 instructions with SSE instructions.

According to Intel's documentation, this is caused by SSE instructions being defined to preserve the upper 128 bits of the YMM registers. To save power by not using the upper 128 bits of the AVX datapaths, the CPU stores those bits away when executing SSE code and reloads them when entering AVX code; these stores and loads are expensive.
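To illustrate the mechanism (an assembly sketch, not part of the original question): on these pre-Skylake cores the transition penalty can be avoided by executing `vzeroupper` before leaving AVX code, so the upper halves are known to be zero and the CPU has no state to save:

```asm
; AVX code leaves the upper halves of the YMM registers "dirty"
vaddps  ymm0, ymm1, ymm2    ; 256-bit AVX operation

vzeroupper                  ; zero bits 255:128 of all YMM registers;
                            ; the CPU now has no upper state to save

; legacy SSE code can now run without the save/restore penalty
addps   xmm3, xmm4          ; non-VEX-encoded SSE instruction
```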

However, I can find no obvious reason or explanation why SSE instructions needed to preserve those upper 128 bits. The corresponding 128-bit VEX instructions (the use of which avoids the performance penalty) work by always clearing the upper 128 bits of the YMM registers instead of preserving them. It seems to me that, when Intel defined the AVX architecture, including the extension of the XMM registers to YMM registers, they could have simply defined that the SSE instructions, too, would clear the upper 128 bits. Obviously, since the YMM registers were new, there could have been no legacy code that would have depended on SSE instructions preserving those bits, and it also appears to me that Intel could have easily seen this coming.
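For concreteness, a minimal assembly sketch of the two encodings' behavior (assuming `ymm0` holds live 256-bit data):

```asm
; Legacy SSE encoding: writes xmm0, PRESERVES ymm0[255:128]
addps   xmm0, xmm1

; VEX-128 encoding of the same operation: writes xmm0,
; ZEROES ymm0[255:128]
vaddps  xmm0, xmm0, xmm1
```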

So, what is the reason why Intel defined the SSE instructions to preserve the upper 128 bits of the YMM registers? Is it ever useful?

Asked Jan 24 '17 by Dolda2000


1 Answer

To keep the relevant material in the answer itself, I've extracted the key paragraphs from the link Michael provided in the comments; all credit goes to him. The link points to a very similar question Agner Fog asked on Intel's forum.

[Fog, in response to Intel's answer] If I understand you right, you decided that it is necessary to have two versions of all 128-bit instructions in order to avoid destroying the upper part of the YMM registers in case an interrupt calls a device driver using legacy XMM instructions.

Intel was concerned that, had legacy SSE instructions been defined to zero the upper part of the YMM registers, interrupt service routines (ISRs) in existing drivers would suddenly have clobbered the new YMM state.
Without support for saving the new YMM context, this would have made the use of AVX impossible under any circumstances.

However, Fog was not completely satisfied: he pointed out that simply recompiling a driver with an AVX-aware compiler (so that VEX instructions were used) would cause exactly the same problem.

Intel replied that their goal was to avoid forcing existing software to be rewritten.

There is no way we could compel the industry to rewrite/fix all of their existing drivers (for example to use XSAVE) and no way to guarantee they would have done so successfully. Consider for example the pain the industry is still going through on the transition from 32 to 64-bit operating systems! The feedback we have from OS vendors also precluded adding overhead to the ISR servicing to add the state management overhead on every interrupt. We didn't want to inflict either of these costs on portions of the industry that don't even typically use wide vectors.

By having two versions of the instructions, support for AVX in drivers can be achieved like it has been for FPU/SSE:

The example given is similar to the current scenario where a ring-0 driver (ISR) vendor attempts to use floating-point state, or accidentally links it in some library, in OSs that do not automatically manage that context at Ring-0. This is a well known source of bugs and I can suggest only the following:

  • On those OSs, driver developers are discouraged from using floating-point or AVX

  • Driver developers should be encouraged to disable hardware features during driver validation (i.e. AVX state can be disabled by drivers in Ring-0 through XSETBV())
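As a sketch of that second suggestion (ring-0 only; the XCR0 bit layout is per the Intel SDM, but the surrounding driver scaffolding is omitted):

```asm
; Ring-0 only: clear the AVX state bit in XCR0 so any accidental
; use of AVX instructions in the driver raises #UD during validation
xor     ecx, ecx            ; ECX = 0 selects register XCR0
xgetbv                      ; EDX:EAX = current XCR0
and     eax, ~(1 << 2)      ; clear XCR0 bit 2 (AVX / YMM state)
xsetbv                      ; write XCR0 back
```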

Answered Sep 18 '22 by Margaret Bloom