So far I have managed to find out that:
Are there any caveats regarding using SSE3, SSSE3, SSE4.1, SSE 4.2, AVX2 and AVX-512 on Windows?
Some clarification: I need this to determine which OSes my program will run on if I use instructions from one of the SSE/AVX sets.
To make sure that AVX is enabled, do the following: Open Windows PowerShell in Admin mode. At the command line, type: bcdedit /set xsavedisable 0 (do NOT set this value to a number other than zero!). You should get feedback that the operation completed successfully.
Not all CPUs from the listed families support AVX. Generally, CPUs with the commercial denomination Core i3/i5/i7/i9 support them, whereas Pentium and Celeron CPUs before Tiger Lake do not. Issues regarding compatibility between future Intel and AMD processors are discussed under XOP instruction set.
SSE & AVX registers: SSE and AVX have 16 registers each in 64-bit mode (8 in 32-bit mode). On SSE they are referenced as XMM0-XMM15, and on AVX they are called YMM0-YMM15. XMM registers are 128 bits long, whereas YMM registers are 256 bits. SSE adds three typedefs: __m128, __m128d and __m128i.
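As a rough illustration of the three typedefs mentioned above, here is a minimal sketch that adds two small arrays lane by lane. It assumes an x86 target with SSE2 available (the default on x86-64) and a compiler that ships the Intel intrinsics headers. In those headers, __m128 comes with SSE itself, while __m128d and __m128i arrive with SSE2 in emmintrin.h. The function names are made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE header */
#include <stdint.h>

/* __m128: four packed 32-bit floats in one XMM register. */
void add4_floats(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);           /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* __m128d: two packed 64-bit doubles. */
void add2_doubles(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

/* __m128i: 128 bits of integer data, here treated as four 32-bit ints. */
void add4_ints(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

Each function takes plain pointers and uses unaligned loads and stores, so callers do not need 16-byte-aligned buffers.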
The AVX rate was only 7% faster than SSE while the AVX2 time is 2.4 times faster than SSE and 2.2 times faster than AVX. The poor performance of AVX appears to be the result of having to use SSE integer functions with more shuffling data around between 256 and 128 bit variables.
Extensions that introduce new architectural state require special OS support, because the OS has to save/restore more data on context switches. So from the OS's perspective, there's nothing extra it needs to do to let user-space code run SSSE3 instructions, if the OS supports SSE.
SSE, AVX, and AVX512 are the extensions that introduced new architectural state.
- SSE introduced the xmm regs (and MXCSR for rounding modes and FP exception state).
- AVX introduced ymm (the lower half of which are the old xmm regs).
- AVX-512 introduced zmm (extending the x/ymm regs), and also doubled the number of vector regs in 64-bit mode: zmm0-zmm31. x/y/zmm16..31 are only accessible with AVX-512 encodings of vector instructions (EVEX prefix), and thus interestingly can be used without requiring vzeroupper, and aren't affected by it. The k0..k7 64-bit mask registers (or 16-bit without AVX-512BW in Xeon Phi) are also new in AVX-512.
You check for CPU support for SSE or AVX the usual way, with the CPUID instruction.
To prevent silent data corruption when using a new extension on a multi-tasking OS that doesn't save/restore the new architectural state on context switches, SSE instructions fault as illegal instructions if the OS hasn't set an OS-support bit in a control register. So vector extensions "don't work" on OSes that don't know about saving/restoring the necessary state for that extension.
For SSE, there may not be any clean OS-independent way to detect that the OS has promised to save/restore SSE state on context switches by setting the CR4.OSFXSR, CR4.OSXMMEXCPT, etc. bits, because even reading a control register is privileged, and there's no CPUID bit that reflects the setting. SSE support is so widespread that you'd have to be using a really ancient (or homebrew) OS for this to be a problem.
For AVX, we don't need OS support to detect that AVX is usable (supported by hardware and enabled by the OS): user-space can run xgetbv and check the enabled-feature flags to see if the OS has enabled AVX instructions to run without faulting.
From Intel's intro to AVX:
- Verify that the operating system supports XGETBV using CPUID.1:ECX.OSXSAVE bit 27 = 1.
- At the same time, verify that CPUID.1:ECX bit 28=1 (Intel AVX supported) and/or bit 25=1 (AES supported) ... (and other bits for FMA, AES, and PCLMULQDQ)
- Issue XGETBV, and verify that the feature-enabled mask at bits 1 and 2 are 11b (XMM state and YMM state enabled by the operating system).
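The three steps from Intel's list could be sketched in C roughly as follows. This is a minimal sketch, assuming GCC or Clang on x86 (it relies on their cpuid.h helper and GNU-style inline asm, so it won't build as-is with MSVC). The function names read_xcr0, xcr0_enables_avx, and avx_usable are invented for this example.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

/* Read XCR0 via XGETBV with ECX = 0. Only safe once CPUID reports OSXSAVE. */
static uint64_t read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Step 3: bits 1 (XMM state) and 2 (YMM state) must both be set in XCR0. */
int xcr0_enables_avx(uint64_t xcr0)
{
    return (xcr0 & 0x6u) == 0x6u;
}

/* Returns 1 if AVX is supported by the CPU and enabled by the OS, else 0. */
int avx_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 not available */
    if (!(ecx & (1u << 27)))
        return 0;                 /* step 1: OSXSAVE clear, XGETBV would fault */
    if (!(ecx & (1u << 28)))
        return 0;                 /* step 2: CPU does not report AVX */
    return xcr0_enables_avx(read_xcr0());
#else
    return 0;                     /* not an x86 build */
#endif
}
```

A program would typically call avx_usable() once at startup and pick an AVX or SSE code path based on the result. The OSXSAVE check has to come before XGETBV, since XGETBV itself faults when the OS has not enabled it.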
It may be easier to call an OS-provided function to detect OS support, instead of using inline asm or a feature-detect library to do all this. For example, Win7SP1 introduced GetEnabledXStateFeatures along with support for AVX CPUs. (It's unlikely or maybe impossible to find Win7SP1 running on a CPU without SSE, so for SSE you can just check CPUID and OS version.)
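On Windows, the GetEnabledXStateFeatures route might look something like the sketch below. It assumes the function is looked up at run time with GetProcAddress, so the binary still loads on systems older than Win7SP1. The helper names and the XSTATE_BIT_AVX constant are made up here; the constant is defined locally rather than relying on the SDK's XSTATE_MASK_AVX, and assumes AVX state is bit 2 of the returned mask, the same position as in XCR0. On non-Windows builds the function just reports that it cannot tell.

```c
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
/* Signature of GetEnabledXStateFeatures, resolved dynamically below. */
typedef DWORD64 (WINAPI *get_xstate_fn)(void);
#endif

/* Hypothetical local constant: AVX (YMM) state is bit 2 of the feature mask. */
#define XSTATE_BIT_AVX 2

/* Returns 1 if the AVX bit is set in an enabled-features mask, else 0. */
int xstate_mask_has_avx(uint64_t mask)
{
    return (int)((mask >> XSTATE_BIT_AVX) & 1u);
}

/* 1 = OS reports AVX state enabled, 0 = not enabled, -1 = cannot tell. */
int os_reports_avx(void)
{
#ifdef _WIN32
    HMODULE k32 = GetModuleHandleA("kernel32.dll");
    if (k32 == NULL)
        return -1;

    get_xstate_fn fn =
        (get_xstate_fn)GetProcAddress(k32, "GetEnabledXStateFeatures");
    if (fn == NULL)
        return -1;    /* API missing (older than Win7SP1): use another check */

    return xstate_mask_has_avx((uint64_t)fn());
#else
    return -1;        /* not Windows: this API does not exist here */
#endif
}
```

A result of -1 only means this particular API gave no answer; the CPUID and XGETBV check remains the OS-independent fallback.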
This is also understood to be a promise that the OS's context switches will correctly save/restore the full state, although of course a buggy, malicious, or esoteric OS (perhaps cooperative multi-tasking?) could be different. For mainstream OSes including Windows, it does mean YMM registers will keep their values just like you'd expect.
The same is true for AVX-512: you can check the CPUID feature bit for the instruction set, and check that the OS has promised to manage the new architectural state on context switches by enabling the right bits in XCR0 with XSETBV (so you should check with XGETBV). Check that the XGETBV result ANDed with 0xE6 equals 0xE6, i.e. that bits 1, 2, 5, 6, and 7 (XMM, YMM, opmask, upper halves of zmm0-15, and zmm16-31 state) are all set.
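The AVX-512 version of the check could be sketched like this, again assuming GCC or Clang on x86 with their cpuid.h helpers. It tests only the AVX-512 Foundation bit (CPUID leaf 7, subleaf 0, EBX bit 16); other AVX-512 subsets have their own CPUID bits that you would test the same way. The names xgetbv0, xcr0_enables_avx512, and avx512f_usable are invented for this example.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* __get_cpuid and __get_cpuid_count */

/* Read XCR0 via XGETBV with ECX = 0. Only safe once CPUID reports OSXSAVE. */
static uint64_t xgetbv0(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* XMM (1), YMM (2), opmask (5), ZMM_Hi256 (6), Hi16_ZMM (7) => 0xE6. */
#define XCR0_AVX512_MASK 0xE6u

/* Returns 1 if the OS has enabled all state components AVX-512 needs. */
int xcr0_enables_avx512(uint64_t xcr0)
{
    return (xcr0 & XCR0_AVX512_MASK) == XCR0_AVX512_MASK;
}

/* Returns 1 if AVX-512F is supported by the CPU and enabled by the OS. */
int avx512f_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 not available */
    if (!(ecx & (1u << 27)))
        return 0;                 /* OSXSAVE clear, XGETBV would fault */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 7 not available */
    if (!(ebx & (1u << 16)))
        return 0;                 /* CPU does not report AVX512F */
    return xcr0_enables_avx512(xgetbv0());
#else
    return 0;                     /* not an x86 build */
#endif
}
```

Because 0xE6 includes bits 1 and 2, a machine that passes this check has also had its XMM and YMM state enabled by the OS.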