SSE and AVX have 16 registers each. On SSE they are referenced as XMM0-XMM15, and on AVX they are called YMM0-YMM15. XMM registers are 128 bits long, whereas YMM are 256bit. SSE adds three typedefs: __m128 , __m128d and __m128i .
The only difference between AVX and AVX2 for floating point code is availability of new FMA instruction – both AVX and AVX2 have 256-bit FP registers. The main advantage of new ISA of AVX2 is for integer code/data types – there you can expect up to 2x speedup, but 8% for FP code is good speedup of AVX2 over AVX.
The AVX version should be at least as fast as the SSE version even if the program is memory-bound, but it turns out the AVX version is slower. The code is the core in an image processing program, the SSE version processes the image in ~180 ms, but the AVX version takes about ~200 ms.
Intel AVX-512 can accelerate performance for workloads and use cases such as scientific simulations, financial analytics, artificial intelligence (AI)/deep learning, 3D modeling and analysis, image and audio/video processing, cryptography, and data compression.
I recently updated the tag wikis for SSE, AVX, and x86 (and SSE2, avx2). They cover a lot of this. tl;dr summary: AVX rolls up all the previous SSE versions, and provides 3-operand versions of those instructions. Also 256b versions of most FP (AVX) and int (AVX2) insns.
For summaries of the various SSE versions, see wikipedia, or knm241's more-detailed answer.
We don't really think of that making SSE obsolete. More like, think of AVX as a new and better version of the same old SSE instructions. They're still in the ref manual under their non-AVX names (PSHUFB
, not VPSHUFB
, for example.) You can mix AVX and SSE code, as long as you use VZEROUPPER
when needed to avoid the performance problem from mixing VEX with non-VEX insns (on Intel). So there is some annoyance to dealing with cases where you have to call into libraries that might run non-VEX SSE instructions, or where your code uses SSE FP math, but also has some AVX code to be run only if the CPU supports it.
If CPU-compatibility was a non-issue, the legacy-SSE versions of vector instructions would be truly obsolete, like MMX is now. AVX/AVX2 is at least slightly better in every way, if you count the VEX-encoded 128b version an insn as AVX, not SSE. Sometimes you'd still use 128b registers because your data only comes in chunks that big, but more often working with 256b registers to do the same op on twice as much data at once.
SSE/AVX/x87-FP/integer instructions all use the same execution ports. You can't get more done in parallel by mixing them. (except on Haswell, where one of the 4 ALU ports can only handle non-vector insns, like GP reg ops and branches).
They are complementary.
Each new instruction set extension add new instructions and eventually a new programming model (new registers for example).
None are deprecated, deprecating instructions is almost impossible to do for compatibility reasons. However some optional extensions may be absent or removed from newer models (like the FMA4 of AMD) if not very wide spread.
Some are vestigial though, everything that can be done with FPU and MMX for example can be done more efficiently with SSE+.
They are not mutually exclusive in the sense that you can use one or another, after all they are instructions not modes of operation (like real vs protected mode for example).
The only possible "conflict" is between MMX and FPU as they share the lower part of the same set of register but have different programming model.
The new vector registers have grown from 128 bit to 256 bit and to 512 bit, each time the previous registers have become the low part of the newer ones.
You can use all them together, they offer specific hardware support implementing simple operations.
They are like Lego bricks, you are only limited by your imagination (or the imagination of the designers).
Here a simple list of this instruction set extensions.
Only some features are listed, for the complete reference see Intel Manual Vol1 from chapter 9 to 14.
See also https://hjlebbink.github.io/x86doc/ for a table of contents of Intel's volume 2 (instruction set reference) manual, with a list of extensions that added instructions to that manual entry.
MMX
Introduce eight 64 bit registers (MM0-MM7) and instructions to work with eight signed/unsigned bytes, four signed/unsigned words, two signed/unsigned dwords.
3DNow!
Add support for single precision floating point operand to MMX. Few operation supported, for example addition, subtraction, multiplication.
SSE
Introduce eight/sixteen 128 bit registers (XMM0-XMM7/15) and instruction to work with four single precision floating point operands. Add integer operations on MMX registers too. (The MMX-integer part of SSE is sometimes called MMXEXT, and was implemented on a few non-Intel CPUs without xmm registers and the floating point part of SSE.)
SSE2
Introduces instruction to work with 2 double precision floating point operands, and with packed byte/word/dword/qword integers in 128-bit xmm registers.
SSE3
Add a few varied instructions (mostly floating point), including a special kind of unaligned load (lddqu
) that was better on Pentium 4, synchronization instruction, horizontal add/sub.
SSSE3
Again a varied set of instructions, mostly integer. The first shuffle that takes its control operand from a register instead of hard-coded (pshufb
). More horizontal processing, shuffle, packing/unpacking, mul+add on bytes, and some specialized integer add/mul stuff.
SSE4 (SSE4.1, SSE4.2)
Add a lot of instructions: Filling in a lot of the gaps by providing min and max and other operations for all integer data types (especially 32-bit integer had been lacking), where previously integer min was only available for unsigned bytes and signed 16-bit. Also scaling, FP rounding, blending, linear algebra operation, text processing, comparisons. Also a non temporal load for reading video memory, or copying it back to main memory. (Previously only NT stores were available.)
AESNI
Add support for accelerating AES symmetric encryption/decryption.
AVX
Add eight/sixteen 256 bit registers (YMM0-YMM7/15).
Support all previous floating point datatype. Three operand instructions.
FMA
Add Fused Multiply Add and correlated instructions.
AVX2
Add support for integer data types.
AVX512F
Add eight/thirty-two 512 bit registers (ZMM0-ZMM7/31) and eight 64-bit mask register (k0-k7). Promote most previous instruction to 512 bit wide. Optional parts of AVX512 add instruction for exponentials & reciprocals (AVX512ER), scatter/gather prefetching (AVX512PF), scatter conflict detection (AVX512CD), compress, expand.
IMCI (Intel Xeon Phi)
Early development of AVX512 for the first-gen Intel Xeon Phi (Knight's Corner) coprocessor.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With