
Why doesn't Intel design its SIMD ISAs in a more compatible or universal way?

Intel has several SIMD ISAs, such as SSE, AVX, AVX2, AVX-512 and IMCI on Xeon Phi. These ISAs are supported on different processors. For example, AVX-512 BW, AVX-512 DQ and AVX-512 VL are only supported on Skylake, but not on Xeon Phi. AVX-512F and AVX-512 CDI are supported on both Skylake and Xeon Phi, while AVX-512 ERI and AVX-512 PFI are supported only on Xeon Phi.

Why doesn't Intel design a more universal SIMD ISA that can run on all of its advanced processors?

Also, Intel removes some intrinsics and adds new ones as it develops new ISAs. Many intrinsics come in several flavours: for example, some work on packed 8-bit values while others work on packed 64-bit values. Some flavours are not widely supported. For example, Xeon Phi will not be able to process packed 8-bit values, whereas Skylake will.

Why does Intel alter its SIMD intrinsics in such an inconsistent way?

If the SIMD ISAs were more compatible with each other, existing AVX code could be ported to AVX-512 with much less effort.

thierry asked Jul 13 '15 09:07



2 Answers

I see the reasons as three-fold.

(1) When they originally designed MMX they had very little area to work with, so they made it as simple as possible. They also designed it to be fully compatible with the existing x86 ISA (precise interrupts + some state saving on context switches). They hadn't anticipated that they would continually enlarge the SIMD register widths and add so many instructions. With every generation that added wider SIMD registers and more sophisticated instructions, they had to maintain the old ISA for compatibility.

(2) The weird thing you're seeing with AVX-512 comes from the fact that they are trying to unify two disparate product lines. Skylake is from Intel's PC/server line, so its path can be seen as MMX -> SSE/2/3/4 -> AVX -> AVX2 -> AVX-512. The Xeon Phi was based on an x86-compatible graphics card called Larrabee that used the LRBni instruction set. This is more or less the same as AVX-512, but with fewer instructions and not officially compatible with MMX/SSE/AVX/etc.

(3) They have different products for different demographics. For example, (as far as I know) the AVX-512 CD instructions won't be available in the regular Skylake processors for PCs, only in the Skylake Xeon processors used for servers and in the Xeon Phi used for HPC. I can understand this to an extent, since the CD extensions are targeted at things like parallel histogram generation; that case is more likely to be a critical hotspot in servers/HPC than in general-purpose PCs.

I do agree it's a bit of a mess. Intel is beginning to see the light and is planning better for additional expansions; AVX-512 is supposedly ready to scale to 1024 bits in a future generation. Unfortunately it's still not really good enough, and Agner Fog discusses this on the Intel Forums.

Personally, I would have liked to see a model that can be upgraded without the user having to recompile their code each time. For example, instead of defining the AVX register as 512 bits in the ISA, the width should be a parameter stored in the microarchitecture and retrievable by the programmer at runtime. The user asks "what is the maximum SIMD width available on this machine?", the architecture returns XYZ, and the user has generic control flow to cope with whatever that XYZ is. This would be much cleaner and more scalable than the current technique, which uses several versions of the same function, one for every possible SIMD version. :-/
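
To make the idea concrete, here is a minimal sketch in plain C of what such width-agnostic code could look like. It is only an illustration: `simd_lanes()` is a hypothetical stand-in for the runtime query described above (no such x86 instruction exists), the value it returns is made up, and the inner scalar loop merely stands in for a single vector instruction.

```c
#include <stddef.h>

/* Hypothetical query: how many float lanes one SIMD register holds.
   Assumption for illustration only; 16 would correspond to 512 bits. */
static size_t simd_lanes(void)
{
    return 16;
}

/* y[i] += a * x[i], written once against whatever lane count is reported. */
void saxpy_width_agnostic(float a, const float *x, float *y, size_t n)
{
    const size_t lanes = simd_lanes();
    size_t i = 0;

    /* Full-width chunks: each inner loop stands in for one vector operation. */
    for (; i + lanes <= n; i += lanes)
        for (size_t l = 0; l < lanes; ++l)
            y[i + l] += a * x[i + l];

    /* Leftover elements: the part a masked final vector operation would cover. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

The point of the sketch is that nothing in `saxpy_width_agnostic` names a particular register width, so the same code would keep working if the reported lane count changed on a newer machine.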

hayesti answered Oct 19 '22 02:10



There is SIMD ISA convergence between Xeon and Xeon Phi, and ultimately they may become identical. I doubt you will ever get the same SIMD ISA across the whole Intel CPU line; bear in mind that it stretches from a tiny Quark SoC to Xeon Phi. It will be a long time, possibly forever, before AVX-1024 migrates from Xeon Phi to Quark or a low-end Atom CPU.

To get better portability between different CPU families, including future ones, I advise you to use higher-level concepts than bare SIMD instructions or intrinsics. Use OpenCL, OpenMP, Cilk Plus, C++ AMP or an autovectorizing compiler. Quite often, they will do a good job of generating platform-specific SIMD instructions for you.
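
As a rough illustration of the OpenMP route, a loop can be marked for vectorization and left to the compiler. This is a sketch, not a tuned kernel: the function name `scale_add` and its parameters are made up for the example, and it assumes a compiler that honours `#pragma omp simd` (for GCC and Clang this means building with `-fopenmp-simd` or `-fopenmp`; otherwise the pragma is ignored and the loop stays scalar or is autovectorized as usual).

```c
#include <stddef.h>

/* out[i] = a[i] + k * b[i].
   The pragma asks the compiler to vectorize the loop for whichever SIMD ISA
   the build targets; restrict promises that the arrays do not overlap. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + k * b[i];
}
```

The source contains no intrinsics, so moving from an AVX2 target to an AVX-512 one is a matter of changing the compiler's target flags rather than rewriting the function.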

Paul Jurczak answered Oct 19 '22 02:10
