Looking around here and the internet, I can find a lot of posts about modern compilers beating SSE in many real situations, and I have just encountered in some code I inherited that when I disable some SSE code written in 2006 for integer-based image processing and force the code down the standard C branch, it runs faster.
On modern processors with multiple cores and advanced pipelining, etc, does older SSE code underperform gcc -O2
?
You have to be careful with microbenchmarks. It's really easy to measure something other than what you thought you were. Microbenchmarks also usually don't account for code size at all, in terms of pressure on the L1 I-cache / uop-cache and branch-predictor entries.
In most cases, microbenchmarks usually have all the branches predicted as well as they can be, while a routine that's called frequently but not in a tight loop might not do as well in practice.
There have been many additions to SSE over the years. A reasonable baseline for new code is SSSE3 (found in Intel Core2 and later, and AMD Bulldozer and later), as long as there is a scalar fallback. The addition of a fast byte-shuffle (pshufb
) is a game-changer for some things. SSE4.1 adds quite a few nice things for integer code, too. If old code doesn't use it, compiler output, or new hand-written code, could do much better.
Currently we're up to AVX2, which handles two 128b lanes at once, in 256b registers. There are a few 256b shuffle instructions. AVX/AVX2 gives 3-operand (non-destructive dest, src1, src2) versions of all the previous SSE instructions, which helps improve code density even when the two-lane aspect of using 256b ops is a downside (or when targeting AVX1 without AVX2 for integer code).
In a year or two, the first AVX512 desktop hardware will probably be around. That adds a huge amount of powerful features (mask registers, and filling in more gaps in the highly non-orthogonal SSE / AVX instruction set), as well as just wider registers and execution units.
If the old SSE code only gave a marginal speedup over the scalar code back when it was written, or nobody ever benchmarked it, that might be the problem. Compiler advances may lead to the generated code for scalar C beating old SSE that takes a lot of shuffling. Sometimes the cost of shuffling data into vector registers eats up all the speedup of being fast once it's there.
Or depending on your compiler options, the compiler might even be auto-vectorizing. IIRC, gcc -O2
doesn't enable -ftree-vectorize
, so you need -O3
for auto-vec.
Another thing that might hold back old SSE code is that it might assume unaligned loads/stores are slow, and used palignr
or similar techniques to go between unaligned data in registers and aligned loads/stores. So old code might be tuned for an old microarch in a way that's actually slower on recent ones.
So even without using any instructions that weren't available previously, tuning for a different microarchitecture matters.
Compiler output is rarely optimal, esp. if you haven't told it about pointers not aliasing (restrict
), or being aligned. But it often manages to run pretty fast. You can often improve it a bit (esp. for being more hyperthreading-friendly by having fewer uops/insns to do the same work), but you have to know the microarchitecture you're targeting. E.g. Intel Sandybridge and later can only micro-fuse memory operands with one-register addressing mode. Other links at the x86 wiki.
So to answer the title, no the SSE instruction set is in no way redundant or discouraged. Using it directly, with asm, is discouraged for casual use (use intrinsics instead). Using intrinsics is discouraged unless you can actually get a speedup over compiler output. If they're tied now, it will be easier for a future compiler to do even better with your scalar code than to do better with your vector intrinsics.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With