I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why.
By improvement I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.
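To make that definition concrete, here is a minimal sketch of how such a speed-up could be measured. The helper names `time_seconds` and `speedup` are made up for illustration and are not from any library.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock seconds.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speed-up of the SIMD version relative to the non-SIMD version,
// both timed on the same machine. A value above 1 means SIMD is faster.
inline double speedup(double scalar_seconds, double simd_seconds) {
    return scalar_seconds / simd_seconds;
}
```

For example, a scalar loop that takes 2.0 s and a SIMD loop that takes 0.5 s give `speedup(2.0, 0.5)`, i.e. a factor of 4.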
On an AVX processor, the upper half of the 256-bit registers and floating-point units is powered down by the CPU when it is not executing AVX instructions (VEX-encoded opcodes). When code does use AVX instructions, the CPU has to power up the FP units. This takes about 70 microseconds, during which time AVX instructions are actually executed as two 128-bit micro-ops each.
When AVX instructions haven't been used for about 700 microseconds, the CPU powers down the upper half of the circuitry again.
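One practical consequence for benchmarking: if the timed region starts while the upper half is still powered down, the power-up period is included in the measurement. A common way to reduce that effect is to run the kernel a few times untimed first. The sketch below assumes the warm-up runs together last longer than the power-up time quoted above; `best_time_after_warmup` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: calls `kernel` `warmup_runs` times without timing it,
// then returns the fastest of `timed_runs` timed calls, in seconds.
template <typename Kernel>
double best_time_after_warmup(Kernel&& kernel, int warmup_runs, int timed_runs) {
    // Untimed warm-up, so any power-up transition happens before measuring.
    for (int i = 0; i < warmup_runs; ++i) {
        kernel();
    }
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < timed_runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        kernel();
        const auto stop = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(stop - start).count());
    }
    return best;
}
```

Taking the minimum rather than the mean is a design choice that filters out runs disturbed by the operating system. It does not remove the clock-speed effect described next.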
Now it does this because the upper half of the circuitry consumes power (doh!), and so generates heat (double doh!). This means that the CPU runs hotter when AVX instructions are used. Since CPUs can only "turbo boost" when they have thermal headroom, using AVX instructions reduces the chance of that happening, and in fact the CPU actually reduces the "base clock speed". So if you have, for example, a CPU officially clocked at 2.3GHz that can turbo boost to 2.7GHz, then when you start using AVX instructions the chip is clocked down to 2.1GHz and boosted to only 2.3GHz, and in extreme cases the base clock may be reduced to 1.9GHz (see pages 2-4 of this).
At this stage, your CPU is executing ALL instructions about 10-15%, maybe even 20%, SLOWER than when not using AVX instructions. If you're doing loads of SIMD operations, the 256-bit-wide instructions make this worthwhile. But if you're doing a few AVX instructions, then "normal" code, then a bit of AVX again, this clock speed penalty will cost more than all the gains you can make from AVX alone.
This can be why 128-bit-wide SIMD can run faster than 256-bit-wide SIMD unless you've got lengthy, intensive bursts of SIMD-dominated operations. There is a price to using the rest of the silicon... (or perhaps more accurately, a reward for not using it that we sometimes forget we've been getting).
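The trade-off above can be put into a rough back-of-the-envelope model. This is only a sketch under stated assumptions: the whole program runs at the reduced clock, a fixed fraction of the original run time is vectorizable, and that fraction gets a fixed gain from the wider instructions. The function `net_speedup` and its parameters are made up for illustration.

```cpp
// Rough model (an assumption, not a measurement):
//   simd_fraction  - share of the original run time that can use 256-bit SIMD (0..1)
//   simd_gain      - how many times faster that share runs with 256-bit SIMD
//   clock_penalty  - relative drop in clock frequency for ALL code (0.15 means 15%)
// Returns the overall speed-up; a value below 1 means the program got slower.
inline double net_speedup(double simd_fraction, double simd_gain, double clock_penalty) {
    // Work remaining after vectorizing, relative to the original run time.
    const double relative_work = (1.0 - simd_fraction) + simd_fraction / simd_gain;
    // The lower clock stretches all of that work.
    return (1.0 - clock_penalty) / relative_work;
}
```

With illustrative inputs, `net_speedup(0.1, 2.0, 0.15)` is about 0.89, a net slowdown for code that only occasionally touches AVX, while `net_speedup(0.9, 2.0, 0.15)` is roughly 1.5, which is the "lengthy intensive bursts" case.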
(From the comments on the question)
If arithmetic operations are not the bottleneck in an algorithm's execution, then using SIMD will not provide a speed-up. Other bottlenecks could be memory bandwidth, memory latency, cache size, or cache speed. If a processor with AVX outperforms an AVX2 processor in these areas, then it will benefit more from using SIMD intrinsics.
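A simple way to see this is to model a kernel's run time as whichever is slower, the arithmetic or the memory traffic. This is a deliberately crude sketch: it ignores caches and any overlap subtleties, and `kernel_seconds` and all the numbers used with it are hypothetical.

```cpp
#include <algorithm>

// Crude model: the kernel takes as long as its slower resource needs.
//   ops, ops_per_sec     - arithmetic work and the rate the CPU can sustain
//   bytes, bytes_per_sec - memory traffic and the sustained memory bandwidth
inline double kernel_seconds(double ops, double ops_per_sec,
                             double bytes, double bytes_per_sec) {
    const double compute_time = ops / ops_per_sec;
    const double memory_time = bytes / bytes_per_sec;
    return std::max(compute_time, memory_time);
}
```

Suppose a scalar loop does 1e9 operations at 1e9 ops/s and moves 8e9 bytes at 1e10 bytes/s: the model gives 1.0 s. Making the arithmetic four times faster with SIMD (4e9 ops/s) only brings it down to 0.8 s, because memory is now the limit, so the overall speed-up is 1.25 rather than 4.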