Why do processors with only AVX out-perform AVX2 processors for many SIMD algorithms?

Tags: c++, c#, avx, simd, avx2

I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why.

By improvement I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.
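To make that definition concrete, here is a minimal C++ sketch of how the ratio could be computed. The names `seconds_for` and `speedup` are made up for illustration and are not from any library; a real benchmark would also want warm-up runs and repeated measurements.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the question): wall-clock seconds for one call of fn().
template <typename Fn>
double seconds_for(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speed-up as the question defines it: non-SIMD time divided by SIMD time,
// both measured on the same machine.
inline double speedup(double scalar_seconds, double simd_seconds) {
    assert(scalar_seconds > 0.0 && simd_seconds > 0.0);
    return scalar_seconds / simd_seconds;
}
```

Usage would look like `speedup(seconds_for(run_scalar), seconds_for(run_simd))`, where `run_scalar` and `run_simd` stand in for the two implementations being compared. A result of 4.0 means the SIMD version took a quarter of the time.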

eoinmullan asked Feb 26 '16 23:02


2 Answers

On an AVX processor, the upper half of the 256-bit registers and floating-point units is powered down by the CPU when it is not executing AVX instructions (VEX-encoded opcodes). When code does use AVX instructions, the CPU has to power up the FP units. This takes about 70 microseconds, during which time each AVX instruction is actually executed as two 128-bit micro-ops.

When AVX instructions haven't been used for about 700 microseconds, the CPU powers down the upper half of the circuitry again.

Now it does this because the upper half of the circuitry consumes power (doh!), and so generates heat (double doh!). This means that the CPU runs hotter when AVX instructions are used. CPUs can only "turbo boost" when they have thermal headroom, so using AVX instructions reduces the chance of that, and in fact the CPU actually reduces its "base clock speed". So if you have, for example, a CPU officially clocked at 2.3GHz that can turbo boost to 2.7GHz, then when you start using AVX instructions the chip is clocked down to 2.1GHz and boosted to only 2.3GHz, and in extreme cases the base clock may be reduced to 1.9GHz (see pages 2-4 of this).

At this stage, your CPU is executing ALL instructions about 10-15%, maybe even 20%, SLOWER than when not using AVX instructions. If you're doing loads of SIMD operations, the 256-bit-wide instructions make this worthwhile. But if you're doing a few AVX instructions, then "normal" code, then a bit of AVX again, this clock speed penalty will cost more than all the gains you can make from AVX alone.

This can be why 128-bit-wide SIMD can run faster than 256-bit-wide SIMD unless you've got lengthy, intensive bursts of SIMD-dominated operations. There is a price to using the rest of the silicon... (or perhaps more accurately, a reward for not using it that we sometimes forget we've been getting).
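The trade-off above can be put into rough numbers with a toy model. This is only a sketch under stated assumptions, not a measurement: it assumes 256-bit code runs the SIMD portion in exactly half the time of 128-bit code, that the lower clock applies to the whole program, and that the 128-bit version runs at full clock. The function names are made up for illustration.

```cpp
#include <cassert>

// Run time of the 256-bit version relative to the 128-bit version (1.0 = equal).
// simd_fraction: share of the 128-bit run time spent in SIMD work (0..1).
// clock_ratio:   AVX clock divided by normal clock, e.g. 0.85 for a 15% drop.
// Assumption: the SIMD share is halved by going from 128 to 256 bits.
inline double relative_time_256(double simd_fraction, double clock_ratio) {
    assert(simd_fraction >= 0.0 && simd_fraction <= 1.0);
    assert(clock_ratio > 0.0 && clock_ratio <= 1.0);
    const double work = (1.0 - simd_fraction) + simd_fraction / 2.0;
    return work / clock_ratio;  // a slower clock stretches all of the work
}

// SIMD share above which 256-bit wins in this model.
// Derived from (1 - f/2) / c < 1, which gives f > 2 * (1 - c).
inline double break_even_fraction(double clock_ratio) {
    assert(clock_ratio > 0.0 && clock_ratio <= 1.0);
    return 2.0 * (1.0 - clock_ratio);
}
```

With a 15% clock drop (`clock_ratio = 0.85`), the model says 256-bit only pays off once roughly 30% of the run time is SIMD work; with a 20% drop the threshold rises to about 40%. Real code rarely gets the ideal 2x from wider registers, so the real threshold is likely higher.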

Tim answered Sep 21 '22 03:09


(From the comments on the question)

If arithmetic operations are not the bottleneck in an algorithm's execution, then using SIMD will not provide a speed-up. Other bottlenecks could be memory bandwidth, cache sizes, memory speed, or cache speed. If a processor with AVX out-performs an AVX2 processor in these areas, then it will benefit more from using SIMD intrinsics.
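A rough way to see this is a roofline-style toy model, sketched below. It assumes compute and memory traffic overlap perfectly, so the run time is whichever of the two is larger, and that SIMD divides only the compute time by the number of lanes. `modelled_speedup` is a made-up name and the inputs are illustrative, not measured.

```cpp
#include <algorithm>
#include <cassert>

// Modelled speed-up from SIMD when memory time is untouched by vectorisation.
// compute_seconds: time the scalar arithmetic needs on its own.
// memory_seconds:  time needed just to move the data.
// lanes:           ideal SIMD width gain (e.g. 4 floats per 128-bit register).
inline double modelled_speedup(double compute_seconds, double memory_seconds, double lanes) {
    assert(compute_seconds > 0.0 && memory_seconds >= 0.0 && lanes >= 1.0);
    const double scalar_time = std::max(compute_seconds, memory_seconds);
    const double simd_time = std::max(compute_seconds / lanes, memory_seconds);
    return scalar_time / simd_time;
}
```

For example, `modelled_speedup(1.0, 0.1, 4.0)` gives the full 4x because arithmetic dominates, while `modelled_speedup(1.0, 0.8, 4.0)` gives only about 1.25x because the loop becomes memory-bound as soon as the arithmetic gets faster. A machine with a faster memory system has a smaller `memory_seconds` and so shows a larger speed-up from the same SIMD code.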

eoinmullan answered Sep 23 '22 03:09