I have an image processing algorithm to calculate a*b+c*d
with AVX. The pseudo code is as follows:
float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
float *d = new float[N];
// assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps();  // must be zero-initialized before accumulating
double start = cv::getTickCount();
for (int i = 0; i < N; i += 8)     // assume that N is a multiple of 8
{
    __m256 am = _mm256_loadu_ps(a + i);
    __m256 bm = _mm256_loadu_ps(b + i);
    __m256 cm = _mm256_loadu_ps(c + i);
    __m256 dm = _mm256_loadu_ps(d + i);
    __m256 abm = _mm256_mul_ps(am, bm);
    __m256 cdm = _mm256_mul_ps(cm, dm);
    __m256 abcdm = _mm256_add_ps(abm, cdm);
    sum = _mm256_add_ps(sum, abcdm);
}
double time1 = (cv::getTickCount() - start) / cv::getTickFrequency();
I changed the _mm256_mul_ps and _mm256_add_ps above to _mm256_fmadd_ps as follows:
float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
float *d = new float[N];
// assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps();  // must be zero-initialized before accumulating
double start = cv::getTickCount();
for (int i = 0; i < N; i += 8)     // assume that N is a multiple of 8
{
    __m256 am = _mm256_loadu_ps(a + i);
    __m256 bm = _mm256_loadu_ps(b + i);
    __m256 cm = _mm256_loadu_ps(c + i);
    __m256 dm = _mm256_loadu_ps(d + i);
    sum = _mm256_fmadd_ps(am, bm, sum);
    sum = _mm256_fmadd_ps(cm, dm, sum);
}
double time2 = (cv::getTickCount() - start) / cv::getTickFrequency();
But the second version is slower than the first! time1 is 50 ms, while time2 is 90 ms. Is _mm256_fmadd_ps really slower than _mm256_mul_ps + _mm256_add_ps?

I use Ubuntu 16.04 and GCC 7.5.0, with compiler flags: -fopenmp -march=native -O3
Your reduction loops both bottleneck on latency, not throughput, because you're only using one FP vector accumulator. The FMA one is slower because you made the critical path longer (a chain of 2 instructions per loop iteration instead of just 1).
In the add case, the loop-carried dependency chain for sum is only sum = _mm256_add_ps(sum, abcdm);. The other instructions are independent for each iteration, and can have that abcdm input ready to go before the previous vaddps has this iteration's sum ready.
In the fma case, the loop-carried dep chain goes through two _mm256_fmadd_ps operations, both into sum, so yes, you'd expect it to be about twice as slow.
Unroll with more accumulators to hide FP latency (like normal for a dot product). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for much more detail about that and how OoO exec works.
Also see Improving performance of floating-point dot-product of an array with SIMD for a much simpler beginner-friendly example of 2 accumulators.
(Adding up those separate __m256 sum0, sum1, sum2, etc. vars should be done after the loop. You can also use __m256 sum[4] to save typing. You can even use an inner loop over that array; most compilers will fully unroll small fixed-count loops, so you get the desired unrolled asm with each __m256 in a separate YMM register.)
Or let clang auto-vectorize this; it will normally do that unrolling with multiple accumulators for you.
Or if you for some reason didn't want to unroll, you could use FMA while keeping the loop-carried latency low with sum += fma(a, b, c*d); (one mul, one FMA, one add). That of course assumes your compiler didn't "contract" your mul and add into FMA for you if you compiled with -ffast-math; GCC will do that aggressively across statements by default, clang won't.
Once you do this, your throughput will bottleneck on 2 loads per clock (best case, even with aligned arrays for no cache-line splits, which new won't give you), so using FMA barely helps except to reduce the front-end bottleneck. (Compared to a multiple-accumulator mul/add version that needs to run 1 FP op per load to keep up; using multiple accumulators will let you go faster than either original loop: like one iteration (4 loads) per 2 cycles, instead of 1 per 3 cycles with the vaddps latency bottleneck.)
On Skylake and later, FMA/add/mul all have the same latency: 4 cycles. On Haswell/Broadwell, vaddps latency is 3 cycles (one dedicated FP add unit) while FMA latency is 5.
Zen2 has 3 cycle vaddps, 5 cycle vfma....ps (https://uops.info/). (2/clock throughput for both, and on different execution ports, so you could in theory run 2 FMAs and 2 vaddps per clock on Zen2.)
With your longer-latency FMA loop being less than twice as slow, I'm guessing you might be on a Skylake-derived CPU. Perhaps the mul/add version was bottlenecking a bit on the front-end or resource conflicts or something and not quite achieving the expected 1 iteration per 3 clocks latency-limited speed.
In general, see https://uops.info/ for latency and uops / port breakdowns. (also https://agner.org/optimize/).