Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

Tags:

c++

x86

avx

intel

sse

I am computing eight dot products at once with AVX. In my current code I do something like this (before unrolling):

Ivy-Bridge/Sandy-Bridge

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {        
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_add_ps(_mm256_mul_ps(arge0,breg0), tmp0); 
}

Haswell

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {      
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_fmadd_ps(arge0, breg0, tmp0);
}

How many times do I need to unroll the loop for each case to ensure maximum throughput?

For Haswell using FMA3 I think the answer is here FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I need to unroll the loop 10 times.

For Ivy Bridge I think it's 8. Here is my logic. The AVX addition has a latency of 3 and the multiplication a latency of 5. Ivy Bridge can do one AVX multiplication and one AVX addition at the same time using different ports. Using the notation m for multiplication, a for addition, and x for no operation as well as a number to indicate the partial sum (e.g. m5 means multiplication with the 5th partial sum) I can write:

port0:  m1  m2  m3  m4  m5  m6  m7  m8  m1  m2  m3  m4  m5  ... 
port1:   x   x   x   x   x  a1  a2  a3  a4  a5  a6  a7  a8  ...

So by using 8 partial sums after nine clock cycles (four from the load and five from the multiplication) I can submit one AVX load, one AVX addition and one AVX multiplication every clock cycle.

I guess this means that it's impossible to achieve the maximum throughput for this tasks in 32-bit mode in Ivy Bridge and Haswell since 32-bit mode only has eight AVX registers?

Edit: In regards to the bounty. My main questions still hold. I want to get the maximum throughput of either the Ivy Bridge or Haswell functions above, n can be any value greater than or equal to 64. I think this can only be done using unrolling (eight times for Ivy Bridge and 10 times for Haswell). If you think this can be done with another method then let's see it. In some sense this is a variation of How do I achieve the theoretical maximum of 4 FLOPs per cycle?. But instead of only multiplication and addition I'm looking for one 256-bit load (or two 128-bit loads), one AVX multiplication, and one AVX addition every clock cycle with Ivy Bridge or two 256-bit loads and two FMA3 instructions per clock cycle.

I would also like to know how many registers are necessary. For Ivy Bridge I think it's 10. One for the broadcast, one for the load (only one due to register renaming), and eight for the eight partial sums. So I don't think this can be done in 32-bit mode (and indeed when I run in 32-bit mode the performance drops significantly).

I should point out that the compiler can give misleading results Difference in performance between MSVC and GCC for highly optimized matrix multplication code

The current function I'm using for Ivy Bridge is below. This basically multiplies one row of a 64x64 matrix a with all of a 64x64 matrix b (I run this function 64 times on each row of a to get the full matrix multiply in matrix c).

#include <immintrin.h>
extern "C" void row_m64x64(const float *a, const float *b, float *c) {      
    const int vec_size = 8;
    const int n = 64;
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    tmp0 = _mm256_loadu_ps(&c[0*vec_size]);
    tmp1 = _mm256_loadu_ps(&c[1*vec_size]);
    tmp2 = _mm256_loadu_ps(&c[2*vec_size]);
    tmp3 = _mm256_loadu_ps(&c[3*vec_size]);
    tmp4 = _mm256_loadu_ps(&c[4*vec_size]);
    tmp5 = _mm256_loadu_ps(&c[5*vec_size]);
    tmp6 = _mm256_loadu_ps(&c[6*vec_size]);
    tmp7 = _mm256_loadu_ps(&c[7*vec_size]);

    for(int i=0; i<n; i++) {
        __m256 areg0 = _mm256_set1_ps(a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[vec_size*(8*i + 0)]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[vec_size*(8*i + 1)]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[vec_size*(8*i + 2)]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[vec_size*(8*i + 3)]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[vec_size*(8*i + 4)]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[vec_size*(8*i + 5)]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[vec_size*(8*i + 6)]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[vec_size*(8*i + 7)]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    _mm256_storeu_ps(&c[0*vec_size], tmp0);
    _mm256_storeu_ps(&c[1*vec_size], tmp1);
    _mm256_storeu_ps(&c[2*vec_size], tmp2);
    _mm256_storeu_ps(&c[3*vec_size], tmp3);
    _mm256_storeu_ps(&c[4*vec_size], tmp4);
    _mm256_storeu_ps(&c[5*vec_size], tmp5);
    _mm256_storeu_ps(&c[6*vec_size], tmp6);
    _mm256_storeu_ps(&c[7*vec_size], tmp7);
}
like image 857
Z boson Avatar asked Jan 13 '14 12:01

Z boson


People also ask

How can loop unrolling improve program performance?

Improved floating-point performance - loop unrolling can improve performance by providing the compiler more instructions to schedule across the unrolled iterations. This reduces the number of NOPs generated and also provides the compiler with a greater opportunity to generate parallel instructions.

Is unrolling a loop always more efficient?

When there are no loops in your code? When the loop body is large, the loop-control overhead is trivial, and the larger code size due to unrolling can cause instruction cache misses. Loop unrolling almost always results in slower code in most large applications.

What is loop unrolling in compiler design?

Loop unrolling is a compiler optimization applied to certain kinds of loops to reduce the frequency of branches and loop maintenance instructions. It is easily applied to sequential array processing loops where the number of iterations is known prior to execution of the loop.

How does loop unrolling increase the program speed?

Loop unrolling increases the program’s speed by eliminating loop control instruction and loop test instructions. Program 2 is more efficient than program 1 because in program 1 there is a need to check the value of i and increment the value of i every time round the loop.

What does it mean to unroll a loop?

We basically remove or reduce iterations. Loop unrolling increases the program’s speed by eliminating loop control instruction and loop test instructions. Program 2 is more efficient than program 1 because in program 1 there is a need to check the value of i and increment the value of i every time round the loop.

What are the disadvantages of unrolled loops?

Increased program code size, which can be undesirable. Possible increased usage of register in a single iteration to store temporary variables which may reduce performance. Apart from very small and simple codes, unrolled loops that contain branches are even slower than recursions.

Why are unrolled loops slower than recursion?

Possible increased usage of register in a single iteration to store temporary variables which may reduce performance. Apart from very small and simple codes, unrolled loops that contain branches are even slower than recursions. This article is contributed by Harsh Agarwal.


2 Answers

For Sandy/Ivy Bridge you need to unroll by 3:

  • Only FP Add has dependency on the previous iteration of the loop
  • FP Add can issue every cycle
  • FP Add takes three cycles to complete
  • Thus unrolling by 3/1 = 3 completely hides the latency
  • FP Mul and FP Load do not have a dependency on the previous iteration and you can rely on the OoO core to issue them in the near-optimal order. These instructions could affect the unroll factor only if they lowered the throughput of FP Add (not the case here, FP Load + FP Add + FP Mul can issue every cycle).

For Haswell you need to unroll by 10:

  • Only FMA has dependency on the previous iteration of the loop
  • FMA can double-issue every cycle (i.e. on average independent instructions take 0.5 cycles)
  • FMA has latency of 5
  • Thus unrolling by 5/0.5 = 10 completely hides FMA latency
  • The two FP Load microoperations do not have a dependency on the previous iteration, and can co-issue with 2x FMA, so they don't affect the unroll factor.
like image 182
Marat Dukhan Avatar answered Sep 20 '22 11:09

Marat Dukhan


I'm only answering my own question here to add information.

I went ahead and profiled the Ivy Bridge code. When I first tested this in MSVC2012 unrolling by more than two did not help much. However, I suspected that MSVC did not implement the intrinsics optimally based on my observation at Difference in performance between MSVC and GCC for highly optimized matrix multplication code. So I compiled the kernel in GCC with g++ -c -mavx -O3 -mabi=ms, converted the object to COFF64 and dropped it into MSVC and I now get that unrolling by three gives the best result confirming Marat Dunkhan's answer.

Here are the times in seconds, Xeon E5 1620 @3.6GHz MSVC2012

unroll    time default            time with GCC kernel
     1    3.7                     3.2
     2    1.8 (2.0x faster)       1.6 (2.0x faster)
     3    1.6 (2.3x faster)       1.2 (2.7x faster)
     4    1.6 (2.3x faster)       1.2 (2.7x faster)

Here are the times on i5-4250U using fma with GCC in Linux (g++ -mavx -mfma -fopenmp -O3 main.cpp kernel_fma.cpp -o sum_fma)

unroll    time
     1    20.3
     2    10.2 (2.0x faster)
     3     6.7 (3.0x faster) 
     4     5.2 (4.0x faster)
     8     2.9 (7.0x faster)
    10     2.6 (7.8x faster)

The code below is for Sandy-Bridge/Ivy Bridge. For Haswell use e.g. tmp0 = _mm256_fmadd_ps(a8,b8_1,tmp0) instead.

kernel.cpp

#include <immintrin.h>

extern "C" void foo_unroll1(const int n, const float *b, float *c) {      
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=8) {
        __m256 b8 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8), tmp0);
    }
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll2(const int n, const float *b, float *c) {
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=16) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
    }
    tmp0 = _mm256_add_ps(tmp0,tmp1);
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll3(const int n, const float *b, float *c) { 
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 tmp2 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=24) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
        __m256 b8_3 = _mm256_loadu_ps(&b[i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(a8,b8_3), tmp2);
    }
    tmp0 = _mm256_add_ps(tmp0,_mm256_add_ps(tmp1,tmp2));
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll4(const int n, const float *b, float *c) {      
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 tmp2 = _mm256_set1_ps(0.0f);
    __m256 tmp3 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=32) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
        __m256 b8_3 = _mm256_loadu_ps(&b[i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(a8,b8_3), tmp2);
        __m256 b8_4 = _mm256_loadu_ps(&b[i + 24]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(a8,b8_4), tmp3);
    }
    tmp0 = _mm256_add_ps(_mm256_add_ps(tmp0,tmp1),_mm256_add_ps(tmp2,tmp3));
    _mm256_storeu_ps(c, tmp0);
}

main.cpp

#include <stdio.h>
#include <omp.h>
#include <immintrin.h>

extern "C" void foo_unroll1(const int n, const float *b, float *c);
extern "C" void foo_unroll2(const int n, const float *b, float *c);
extern "C" void foo_unroll3(const int n, const float *b, float *c);
extern "C" void foo_unroll4(const int n, const float *b, float *c);

int main() {
    const int n = 3*1<<10;
    const int r = 10000000;
    double dtime;
    float *b = (float*)_mm_malloc(sizeof(float)*n, 64);
    float *c = (float*)_mm_malloc(8, 64);
    for(int i=0; i<n; i++) b[i] = 1.0f;

    __m256 out;
    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll1(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll2(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll3(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll4(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");
}
like image 24
Z boson Avatar answered Sep 18 '22 11:09

Z boson