Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

Tags:

I am computing eight dot products at once with AVX. In my current code I do something like this (before unrolling):

Ivy-Bridge/Sandy-Bridge

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {        
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_add_ps(_mm256_mul_ps(arge0,breg0), tmp0); 
}

Haswell

__m256 areg0 = _mm256_set1_ps(a[m]);
for(int i=0; i<n; i++) {      
    __m256 breg0 = _mm256_load_ps(&b[8*i]);
    tmp0 = _mm256_fmadd_ps(arge0, breg0, tmp0);
}

How many times do I need to unroll the loop for each case to ensure maximum throughput?

For Haswell using FMA3 I think the answer is here FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I need to unroll the loop 10 times.

For Ivy Bridge I think it's 8. Here is my logic. The AVX addition has a latency of 3 and the multiplication a latency of 5. Ivy Bridge can do one AVX multiplication and one AVX addition at the same time using different ports. Using the notation m for multiplication, a for addition, and x for no operation as well as a number to indicate the partial sum (e.g. m5 means multiplication with the 5th partial sum) I can write:

port0:  m1  m2  m3  m4  m5  m6  m7  m8  m1  m2  m3  m4  m5  ... 
port1:   x   x   x   x   x  a1  a2  a3  a4  a5  a6  a7  a8  ...

So by using 8 partial sums after nine clock cycles (four from the load and five from the multiplication) I can submit one AVX load, one AVX addition and one AVX multiplication every clock cycle.

I guess this means that it's impossible to achieve the maximum throughput for this tasks in 32-bit mode in Ivy Bridge and Haswell since 32-bit mode only has eight AVX registers?

Edit: In regards to the bounty. My main questions still hold. I want to get the maximum throughput of either the Ivy Bridge or Haswell functions above, n can be any value greater than or equal to 64. I think this can only be done using unrolling (eight times for Ivy Bridge and 10 times for Haswell). If you think this can be done with another method then let's see it. In some sense this is a variation of How do I achieve the theoretical maximum of 4 FLOPs per cycle?. But instead of only multiplication and addition I'm looking for one 256-bit load (or two 128-bit loads), one AVX multiplication, and one AVX addition every clock cycle with Ivy Bridge or two 256-bit loads and two FMA3 instructions per clock cycle.

I would also like to know how many registers are necessary. For Ivy Bridge I think it's 10. One for the broadcast, one for the load (only one due to register renaming), and eight for the eight partial sums. So I don't think this can be done in 32-bit mode (and indeed when I run in 32-bit mode the performance drops significantly).

I should point out that the compiler can give misleading results Difference in performance between MSVC and GCC for highly optimized matrix multplication code

The current function I'm using for Ivy Bridge is below. This basically multiplies one row of a 64x64 matrix a with all of a 64x64 matrix b (I run this function 64 times on each row of a to get the full matrix multiply in matrix c).

#include <immintrin.h>
extern "C" void row_m64x64(const float *a, const float *b, float *c) {      
    const int vec_size = 8;
    const int n = 64;
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    tmp0 = _mm256_loadu_ps(&c[0*vec_size]);
    tmp1 = _mm256_loadu_ps(&c[1*vec_size]);
    tmp2 = _mm256_loadu_ps(&c[2*vec_size]);
    tmp3 = _mm256_loadu_ps(&c[3*vec_size]);
    tmp4 = _mm256_loadu_ps(&c[4*vec_size]);
    tmp5 = _mm256_loadu_ps(&c[5*vec_size]);
    tmp6 = _mm256_loadu_ps(&c[6*vec_size]);
    tmp7 = _mm256_loadu_ps(&c[7*vec_size]);

    for(int i=0; i<n; i++) {
        __m256 areg0 = _mm256_set1_ps(a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[vec_size*(8*i + 0)]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[vec_size*(8*i + 1)]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[vec_size*(8*i + 2)]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[vec_size*(8*i + 3)]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[vec_size*(8*i + 4)]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[vec_size*(8*i + 5)]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[vec_size*(8*i + 6)]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[vec_size*(8*i + 7)]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    _mm256_storeu_ps(&c[0*vec_size], tmp0);
    _mm256_storeu_ps(&c[1*vec_size], tmp1);
    _mm256_storeu_ps(&c[2*vec_size], tmp2);
    _mm256_storeu_ps(&c[3*vec_size], tmp3);
    _mm256_storeu_ps(&c[4*vec_size], tmp4);
    _mm256_storeu_ps(&c[5*vec_size], tmp5);
    _mm256_storeu_ps(&c[6*vec_size], tmp6);
    _mm256_storeu_ps(&c[7*vec_size], tmp7);
}

857

asked Jan 13 '14 12:01

Z boson

2 Answers

For Sandy/Ivy Bridge you need to unroll by 3:

Only FP Add has dependency on the previous iteration of the loop
FP Add can issue every cycle
FP Add takes three cycles to complete
Thus unrolling by 3/1 = 3 completely hides the latency
FP Mul and FP Load do not have a dependency on the previous iteration and you can rely on the OoO core to issue them in the near-optimal order. These instructions could affect the unroll factor only if they lowered the throughput of FP Add (not the case here, FP Load + FP Add + FP Mul can issue every cycle).

For Haswell you need to unroll by 10:

Only FMA has dependency on the previous iteration of the loop
FMA can double-issue every cycle (i.e. on average independent instructions take 0.5 cycles)
FMA has latency of 5
Thus unrolling by 5/0.5 = 10 completely hides FMA latency
The two FP Load microoperations do not have a dependency on the previous iteration, and can co-issue with 2x FMA, so they don't affect the unroll factor.

182

answered Sep 20 '22 11:09

Marat Dukhan

I'm only answering my own question here to add information.

I went ahead and profiled the Ivy Bridge code. When I first tested this in MSVC2012 unrolling by more than two did not help much. However, I suspected that MSVC did not implement the intrinsics optimally based on my observation at Difference in performance between MSVC and GCC for highly optimized matrix multplication code. So I compiled the kernel in GCC with g++ -c -mavx -O3 -mabi=ms, converted the object to COFF64 and dropped it into MSVC and I now get that unrolling by three gives the best result confirming Marat Dunkhan's answer.

Here are the times in seconds, Xeon E5 1620 @3.6GHz MSVC2012

unroll    time default            time with GCC kernel
     1    3.7                     3.2
     2    1.8 (2.0x faster)       1.6 (2.0x faster)
     3    1.6 (2.3x faster)       1.2 (2.7x faster)
     4    1.6 (2.3x faster)       1.2 (2.7x faster)

Here are the times on i5-4250U using fma with GCC in Linux (g++ -mavx -mfma -fopenmp -O3 main.cpp kernel_fma.cpp -o sum_fma)

unroll    time
     1    20.3
     2    10.2 (2.0x faster)
     3     6.7 (3.0x faster) 
     4     5.2 (4.0x faster)
     8     2.9 (7.0x faster)
    10     2.6 (7.8x faster)

The code below is for Sandy-Bridge/Ivy Bridge. For Haswell use e.g. tmp0 = _mm256_fmadd_ps(a8,b8_1,tmp0) instead.

kernel.cpp

#include <immintrin.h>

extern "C" void foo_unroll1(const int n, const float *b, float *c) {      
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=8) {
        __m256 b8 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8), tmp0);
    }
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll2(const int n, const float *b, float *c) {
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=16) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
    }
    tmp0 = _mm256_add_ps(tmp0,tmp1);
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll3(const int n, const float *b, float *c) { 
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 tmp2 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=24) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
        __m256 b8_3 = _mm256_loadu_ps(&b[i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(a8,b8_3), tmp2);
    }
    tmp0 = _mm256_add_ps(tmp0,_mm256_add_ps(tmp1,tmp2));
    _mm256_storeu_ps(c, tmp0);
}

extern "C" void foo_unroll4(const int n, const float *b, float *c) {      
    __m256 tmp0 = _mm256_set1_ps(0.0f);
    __m256 tmp1 = _mm256_set1_ps(0.0f);
    __m256 tmp2 = _mm256_set1_ps(0.0f);
    __m256 tmp3 = _mm256_set1_ps(0.0f);
    __m256 a8 = _mm256_set1_ps(1.0f);
    for(int i=0; i<n; i+=32) {
        __m256 b8_1 = _mm256_loadu_ps(&b[i + 0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(a8,b8_1), tmp0);
        __m256 b8_2 = _mm256_loadu_ps(&b[i + 8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(a8,b8_2), tmp1);
        __m256 b8_3 = _mm256_loadu_ps(&b[i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(a8,b8_3), tmp2);
        __m256 b8_4 = _mm256_loadu_ps(&b[i + 24]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(a8,b8_4), tmp3);
    }
    tmp0 = _mm256_add_ps(_mm256_add_ps(tmp0,tmp1),_mm256_add_ps(tmp2,tmp3));
    _mm256_storeu_ps(c, tmp0);
}

main.cpp

#include <stdio.h>
#include <omp.h>
#include <immintrin.h>

extern "C" void foo_unroll1(const int n, const float *b, float *c);
extern "C" void foo_unroll2(const int n, const float *b, float *c);
extern "C" void foo_unroll3(const int n, const float *b, float *c);
extern "C" void foo_unroll4(const int n, const float *b, float *c);

int main() {
    const int n = 3*1<<10;
    const int r = 10000000;
    double dtime;
    float *b = (float*)_mm_malloc(sizeof(float)*n, 64);
    float *c = (float*)_mm_malloc(8, 64);
    for(int i=0; i<n; i++) b[i] = 1.0f;

    __m256 out;
    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll1(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll2(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll3(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");

    dtime = omp_get_wtime();    
    for(int i=0; i<r; i++) foo_unroll4(n, b, c);
    dtime = omp_get_wtime() - dtime;
    printf("%f, ", dtime); for(int i=0; i<8; i++) printf("%f ", c[i]); printf("\n");
}

answered Sep 18 '22 11:09

Z boson

Related questions
                            
                                return a scope in C++
                            
                                C++ auto detection of template arguments?
                            
                                Why is the size not a template argument of std::initializer_list?
                            
                                Speed of accessing local vs. global variables in gcc/g++ at different optimization levels
                            
                                How deep do compilers inline functions?
                            
                                Read binary data from std::cin
                            
                                converting cv::Mat for tesseract
                            
                                How do you exit X11 program without Error
                            
                                Workaround for template argument deduction in non-deduced context
                            
                                Number suffix for short [duplicate]
                            
                                Websocket Client in C++ [closed]
                            
                                Is the data in nested std::arrays guaranteed to be contiguous?
                            
                                In what scenarios should I expect to explicitly need to implement a move constructor and move assignment operator?
                            
                                How to use OpenSSL in POCO C++ library correctly
                            
                                undefined reference to `inflate'
                            
                                Partial application with a C++ lambda?
                            
                                C++: Initialization of member variables
                            
                                How to execute a class member function in a separate thread using C++11 thread class?
                            
                                Why would a compiler generate this assembly?
                            
                                Law of Demeter in API Design for C++

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

Tags:

c++

x86

avx

intel

sse

Z boson

People also ask

2 Answers

Marat Dukhan

Z boson

Recent Activity

Donate For Us