Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why gcc autovectorization does not work on convolution matrix biger than 3x3?

I've implemented the following program for convolution matrix

#include <stdio.h>
#include <time.h>

#define NUM_LOOP 1000
#define N 128   //input or output dimention 1
#define M N     //input or output dimention 2
#define P 5 //convolution matrix dimention 1 if you want a 3x3 convolution matrix it must be 3
#define Q P     //convolution matrix dimention 2
#define Csize P*Q   
#define Cdiv  1     //div for filter 
#define Coffset 0   //offset 

//functions
void unusual(); //unusual implementation of convolution
void naive();
//data
unsigned short int input[N][M] __attribute__(( aligned(32))); // input data
unsigned short int output[N][M] __attribute__(( aligned(32))); // out put data
unsigned short int kernel[P][Q] __attribute__(( aligned(32)));//convolution coefficients

int main(){
    struct timespec tStart, tEnd;//used to record the processiing time
    double tTotal , tBest=10000;//minimum of toltal time will asign to the best time

    int w=0;
    do{// this loop repeat the body to record the best time
        clock_gettime(CLOCK_MONOTONIC,&tStart);

        //function to be executed here :

        unusual();

        clock_gettime(CLOCK_MONOTONIC,&tEnd);
        tTotal = (tEnd.tv_sec - tStart.tv_sec);
        tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;

        if(tTotal<tBest)
            tBest=tTotal;
    } while(w++ < NUM_LOOP);

    printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, MAX1, MAX2);

    return 0;
}

//unusual sequential convolution
void unusual(){
    int i, j,k,temp;

    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;
            for(k=0; k< Csize; k++){
                temp += (kernel[k/P][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);

            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}
//The naive implementation
inline void naive(){
    int i, j,k,l,temp;
    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;

            for(k = 0; k <  P; k++){ 
                for(l = 0; l <  Q; l++){
                    temp += (kernel[k][l]) * (input[i - (P/2)+k][j - (Q/2)+l]);
                }
            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}

The problem is when I use -O3 for auto vectorizing, it just works for an 3x3 convolution matrix. I've seen the Assembly output and auto vectorization just make some changes for 3x3 kernel and improve the performance reasonably (20 time faster note: scalar version of unusual func is slower than naive fun) but there is no improvement for 5x5 convolution matrix

UPDATE: I added the naive implementation to the question and changed the picture size to NxM, conv matrix to kernel, Cdim1xCdim2 to PxQ, and seqConv function to unusual for clarification. The question is not to improve the implementation of the unusual function. The question is while all elements are in the same places of the memory, gcc uses heuristic, etc. why gcc fails to improve this unusual implementation? NOTE: the problem is not about the naive implementation. gcc -O3 improve the naive implementation for 3x3, 5x5 kernels by ~7 speedup. and it also does for 7x7 and 9x9 by ~1.5 speedup. To improve the convolution I used intrinsics and speedup is more than 40x over the naive implementation which is ~ 2x faster than unusual convolution. So my vectorization is ~80x faster than my unusual one. Hand tuning optimization is not the problem. Auto-vectorizer optimization is the problem, and the reason of the fails.

GCC command : gcc -Wall -march=native -O3 -o "%e" "%f"

Platform: Linux mint, Skylake, gcc 6.2

Thanks in advance

like image 648
Hossein Amiri Avatar asked Dec 04 '16 23:12

Hossein Amiri


2 Answers

It seems no one is interested in answering this question. So I will share my findings and update my answer in future.

First update: In my experience, gcc -fopt-info-vec reports vectorizing for Csize <= 16 It is because the vectorization factor is 16 and It is one of the reasons that gcc do not vectorize the unusual implementation for other kernel sizes. Vectorization factor refers to the number of elements which can be put in a vector. In this case short integer is equal to 16-bit element.

From wikipedia:

In the first step, the compiler looks for obstacles that can prevent vectorization. A major obstacle for vectorization is true data dependency shorter than the vector length. Other obstacles include function calls and short iteration counts.

like image 170
Hossein Amiri Avatar answered Sep 22 '22 07:09

Hossein Amiri


The main obstacle for auto-vectorizer is non-constant loop variant. In your Implementation if you use int Csize = P*Q; It wont be vectorized. So for help the auto vector you should consider this. It's not the problem because you declared the Csize like #define Csize. But note it in your works. Then your unusual implementation is a loop-transformation of the nave implementation which is an optimization method in compilers. It seems you ruined the naive implementation. Your finding says that its restricted because of the 16 so I unrolled your unusual function and auto-vectorizer says it has been vectorized.

for(k=0; k< P*Q; k+=2){
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + ((k+1)/Q)][j - (Q/2) + ((k+1)%Q)]);
}

It also work for 7x7 kernel:

for(k=0; k< P*Q; k+=4){//IACA_START
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + ((k+1)/Q)][j - (Q/2) + ((k+1)%Q)]);
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + ((k+2)/Q)][j - (Q/2) + ((k+2)%Q)]);
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + ((k+3)/Q)][j - (Q/2) + ((k+3)%Q)]);
}

you dont need to unroll it by your self you might be able to force the compiler to unroll or change the loop structure by #pragma attributes. It is because of the SLP concept which compilers use for auto-vectorizing and interestingly SLP is based on unrolling!.

like image 37
ADMS Avatar answered Sep 24 '22 07:09

ADMS