Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

OpenMP paralelization inhibits vectorization

I am new to OpenMP and I am trying to paralelize following code using OpenMP:

#pragma omp parallel for
for(int k=0;k<m;k++)
{
   for(int j=n-1;j>=0;j--)
   {
       outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];

       for(int i=0;i<j;i++)
       {
           inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
       }
   }
}

Paralelize the outer cycle is pretty straight-forward, but to optimize it, I wanted to paralelize the inner-most cycle (the one iterating over i) as well. But when I try to do that like this:

#pragma omp parallel for
for(int i=0;i<j;i++)
{
    inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
}

the compiler does not vectorize the inner cycle ("loop versioned for vectorization because of possible aliasing"), which makes it run slower. I compiled it using gcc -ffast-math -std=c++11 -fopenmp -O3 -msse2 -funroll-loops -g -fopt-info-vec prog.cpp

Thanks for any advice!

EDIT: I am using __restrict keyword for the arrays.

EDIT2: Interesting is, that when I keep only the pragma in the inner cycle and remove it from the outer, gcc will vectorize it. So the problem only happens, when I try to paralelize both cycles.

EDIT3: The compiler will vectorize the loop when I use #pragma omp parallel for simd. But it's still slower than without parallelizing the inner loop at all.

like image 308
Honza Dejdar Avatar asked Nov 16 '16 12:11

Honza Dejdar


2 Answers

My guess is that after you parallelized the inner loop, your compiler lost the track of inA2, inB2 and outX. By default, it assumes that any regions of memory pointed by any pointers may overlap with each other. In C language the C99 Standard introduced restrict keyword, which informs the compiler that a pointer points to a memory block which is not pointed by any other pointer. C++ haven't got such a keyword, but, fortunately, g++ has an appropriate extension. So try to add __restrict to declarations of the pointers touched by the loop. For example, replace

double* outX;

with

double* __restrict outX;
like image 103
Sergey Avatar answered Sep 25 '22 23:09

Sergey


Have you tried making the inner loop vecotorzed first? and then adding the parallel part (which might result in slower performance depending on cache misses)

#pragma omp parallel for
for(int k=0;k<m;k++)
{
   for(int j=n-1;j>=0;j--)
   {
       outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];
Q1 = k*n
Q2 = n*j
Q3 = m*j + k
#pragma omp declare simd private(i,j,k,m,Q1,Q2,Q3) linear(i) uniform(outX,inA2,inB2) shared(inB2,inA2,outX)
       for(int i=0;i<j;i++)
       {
           inB2[Q1+i] -= inA2[Q2+i] * outX[Q3];
       }
   }
}

It always take me some time getting the #pragma right with the shared, public etc... And I did not test this.

like image 27
Holmz Avatar answered Sep 23 '22 23:09

Holmz