OpenMP paralelization inhibits vectorization

Question

I am new to OpenMP and I am trying to paralelize following code using OpenMP:

#pragma omp parallel for
for(int k=0;k<m;k++)
{
   for(int j=n-1;j>=0;j--)
   {
       outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];

       for(int i=0;i<j;i++)
       {
           inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
       }
   }
}

Paralelize the outer cycle is pretty straight-forward, but to optimize it, I wanted to paralelize the inner-most cycle (the one iterating over i) as well. But when I try to do that like this:

#pragma omp parallel for
for(int i=0;i<j;i++)
{
    inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
}

the compiler does not vectorize the inner cycle ("loop versioned for vectorization because of possible aliasing"), which makes it run slower. I compiled it using gcc -ffast-math -std=c++11 -fopenmp -O3 -msse2 -funroll-loops -g -fopt-info-vec prog.cpp

Thanks for any advice!

EDIT: I am using __restrict keyword for the arrays.

EDIT2: Interesting is, that when I keep only the pragma in the inner cycle and remove it from the outer, gcc will vectorize it. So the problem only happens, when I try to paralelize both cycles.

EDIT3: The compiler will vectorize the loop when I use #pragma omp parallel for simd. But it's still slower than without parallelizing the inner loop at all.

Sergey · Accepted Answer

My guess is that after you parallelized the inner loop, your compiler lost the track of inA2, inB2 and outX. By default, it assumes that any regions of memory pointed by any pointers may overlap with each other. In C language the C99 Standard introduced restrict keyword, which informs the compiler that a pointer points to a memory block which is not pointed by any other pointer. C++ haven't got such a keyword, but, fortunately, g++ has an appropriate extension. So try to add __restrict to declarations of the pointers touched by the loop. For example, replace

double* outX;

with

double* __restrict outX;

double* outX;

with

double* __restrict outX;

Holmz · Answer

Have you tried making the inner loop vecotorzed first? and then adding the parallel part (which might result in slower performance depending on cache misses)

#pragma omp parallel for
for(int k=0;k<m;k++)
{
   for(int j=n-1;j>=0;j--)
   {
       outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];
Q1 = k*n
Q2 = n*j
Q3 = m*j + k
#pragma omp declare simd private(i,j,k,m,Q1,Q2,Q3) linear(i) uniform(outX,inA2,inB2) shared(inB2,inA2,outX)
       for(int i=0;i<j;i++)
       {
           inB2[Q1+i] -= inA2[Q2+i] * outX[Q3];
       }
   }
}

It always take me some time getting the #pragma right with the shared, public etc... And I did not test this.

OpenMP paralelization inhibits vectorization

Tags:

c++

vectorization

openmp

Honza Dejdar

2 Answers

Sergey

Holmz

Recent Activity

Donate For Us

OpenMP paralelization inhibits vectorization

Tags:

c++

vectorization

openmp

Honza Dejdar

2 Answers

Sergey

Holmz

Related questions

Recent Activity

Donate For Us