I am new to OpenMP and I am trying to paralelize following code using OpenMP:
#pragma omp parallel for
for(int k=0;k<m;k++)
{
for(int j=n-1;j>=0;j--)
{
outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];
for(int i=0;i<j;i++)
{
inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
}
}
}
Paralelize the outer cycle is pretty straight-forward, but to optimize it, I wanted to paralelize the inner-most cycle (the one iterating over i) as well. But when I try to do that like this:
#pragma omp parallel for
for(int i=0;i<j;i++)
{
inB2[k*n+i] -= inA2[i+n * j] * outX[k + m*j];
}
the compiler does not vectorize the inner cycle ("loop versioned for vectorization because of possible aliasing"), which makes it run slower. I compiled it using gcc -ffast-math -std=c++11 -fopenmp -O3 -msse2 -funroll-loops -g -fopt-info-vec prog.cpp
Thanks for any advice!
EDIT: I am using __restrict keyword for the arrays.
EDIT2: Interesting is, that when I keep only the pragma in the inner cycle and remove it from the outer, gcc will vectorize it. So the problem only happens, when I try to paralelize both cycles.
EDIT3: The compiler will vectorize the loop when I use #pragma omp parallel for simd. But it's still slower than without parallelizing the inner loop at all.
My guess is that after you parallelized the inner loop, your compiler lost the track of inA2
, inB2
and outX
. By default, it assumes that any regions of memory pointed by any pointers may overlap with each other. In C language the C99 Standard introduced restrict
keyword, which informs the compiler that a pointer points to a memory block which is not pointed by any other pointer. C++ haven't got such a keyword, but, fortunately, g++
has an appropriate extension. So try to add __restrict
to declarations of the pointers touched by the loop. For example, replace
double* outX;
with
double* __restrict outX;
Have you tried making the inner loop vecotorzed first? and then adding the parallel part (which might result in slower performance depending on cache misses)
#pragma omp parallel for
for(int k=0;k<m;k++)
{
for(int j=n-1;j>=0;j--)
{
outX[k+j*m] = inB2[j+n * k] / inA2[j*n + j];
Q1 = k*n
Q2 = n*j
Q3 = m*j + k
#pragma omp declare simd private(i,j,k,m,Q1,Q2,Q3) linear(i) uniform(outX,inA2,inB2) shared(inB2,inA2,outX)
for(int i=0;i<j;i++)
{
inB2[Q1+i] -= inA2[Q2+i] * outX[Q3];
}
}
}
It always take me some time getting the #pragma right with the shared, public etc... And I did not test this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With