I have been working on making my code auto-vectorisable by GCC, but when I include the -fopenmp flag it seems to stop all attempts at auto-vectorisation. I am using the -ftree-vectorize and -ftree-vectorizer-verbose=5 flags to vectorise and monitor it.
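For reference, the compile command I'm using looks roughly like this (the file names here are just placeholders):

gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=5 -fopenmp program.c -o program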
If I leave the flag out, it gives me a lot of information about each loop: whether it was vectorised and, if not, why. The compiler stops when I try to use the omp_get_wtime() function, since that can't be linked without OpenMP. Once the flag is included, it simply lists every function and tells me it vectorised 0 loops in it.
I've read a few other places where the issue has been mentioned, but they don't really come to any solution: http://software.intel.com/en-us/forums/topic/295858 and http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032. Does OpenMP have its own way of handling vectorisation? Do I need to tell it to explicitly?
There is a shortcoming in the GCC vectoriser which appears to have been resolved in recent GCC versions. In my test case GCC 4.7.2 successfully vectorises the following simple loop:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
   a[i] = b[i] + c[i] * d;
At the same time GCC 4.6.1 does not, complaining that the loop contains function calls or data references that cannot be analysed. The bug in the vectoriser is triggered by the way parallel for loops are implemented by GCC. When the OpenMP constructs are processed and expanded, the simple loop code is transformed into something akin to this:
struct omp_fn_0_s
{
   int N;
   double *a;
   double *b;
   double *c;
   double d;
};

void omp_fn_0(struct omp_fn_0_s *data)
{
   int start, end;
   int nthreads = omp_get_num_threads();
   int threadid = omp_get_thread_num();

   // This is just to illustrate the case - GCC uses a bit different formulas
   start = (data->N * threadid) / nthreads;
   end = (data->N * (threadid+1)) / nthreads;

   for (int i = start; i < end; i++)
      data->a[i] = data->b[i] + data->c[i] * data->d;
}

...

struct omp_fn_0_s omp_data_o;

omp_data_o.N = N;
omp_data_o.a = a;
omp_data_o.b = b;
omp_data_o.c = c;
omp_data_o.d = d;

GOMP_parallel_start(omp_fn_0, &omp_data_o, 0);
omp_fn_0(&omp_data_o);
GOMP_parallel_end();

N = omp_data_o.N;
a = omp_data_o.a;
b = omp_data_o.b;
c = omp_data_o.c;
d = omp_data_o.d;
The vectoriser in GCC before 4.7 fails to vectorise that loop. This is NOT an OpenMP-specific problem. One can easily reproduce it with no OpenMP code at all. To confirm this I wrote the following simple test:
struct fun_s
{
   double *restrict a;
   double *restrict b;
   double *restrict c;
   double d;
   int n;
};

void fun1(double *restrict a,
          double *restrict b,
          double *restrict c,
          double d,
          int n)
{
   int i;

   for (i = 0; i < n; i++)
      a[i] = b[i] + c[i] * d;
}

void fun2(struct fun_s *par)
{
   int i;

   for (i = 0; i < par->n; i++)
      par->a[i] = par->b[i] + par->c[i] * par->d;
}
One would expect that both functions (notice - no OpenMP here!) should vectorise equally well because of the restrict keywords, used to specify that no aliasing can happen. Unfortunately this is not the case with GCC < 4.7: it successfully vectorises the loop in fun1 but fails to vectorise the one in fun2, citing the same reason as when it compiles the OpenMP code.
The reason for this is that the vectoriser is unable to prove that par->d does not lie within the memory that par->a, par->b, and par->c point to. This is not always the case with fun1, where two cases are possible:

- d is passed as a value argument in a register;
- d is passed as a value argument on the stack.

On x64 systems the System V ABI mandates that the first several floating-point arguments are passed in XMM registers (YMM on AVX-enabled CPUs). That's how d gets passed in this case and hence no pointer can ever point to it - the loop gets vectorised. On x86 systems the ABI mandates that arguments are passed on the stack, hence d might be aliased by any of the three pointers. Indeed, GCC refuses to vectorise the loop in fun1 if instructed to generate 32-bit x86 code with the -m32 option.
GCC 4.7 gets around this by inserting run-time checks which ensure that neither d nor par->d is aliased.
Getting rid of d
removes the unprovable non-aliasing and the following OpenMP code gets vectorised by GCC 4.6.1:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
   a[i] = b[i] + c[i];
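In the same spirit, a workaround that can help older GCC versions with the fun2 case (my own sketch, not part of the original test) is to copy the struct fields into local variables, so that d becomes a local scalar that no pointer can alias:

void fun2_local(struct fun_s *par)
{
   double *restrict a = par->a;
   double *restrict b = par->b;
   double *restrict c = par->c;
   double d = par->d;   // local copy - nothing can point to it
   int n = par->n;
   int i;

   for (i = 0; i < n; i++)
      a[i] = b[i] + c[i] * d;
}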
I'll try to briefly answer your questions.
"Does OpenMP have its own way of handling vectorisation?"

Yes... but starting from the upcoming OpenMP 4.0, which introduces the simd construct. The link posted above provides good insight into this construct. The current OpenMP 3.1, on the other hand, is not "aware" of the SIMD concept. What happens in practice (or, at least, in my experience) is that auto-vectorization mechanisms are inhibited whenever an OpenMP worksharing construct is used on a loop. Anyhow, the two concepts are orthogonal and you can still benefit from both (see this other answer).
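For illustration, a minimal sketch of how the construct can be applied to the loop from the question (assuming a compiler with OpenMP 4.0 support):

// Vectorisation only (no threading):
#pragma omp simd
for (int i = 0; i < N; i++)
   a[i] = b[i] + c[i] * d;

// Threading and vectorisation combined:
#pragma omp parallel for simd schedule(static)
for (int i = 0; i < N; i++)
   a[i] = b[i] + c[i] * d;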
"Do I need to explicitly tell it to?"

I am afraid so, at least at present. I would start by rewriting the loops under consideration in a way that makes vectorization explicit (i.e. use intrinsics on Intel platforms, Altivec on IBM, and so on).
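For example, a hand-vectorised version of the loop above using SSE2 intrinsics might look roughly like this (a sketch of mine: it assumes n is a multiple of 2 and that the arrays do not overlap):

#include <emmintrin.h>   // SSE2 intrinsics

void fun1_sse(double *a, const double *b, const double *c, double d, int n)
{
   __m128d vd = _mm_set1_pd(d);      // broadcast d into both lanes

   for (int i = 0; i < n; i += 2)    // process two doubles per iteration
   {
      __m128d vb = _mm_loadu_pd(&b[i]);
      __m128d vc = _mm_loadu_pd(&c[i]);
      _mm_storeu_pd(&a[i], _mm_add_pd(vb, _mm_mul_pd(vc, vd)));
   }
}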
You are asking "why can't GCC do vectorization when OpenMP is enabled?"

It seems that this may be a bug in GCC :) http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Otherwise, the OpenMP API may introduce dependencies (either control or data) that prevent automatic vectorization. To auto-vectorize, a given piece of code must be free of data and control dependencies. It's possible that using OpenMP introduces some spurious dependencies.

Note: OpenMP (prior to 4.0) provides thread-level parallelism, which is orthogonal to SIMD/vectorization. A program can use both OpenMP and SIMD parallelism at the same time.
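As a rough sketch of that combination (my own example; nrows, ncols, and the arrays are placeholder names): OpenMP distributes the outer loop across threads while the inner loop stays in a form the auto-vectoriser can handle:

#pragma omp parallel for schedule(static)
for (int row = 0; row < nrows; row++)
{
   double *restrict ar = &a[row * ncols];
   const double *restrict br = &b[row * ncols];
   const double *restrict cr = &c[row * ncols];

   for (int i = 0; i < ncols; i++)   // SIMD-friendly inner loop
      ar[i] = br[i] + cr[i] * d;
}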
I ran across this post while searching for comments about the gcc 4.9 option -fopenmp-simd, which should activate the OpenMP 4 #pragma omp simd without activating omp parallel (threading). gcc bugzilla pr60117 (confirmed) shows a case where the omp pragma prevents auto-vectorization that occurred without the pragma.
gcc doesn't vectorize omp parallel for even with the simd clause (parallel regions can auto-vectorize only the inner loop nested under a parallel for). I don't know of any compiler other than icc 14.0.2 that could be recommended for implementing #pragma omp parallel for simd; with other compilers, SSE intrinsics coding would be required to get this effect.
The Microsoft compiler doesn't perform any auto-vectorization inside parallel regions in my tests, which show clear superiority of gcc for such cases.
Combined parallelization and vectorization of a single loop has several difficulties, even with the best implementation. I seldom see more than a 2x or 3x speedup from adding vectorization to a parallel loop. Vectorization with the AVX double data type, for example, effectively cuts the chunk size by a factor of 4. A typical implementation can achieve aligned data chunks only when the entire array is aligned and the chunks are exact multiples of the vector width. When the chunks are not all aligned, there is inherent work imbalance due to the varying alignments.
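To illustrate the alignment point (my own sketch, not from the original post): each thread's chunk of a double array splits into a scalar peel up to the next 32-byte boundary, an AVX vector body, and a scalar remainder, and the sizes of these pieces differ from chunk to chunk:

#include <stdint.h>

/* Split a chunk [start, end) of array a into a scalar peel, an AVX-sized
   vector body (4 doubles per vector), and a scalar remainder. */
static void split_chunk(const double *a, int start, int end,
                        int *peel_end, int *vec_end)
{
   int i = start;
   while (i < end && ((uintptr_t)&a[i] & 31) != 0)  // advance to a 32-byte boundary
      i++;
   *peel_end = i;                        // scalar peel:      [start, peel_end)
   *vec_end  = i + ((end - i) / 4) * 4;  // vector body:      [peel_end, vec_end)
                                         // scalar remainder: [vec_end, end)
}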