I want to test #pragma omp parallel for and #pragma omp simd on a simple matrix addition program. When I use each of them separately, I get no error and everything seems fine. I also want to test how much performance can be gained by using both of them together. If I use #pragma omp parallel for before the outer loop and #pragma omp simd before the inner loop, I get no error either. The error occurs when I use both of them before the outer loop. It is a compile-time error, not a runtime one (see the compiler output below). ICC and GCC report an error, but Clang does not; this may be because Clang rejects the parallelization. In my experiments, Clang does not parallelize and runs the program with only one thread.
Here is the program:
#include <stdio.h>
//#include <x86intrin.h>

#define N 512
#define M N

int __attribute__((aligned(32))) a[N][M],
    __attribute__((aligned(32))) b[N][M],
    __attribute__((aligned(32))) c_result[N][M];

int main()
{
    int i, j;

    #pragma omp parallel for
    #pragma omp simd
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            c_result[i][j] = a[i][j] + b[i][j];
        }
    }
    return 0;
}
The error for ICC:

IMP1.c(20): error: omp directive is not followed by a parallelizable for loop
  #pragma omp parallel for
  ^

compilation aborted for IMP1.c (code 2)

GCC:

IMP1.c: In function ‘main’:
IMP1.c:21:10: error: for statement expected before ‘#pragma’
 #pragma omp simd
Because in my other tests #pragma omp simd on the outer loop gives better performance, I need to put it there (don't I?).
Platform: Intel Core i7 6700 HQ, Fedora 27
Tested compilers: ICC 18, GCC 7.2, Clang 5
Compiler command lines:
icc -O3 -qopenmp -xHOST -no-vec
gcc -O3 -fopenmp -march=native -fno-tree-vectorize -fno-tree-slp-vectorize
clang -O3 -fopenmp=libgomp -march=native -fno-vectorize -fno-slp-vectorize
#pragma omp parallel spawns a group of threads, while #pragma omp for divides loop iterations between the spawned threads.
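For illustration, the combined construct #pragma omp parallel for is shorthand for a parallel region containing a for construct. A minimal sketch on a hypothetical one-dimensional addition loop (vec_add is an illustrative helper, not from the question):

/* vec_add is a hypothetical helper used only for illustration. */
void vec_add(int n, const int *a, const int *b, int *c)
{
    #pragma omp parallel      /* spawn a team of threads              */
    {
        #pragma omp for       /* divide the iterations among the team */
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }                         /* implicit barrier; threads join here  */
}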
OpenMP SIMD, introduced in the OpenMP 4.0 standard, targets making loops vector-friendly. By placing the simd directive before a loop, the programmer tells the compiler that it can ignore assumed vector dependencies, make the loop as vector-friendly as possible, and respect the user's intention to have multiple loop iterations executed simultaneously.
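As the question already notes, the two directives can instead be split across the loop nest, threading the outer loop and vectorizing the inner one. A sketch of that working arrangement, using the same arrays as the program above:

#pragma omp parallel for      /* distribute rows across threads */
for (i = 0; i < N; i++) {
    #pragma omp simd          /* vectorize along each row       */
    for (j = 0; j < M; j++)
        c_result[i][j] = a[i][j] + b[i][j];
}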
Yes, "There is an implicit barrier at the end of the parallel construct." OpenMP Standard 4.5, 1.3 Execution Model, page 15.
OpenMP allows a programmer to separate a program into serial regions and parallel regions, rather than having to express it as T concurrently executing threads.
From OpenMP 4.5 Specification:
2.11.4 Parallel Loop SIMD Construct
The parallel loop SIMD construct is a shortcut for specifying a parallel construct containing one loop SIMD construct and no other statement.
The syntax of the parallel loop SIMD construct is as follows:
#pragma omp parallel for simd
...
You can also write:
#pragma omp parallel
{
    #pragma omp for simd
    for ...
}
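Applied to the program from the question, the combined construct would look like this (a minimal sketch, keeping the same globals and compiler flags as above):

#include <stdio.h>

#define N 512
#define M N

int __attribute__((aligned(32))) a[N][M],
    __attribute__((aligned(32))) b[N][M],
    __attribute__((aligned(32))) c_result[N][M];

int main()
{
    int i, j;

    /* One combined construct: iterations are divided among threads
       and each thread's chunk is vectorized. */
    #pragma omp parallel for simd
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            c_result[i][j] = a[i][j] + b[i][j];
        }
    }
    return 0;
}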