I have a very weird problem using OpenMP in my C++ code:
void update(double *source, double *target, int n)
{
    /* copy the boundary values, then replace each interior point
       with the average of itself and its two neighbors */
    target[0]   = source[0];
    target[n-1] = source[n-1];
    #pragma omp parallel for
    for (int i = 1; i < n-1; ++i)
        target[i] = (1.0/3.0) * (source[i-1] + source[i] + source[i+1]);
}
Both source and target are double arrays with n elements. The code works fine without OpenMP, but as soon as I use the pragma, it seems to get stuck in this loop. The thing is: I have absolutely NO IDEA why. I hope someone can help me.
How large is n?
The default scheduling for an OpenMP parallel for directive is implementation specific. It looks like in GOMP (the OpenMP implementation used by gcc), the default is (dynamic,1), according to the documentation here. This means that each thread accesses (at i-1 and i+1) memory locations loaded by neighboring threads, which can lead to poor cache utilization. On modern CPU architectures, stencil operations like this one are frequently memory-bound and sensitive to caching. You could try specifying a schedule with larger chunks, for instance with:
#pragma omp parallel for schedule(dynamic,1024)
I'm just using 1024 here as an example. In practice, you should experiment to find the optimal chunk size (or search for it systematically with a parameter sweep, a process often called "auto-tuning"). Alternatively, you could choose a value on more theoretical grounds, for instance by deriving it from the L1 or L2 cache size of your CPU.
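As a rough sketch of that cache-based derivation, inside the update function from the question (the 32 KiB L1 data cache size is an assumption; substitute your CPU's actual value):

/* Size chunks so that one chunk's working set (a stretch of source
   plus the corresponding stretch of target) fits in the L1 data
   cache. 32 KiB is an assumed cache size; check your own CPU. */
const int l1d_bytes = 32 * 1024;
const int chunk = l1d_bytes / (2 * sizeof(double)); /* 2048 iterations */
#pragma omp parallel for schedule(dynamic, chunk)
for (int i = 1; i < n-1; ++i)
    target[i] = (1.0/3.0) * (source[i-1] + source[i] + source[i+1]);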
Alternatively, you could try static scheduling, since the amount of computation per iteration is uniform, and the overhead of the dynamic scheduler may be causing a bottleneck. If you specify
#pragma omp parallel for schedule(static)
without a chunk size, each thread will be assigned a single contiguous chunk of roughly the same size.
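Applied to the loop from the question, it looks like this (a minimal sketch; only the schedule clause changes):

/* static scheduling: each thread gets one contiguous block of
   iterations, so only the block boundaries share cache lines
   with a neighboring thread */
#pragma omp parallel for schedule(static)
for (int i = 1; i < n-1; ++i)
    target[i] = (1.0/3.0) * (source[i-1] + source[i] + source[i+1]);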
Finally, you may also want to pin the OpenMP threads to their own CPU cores. You can do this using the GOMP_CPU_AFFINITY environment variable.
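For example, to bind four threads to cores 0 through 3 (the core numbers and the program name here are just illustrative; adjust them to your machine's topology):

$ GOMP_CPU_AFFINITY="0 1 2 3" ./your_program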
Edit:
I was just playing around with the following test program compiled with gcc 4.2.1, and I think the documentation I linked to above is incorrect. It looks like GOMP defaults to schedule(static).
#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    int i;
    /* print which thread each iteration is assigned to */
    #pragma omp parallel for
    for (i = 0; i < 15; i++) {
        int id = omp_get_thread_num();
        printf("%d assigned to thread %d\n", i, id);
    }
    return 0;
}
And the output with two threads is:
$ ./test_sched | sort -n
0 assigned to thread 0
1 assigned to thread 0
2 assigned to thread 0
3 assigned to thread 0
4 assigned to thread 0
5 assigned to thread 0
6 assigned to thread 0
7 assigned to thread 0
8 assigned to thread 1
9 assigned to thread 1
10 assigned to thread 1
11 assigned to thread 1
12 assigned to thread 1
13 assigned to thread 1
14 assigned to thread 1