I have a very weird problem using OpenMP in my C++ code:
void update(double *source, double *target, int n)
{
    /* copy the boundary values, then replace each interior point
       with the average of itself and its two neighbors */
    target[0]   = source[0];
    target[n-1] = source[n-1];
    #pragma omp parallel for
    for (int i = 1; i < n-1; ++i)
        target[i] = (1.0/3.0) * (source[i-1] + source[i] + source[i+1]);
}
Both source and target are double arrays with n elements. The code works fine without OpenMP, but as soon as I use the pragma, it seems to get stuck in this loop. The thing is: I have absolutely NO IDEA why. I hope someone can help me.
How large is n?
The default scheduling for an OpenMP parallel for directive is implementation specific. It looks like in GOMP (the OpenMP implementation used by gcc), the default is (dynamic,1), according to the documentation here. This means that each thread accesses (at i-1 and i+1) memory locations loaded by neighboring threads, which can lead to poor cache utilization. On modern CPU architectures, stencil operations like this one are frequently memory-bound and sensitive to caching. You could try specifying a schedule with larger chunks, for instance with:
#pragma omp parallel for schedule(dynamic,1024)
I'm just using 1024 here as an example. In practice, you should experiment to find the optimal chunk size (or search for it systematically with a parameter sweep, a process often called "auto-tuning"). Alternatively, you could choose a value on more theoretical grounds, for instance by deriving it from the L1 or L2 cache size of your CPU.
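As a rough sketch of that cache-based derivation, inside the update function from the question (the 32 KiB L1 data cache size is an assumption; substitute your CPU's actual value):

/* Size chunks so that one chunk's working set (a stretch of source
   plus the corresponding stretch of target) fits in the L1 data
   cache. 32 KiB is an assumed cache size; check your own CPU. */
const int l1d_bytes = 32 * 1024;
const int chunk = l1d_bytes / (2 * sizeof(double)); /* 2048 iterations */
#pragma omp parallel for schedule(dynamic, chunk)
for (int i = 1; i < n-1; ++i)
    target[i] = (1.0/3.0) * (source[i-1] + source[i] + source[i+1]);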
Alternatively, you could try static scheduling, since the amount of computation per iteration is uniform, and the overhead of the dynamic scheduler may be causing a bottleneck. If you specify
#pragma omp parallel for schedule(static)
without a chunk size, each thread will be assigned a single contiguous chunk of roughly the same size.
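Applied to the loop from the question, it looks like this (a minimal sketch; only the schedule clause changes):

/* static scheduling: each thread gets one contiguous block of
   iterations, so only the block boundaries share cache lines
   with a neighboring thread */
#pragma omp parallel for schedule(static)
for (int i = 1; i < n-1; ++i)
    target[i] = (1.0/3.0) * (source[i-1] + source[i] + source[i+1]);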
Finally, you may also want to pin the OpenMP threads to their own CPU cores. You can do this using the GOMP_CPU_AFFINITY environment variable.
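For example, to bind four threads to cores 0 through 3 (the core numbers and the program name here are just illustrative; adjust them to your machine's topology):

$ GOMP_CPU_AFFINITY="0 1 2 3" ./your_program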
Edit:
I was just playing around with the following test program compiled with gcc 4.2.1, and I think the documentation I linked to above is incorrect. It looks like GOMP defaults to schedule(static).
#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    int i;
    /* print which thread each iteration is assigned to */
    #pragma omp parallel for
    for (i = 0; i < 15; i++) {
        int id = omp_get_thread_num();
        printf("%d assigned to thread %d\n", i, id);
    }
    return 0;
}
And the output with two threads is:
$ ./test_sched | sort -n
0 assigned to thread 0
1 assigned to thread 0
2 assigned to thread 0
3 assigned to thread 0
4 assigned to thread 0
5 assigned to thread 0
6 assigned to thread 0
7 assigned to thread 0
8 assigned to thread 1
9 assigned to thread 1
10 assigned to thread 1
11 assigned to thread 1
12 assigned to thread 1
13 assigned to thread 1
14 assigned to thread 1