
OpenMP: nowait and reduction clauses on the same pragma

I am studying OpenMP, and came across the following example:

#pragma omp parallel shared(n,a,b,c,d,sum) private(i)
{
    #pragma omp for nowait
    for (i=0; i<n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i=0; i<n; i++)
        c[i] += d[i];
    #pragma omp barrier

    #pragma omp for nowait reduction(+:sum)
    for (i=0; i<n; i++)
        sum += a[i] + c[i];
} /*-- End of parallel region --*/

In the last for loop, there is both a nowait and a reduction clause. Is this correct? Doesn't the reduction clause need to be synchronized?

aperez asked Jun 11 '11 12:06



2 Answers

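The nowait clauses on the second and last loops are partly redundant. On the last loop, nowait only removes a barrier that is immediately followed by the implicit barrier at the end of the parallel region. The OpenMP spec mentions nowait before the end of the region, so perhaps this one can stay in.

But the nowait on the second loop and the explicit barrier after it cancel each other out.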

Lastly, about the shared and private clauses. In your code, shared has no effect, and private simply shouldn’t be used at all: if you need a thread-private variable, just declare it inside the parallel region. In particular, you should declare loop variables inside the loop, not before it.

To make shared useful, you need to tell OpenMP that it shouldn’t share anything by default. You should do this to avoid bugs due to accidentally shared variables. This is done by specifying default(none). This leaves us with:

#pragma omp parallel default(none) shared(n, a, b, c, d, sum)
{
    #pragma omp for nowait
    for (int i = 0; i < n; ++i)
        a[i] += b[i];

    #pragma omp for
    for (int i = 0; i < n; ++i)
        c[i] += d[i];

    #pragma omp for nowait reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] + c[i];
} // End of parallel region
Konrad Rudolph answered Oct 05 '22 23:10



In some regards this seems like a homework problem, which I hate to do for people. On the other hand, the answers above are not totally accurate and I feel should be corrected.

First, while neither the shared nor the private clause is needed in this example, I disagree with Konrad that they shouldn't be used. One of the most common problems when people parallelize code is that they don't take the time to understand how the variables are being used. Failing to privatize and/or protect shared variables that need it accounts for the largest number of problems that I see. Going through the exercise of examining how variables are used and putting them into the appropriate shared, private, etc. clauses will greatly reduce the number of problems you have.
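To illustrate that exercise, here is a minimal sketch that classifies every variable explicitly. The function name scaled_add and the scratch variable tmp are hypothetical, invented only for this illustration; tmp is declared outside the region on purpose, so that leaving it shared would be a race.

```c
/* Hypothetical helper (not from the question): a[i] += factor * b[i]. */
void scaled_add(int n, double *a, const double *b, double factor)
{
    int i;
    double tmp; /* scratch value declared outside the parallel region */

    /* default(none) forces every variable used in the region to be
       classified. n, a, b and factor are only read (or written at
       distinct indices), so they are shared. i and tmp are written by
       every thread, so each thread needs its own copy. */
    #pragma omp parallel default(none) shared(n, a, b, factor) private(i, tmp)
    {
        #pragma omp for
        for (i = 0; i < n; i++) {
            tmp = factor * b[i];
            a[i] += tmp;
        }
    }
}
```

If tmp were left out of the private clause here, default(none) would turn the omission into a compile-time error instead of a silent data race. With GCC, OpenMP is enabled with -fopenmp; without it the pragmas are ignored and the function runs serially with the same result.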

As for the question about the barriers, the first loop can have a nowait clause, because there is no use of the value computed (a) in the second loop. The second loop can have a nowait clause only if the value computed (c) is not used before the values are calculated (i.e., there is no dependency). In the original example code there is a nowait on the second loop, but an explicit barrier before the third loop. This is fine, since your professor was trying to show the use of an explicit barrier - though leaving off the nowait on the second loop would make the explicit barrier redundant (since there is an implicit barrier at the end of a loop).

On the other hand, the nowait on the second loop and the explicit barrier may not be needed at all. Prior to the OpenMP V3.0 specification, many people assumed that two loops with the same static schedule would assign the same iterations to the same threads, but the specification did not actually guarantee this. With the OpenMP V3.0 specification the following was added to section 2.5.1 Loop Construct, Table 2-1 schedule clause kind values, static (schedule):

A compliant implementation of static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and 3) both loop regions bind to the same parallel region. A data dependence between the same logical iterations in two such loops is guaranteed to be satisfied allowing safe use of the nowait clause (see Section A.9 on page 170 for examples).

Now in your example, no schedule was shown on any of the loops, so this may or may not hold. The reason is that the default schedule is implementation defined, and while most implementations currently define the default schedule to be static, there is no guarantee of that. If your professor had put a schedule type of static without a chunk size on all three loops, then nowait could be used on the first and second loops and no barrier (either implicit or explicit) would be needed between the second and third loops at all.
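A sketch of that variant, assuming the three loops are wrapped in a function (the name add_and_sum_static is made up for this illustration) and that schedule(static) with no chunk size is written on every loop:

```c
/* Hypothetical helper (not from the question): adds b into a and d into c,
   then returns the sum of a[i] + c[i] over all i. */
double add_and_sum_static(int n, double *a, const double *b,
                          double *c, const double *d)
{
    double sum = 0.0;

    #pragma omp parallel default(none) shared(n, a, b, c, d, sum)
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; ++i)
            a[i] += b[i];

        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; ++i)
            c[i] += d[i];

        /* No barrier here. All three loops have the same iteration count,
           no chunk size, and bind to the same parallel region, so the
           thread that reads a[i] and c[i] below is the one that wrote
           them above. */
        #pragma omp for schedule(static) nowait reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] + c[i];
    } /* implicit barrier of the parallel region: sum is complete after it */

    return sum;
}
```

This relies entirely on the three conditions in the quoted passage. Changing the schedule kind, the chunk size, or the trip count of just one loop would bring the need for a barrier back.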

Now we get to the third loop and your question about nowait and reduction. As Michy pointed out, the OpenMP specification allows both (reduction and nowait) to be specified. However, synchronization is still needed before the reduction can be considered complete. In the example, the implicit barrier at the end of the third loop can be removed with nowait, because the reduction variable (sum) is not used before the implicit barrier at the end of the parallel region has been reached.

If you look at the OpenMP V3.0 specification, section 2.9.3.6 reduction clause, you will find the following:

If nowait is not used, the reduction computation will be complete at the end of the construct; however, if the reduction clause is used on a construct to which nowait is also applied, accesses to the original list item will create a race and, thus, have unspecified effect unless synchronization ensures that they occur after all threads have executed all of their iterations or section constructs, and the reduction computation has completed and stored the computed value of that list item. This can most simply be ensured through a barrier synchronization.

This means that if you wanted to use the sum variable in the parallel region after the third loop, then you would need a barrier (either implicit or explicit) before you used it. As the example stands now, it is correct.
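For the opposite case, here is a minimal sketch in which sum is read inside the parallel region after a nowait reduction, so a barrier has to come first. The function name deviations_from_mean and the output array out are assumptions made for this illustration only.

```c
/* Hypothetical helper (not from the question):
   out[i] = (a[i] + c[i]) - mean, where mean is the average of a[i] + c[i]. */
void deviations_from_mean(int n, const double *a, const double *c, double *out)
{
    double sum = 0.0;

    #pragma omp parallel default(none) shared(n, a, c, out, sum)
    {
        #pragma omp for nowait reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] + c[i];

        /* Because of nowait, a thread can get here before the reduction
           has been combined into sum. The barrier makes every thread wait
           until all iterations are done and sum holds the final value. */
        #pragma omp barrier

        #pragma omp for
        for (int i = 0; i < n; ++i)
            out[i] = (a[i] + c[i]) - sum / n;
    }
}
```

Dropping the nowait from the first loop would have the same effect as the explicit barrier, since the loop's own implicit barrier would then do the job.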

ejd answered Oct 06 '22 00:10
