Looking at the document here, the following construct is said to be well defined:
#pragma omp parallel //Line 1
{
#pragma omp for nowait //Line 3
for (i=0; i<N; i++)
a[i] = // some expression
#pragma omp for //Line 6
for (i=0; i<N; i++)
b[i] = ...... a[i] ......
}
since the document states:
Here the nowait clause implies that threads can start on the second loop while other threads are still working on the first. Since the two loops use the same schedule here, an iteration that uses a[i] can indeed rely on that value having been computed.
I am having a tough time understanding why this would be. Suppose Line 3 were:
#pragma omp for
then, since there is an implicit barrier just before Line 6, the next for loop will have the values at all indices of a fully computed. But with the nowait in Line 3, how would it work?
Suppose Line 1 triggers 4 threads, t1, t2, t3 and t4. Suppose N is 8 and the partition of indices in the first for loop is thus:
t1: 0, 4
t2: 1, 5
t3: 2, 6
t4: 3, 7
Suppose t1 completes indices 0 and 4 first and lands up at Line 6. What exactly happens now? How is it guaranteed that it now gets to operate on the same indices 0 and 4, for which the a values were correctly computed by it in the first loop? What if the second for loop accesses a[i+1]?
#pragma omp for provides a way to get rid of the implicit barrier at the end of the loop using the nowait keyword, but I did not use it. Checking for master and worker threads and the like is more of an MPI or pthreads style; the idea behind OpenMP is exactly to get rid of all this fiddling around between the master and the rest.
Yes, "There is an implicit barrier at the end of the parallel construct."
The implicit-barrier-wait-end event occurs when a task ends an interval of active or waiting and resumes execution of an implicit barrier region. The implicit-barrier-end event occurs in each implicit task after the barrier synchronization on exit from an implicit barrier region.
There is an implicit barrier at the end of the single construct unless a nowait clause is specified.
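To illustrate that last sentence, here is a minimal, self-contained sketch (the printed messages are made up for the example): with nowait, the threads that do not execute the single block skip its implicit barrier and move straight on to the following worksharing loop.
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* exactly one thread executes this block; nowait removes the
           implicit barrier at the end of the single construct */
        #pragma omp single nowait
        printf("setup done by thread %d\n", omp_get_thread_num());

        /* the other threads do not wait for the single block and can
           start picking up iterations immediately */
        #pragma omp for
        for (int i = 0; i < 8; i++)
            printf("iteration %d run by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}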
The material you quote is wrong. It becomes correct if you add schedule(static) to both loops - this guarantees the same distribution of indices among threads for successive loops. The default schedule is implementation defined; you cannot assume it to be static. To quote the standard:
Different loop regions with the same schedule and iteration count, even if they occur in the same parallel region, can distribute iterations among threads differently. The only exception is for the static schedule as specified in Table 2.5. Programs that depend on which thread executes a particular iteration under any other circumstances are non-conforming.
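For concreteness, a minimal sketch of the fixed construct under that reading (the loop bodies here are placeholders, since the original elides them):
#include <stdio.h>

#define N 8

int main(void) {
    double a[N], b[N];

    #pragma omp parallel
    {
        /* schedule(static) with the same iteration count and chunk size
           guarantees the same iteration-to-thread mapping in both loops,
           so dropping the barrier with nowait is safe */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;            /* placeholder expression */

        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            b[i] = a[i] + 1.0;         /* reads only a[i], same index */
    }

    for (int i = 0; i < N; i++)
        printf("b[%d] = %g\n", i, b[i]);
    return 0;
}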
If the second for loop accesses a[i+1], you must absolutely leave the barrier there.
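In that case, a hedged sketch of the safe form, assuming the same declarations as the sketch above, simply keeps the implicit barrier by dropping nowait:
    #pragma omp parallel
    {
        /* no nowait: the implicit barrier at the end of this loop makes
           every a[i] visible to every thread before the next loop starts */
        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;            /* placeholder expression */

        #pragma omp for schedule(static)
        for (int i = 0; i < N - 1; i++)
            b[i] = a[i + 1];           /* crosses iterations, needs the barrier */
    }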
To me, the statement that there is no potential problem in the example is wrong.
Indeed, since no schedule is explicitly specified, both loops will use the same default schedule. Furthermore, if that scheduling were of static type, then indeed there wouldn't be any issue, since the thread that handles any given element of array a in the second loop would be the same one that wrote it in the first loop.
But the actual problem here is that the default scheduling is not defined by the OpenMP standard; it is implementation defined. For the (many) implementations where the default scheduling is static, there cannot be any race condition in the snippet. But if the default scheduling is dynamic, then, as you noticed, a race condition can happen and the result is undefined.
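A small experiment makes the point visible. The sketch below (the arrays who1 and who2 are just for this demonstration) forces schedule(dynamic,1) and records which thread ran each iteration of each loop. The barrier between the loops is kept so the experiment itself is race-free; it only shows that the two iteration-to-thread mappings are not guaranteed to match, which is exactly what nowait would have to rely on.
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    int who1[N], who2[N];

    #pragma omp parallel
    {
        /* first loop: remember which thread executed each iteration */
        #pragma omp for schedule(dynamic, 1)
        for (int i = 0; i < N; i++)
            who1[i] = omp_get_thread_num();

        /* second loop: the dynamic schedule may hand the same i
           to a different thread */
        #pragma omp for schedule(dynamic, 1)
        for (int i = 0; i < N; i++)
            who2[i] = omp_get_thread_num();
    }

    for (int i = 0; i < N; i++)
        printf("i=%d  first loop: thread %d  second loop: thread %d\n",
               i, who1[i], who2[i]);
    return 0;
}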