OpenMP: What is the benefit of nesting parallelizations?

Tags: c++, multithreading, parallel-processing, nested, openmp

From what I understand, #pragma omp parallel and its variations basically execute the following block in a number of concurrent threads, which corresponds to the number of CPUs. With nested parallelization (parallel for within parallel for, parallel function within parallel function, etc.), what happens in the inner parallelization?

I'm new to OpenMP, and the case I have in mind is probably rather trivial: multiplying a vector by a matrix. This is done in two nested for loops. Assuming the number of CPUs is smaller than the number of elements in the vector, is there any benefit in trying to run the inner loop in parallel? Will the total number of threads be larger than the number of CPUs, or will the inner loop be executed sequentially?

321

asked Nov 30 '10 19:11

eran

1 Answer

(1) Nested parallelism in OpenMP: http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html

You need to turn on nested parallelism by setting the OMP_NESTED environment variable or calling omp_set_nested, because many implementations disable this feature by default, and some do not support nested parallelism fully. (OpenMP 5.0 deprecates both in favor of OMP_MAX_ACTIVE_LEVELS and omp_set_max_active_levels.) Once it is turned on, every parallel for you encounter creates a team of as many threads as OMP_NUM_THREADS specifies. So with two levels of parallelism, the total number of threads can reach N^2, where N = OMP_NUM_THREADS.

Such nested parallelism causes oversubscription (i.e., more busy threads than cores), which may degrade the speedup. In an extreme case, where nested parallel regions are entered recursively, the thread count can balloon (e.g., thousands of threads), and the computer just wastes time on context switching. In that case, you can let the runtime adjust the number of threads dynamically by calling omp_set_dynamic.
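To make the thread multiplication described above concrete, here is a minimal sketch that opens an outer team and, inside each of its threads, an inner team, then counts how many threads ran the inner region. The helper name count_nested_threads is hypothetical, and the sketch uses omp_set_max_active_levels rather than omp_set_nested on the assumption that the runtime supports OpenMP 3.0 or later. The runtime is free to hand out fewer threads than requested, so the result is only bounded above by outer * inner.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: each thread of an outer team of (at most) `outer`
// threads opens an inner team of (at most) `inner` threads. Returns how many
// threads executed the inner region in total.
int count_nested_threads(int outer, int inner) {
    assert(outer > 0 && inner > 0);
#ifdef _OPENMP
    // Allow two active levels of parallelism; without this, many runtimes
    // run the inner region with a team of one thread.
    omp_set_max_active_levels(2);
#endif
    int total = 0;  // shared by both levels, updated atomically
    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            #pragma omp atomic
            total += 1;
        }
    }
    return total;
}
```

Calling count_nested_threads(4, 4) on a 4-core machine may report up to 16 threads, which is the oversubscription scenario. When compiled without OpenMP support the pragmas are ignored and the function returns 1.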

(2) An example of matrix-vector multiplication: the code would look like:


// Input:  A (N by M), B (M by 1)
// Output: C (N by 1), assumed to be zero-initialized
for (int i = 0; i < N; ++i)
  for (int j = 0; j < M; ++j)
    C[i] += A[i][j] * B[j];

In general, parallelizing an inner loop when the outer loop can be parallelized is a bad idea because of the fork/join overhead of threads. (Although many OpenMP implementations pre-create threads, it still takes some work to dispatch tasks to the threads and to run the implicit barrier at the end of each parallel for.)
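As a sketch of the outer-loop approach, the version below puts a single parallel for on the row loop, so each thread computes whole entries of C and there is only one fork/join for the entire product. The function name matvec_outer and the flat row-major storage of A are assumptions made for this illustration; the original snippet indexes A as A[i][j].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: C = A * B, with A stored row-major as N*M doubles.
// Only the outer (row) loop is parallelized; each row is summed sequentially.
std::vector<double> matvec_outer(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 int N, int M) {
    assert(N >= 0 && M >= 0);
    assert(A.size() == static_cast<std::size_t>(N) * static_cast<std::size_t>(M));
    assert(B.size() == static_cast<std::size_t>(M));

    std::vector<double> C(static_cast<std::size_t>(N), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        double sum = 0.0;  // private to the thread handling row i
        for (int j = 0; j < M; ++j)
            sum += A[static_cast<std::size_t>(i) * M + j] * B[j];
        C[i] = sum;        // each row is written by exactly one thread
    }
    return C;
}
```

Accumulating into a local sum and storing it once per row keeps the threads from repeatedly writing to neighboring elements of C inside the inner loop.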

Your concern is the case where N (here the number of rows of A) is smaller than the number of CPUs. Yes, in that case the speedup from the outer loop alone is limited by N, and allowing nested parallelism lets the otherwise idle CPUs do useful work.

However, the same code would then cause oversubscription when N is sufficiently large. I can think of the following solutions:

Change the loop structure so that only a single loop level exists. (This looks doable.)
Specialize the code: if N is small, use nested parallelism; otherwise don't.
Use nested parallelism with omp_set_dynamic. But please check how omp_set_dynamic controls the number of threads and their activity; implementations may vary.
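The second option could look roughly like the sketch below, which turns the inner parallel region on only when N falls under a caller-chosen threshold. The function name matvec_adaptive, the small_n_threshold parameter, and the flat row-major layout of A are all assumptions for illustration and do not come from the original answer. Note that omp_set_max_active_levels changes runtime state that persists after the call, and that a parallel reduction may add the terms in a different order than the sequential loop, so floating-point results can differ in the last bits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: C = A * B with A stored row-major as N*M doubles.
// The inner loop becomes a nested parallel reduction only when
// N < small_n_threshold; otherwise only the outer loop is parallel.
std::vector<double> matvec_adaptive(const std::vector<double>& A,
                                    const std::vector<double>& B,
                                    int N, int M, int small_n_threshold) {
    assert(N >= 0 && M >= 0);
    assert(A.size() == static_cast<std::size_t>(N) * static_cast<std::size_t>(M));
    assert(B.size() == static_cast<std::size_t>(M));

    const bool nest = N < small_n_threshold;
#ifdef _OPENMP
    // Two active levels only when nesting is wanted (persistent side effect).
    omp_set_max_active_levels(nest ? 2 : 1);
#endif

    std::vector<double> C(static_cast<std::size_t>(N), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        double sum = 0.0;
        // The if clause makes this region run with one thread when nest is false.
        #pragma omp parallel for reduction(+:sum) if(nest)
        for (int j = 0; j < M; ++j)
            sum += A[static_cast<std::size_t>(i) * M + j] * B[j];
        C[i] = sum;
    }
    return C;
}
```

A reasonable starting value for small_n_threshold might be the number of available cores, but that is a guess to be tuned by measurement rather than a rule.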

117

answered Oct 07 '22 14:10

minjang
