
Multiple levels of parallelism using OpenMP - Possible? Smart? Practical?

I am currently working on a C++ sparse matrix/math/iterative solver library, for a simulation tool I manage. I would have preferred to use an existing package, however, after extensive investigation, none were found that were appropriate for our simulator (we looked at flens, it++, PetSC, eigen, and several others). The good news is my solvers and sparse matrix structures are now very efficient and robust. The bad news is, I am now looking into parallelization using OpenMP, and the learning curve is a bit steep.

The domain we solve can be broken into sub-domains, which come together in a block-diagonal format. So our storage scheme ends up looking like an array of smaller square matrices (blocks[]) each with a format appropriate to the sub-domain (e.g. Compressed Row Storage: CRS, Compressed Diagonal Storage: CDS, Dense, etc..), and a background matrix (currently using CRS) that accounts for the connectivity between sub-domains.

The "hot spot" in most (all?) iterative solvers is the Matrix Vector multiplication operation, and this is true of my library. Thus, I've been focusing my efforts on optimizing my MxV routines. For the block diagonal structure, the pseudo code for M*x=b would be as follows:

b = background_matrix * x
start_index = 1;
end_index = 0;
for (i = 1 : number_of_blocks) {
    end_index = start_index + blocks[i].numRows() - 1;  // inclusive range
    b.range(start_index, end_index) += blocks[i] * x.range(start_index, end_index);
    start_index = end_index + 1;
}

where background_matrix is the background (CRS) matrix, blocks is the array of sub-domain matrices, and .range returns the portion of the vector from a starting index to an ending index.

Obviously the loop can be (and has been) parallelized, as each iteration is independent of the others (the ranges are non-overlapping). Since we have 10-15 blocks in a typical system, 4+ threads actually makes a significant difference.

The other place where parallelization has been seen to be a good option is in the MxV operation for each sub-domain storage scheme (the calls on lines 1 and 6 in the code above). There is plenty out there on parallelizing CRS, CDS, and dense matrix MxV operations. Typically a nice boost is seen with 2 threads, with greatly diminishing returns as more threads are added.

I am envisioning a scheme where 4 threads would be used in the block loop for the above code, and each of those threads would use 2 threads for the sub-domain solves. However, I am not sure how one would manage the pool of threads using OpenMP: is it possible to limit the number of threads in an OpenMP for loop? Is this multi-level parallelism something that makes sense in practice? Any other thoughts on what I've proposed here would be appreciated (and thanks for reading all the way to the end!)

MarkD asked Jul 01 '10 16:07



1 Answer

Please note that everything I describe is implementation dependent.

Is it possible to limit the number of threads in an openmp for loop?

Yes. There are different ways to do so. Set omp_set_nested(1); and use something like a #pragma omp parallel for num_threads(4) directive on your outer loop and a #pragma omp parallel for num_threads(2) directive on your inner loop. This should give you 8 threads (depending on the implementation, you may also have to set OMP_THREAD_LIMIT if you have fewer than 8 cores).

Alternatively, you can manually unroll your loops, e.g. using something like

#pragma omp parallel sections
{
    #pragma omp section
    {
        // do your stuff for the first part, nest a parallel region again
    }
    #pragma omp section
    {
        // and so on for the other parts
    }
}

You can do the same thing sometimes more efficiently in OpenMP 3.0 with #pragma omp task.

Or you start 8 threads and get the current thread number within the parallel section and schedule manually based on the thread number.

Finally, if you have a perfectly nested loop (a loop is perfectly nested if the actual assignment only happens in the innermost loop), you can rewrite everything into a single loop. Basically, pack your two iterators i and j into one big iterator (i, j). Note that this might reduce locality and therefore decrease performance.

Is this multi-level parallelism something that in practice makes sense?

It depends, and you have to find out yourself. In general, multi-level parallelism makes your problem more scalable. Scheduling, however, can be more complicated. This paper might be interesting.

Regarding setting the number of threads manually: the main advantage of setting the numbers of threads is that you can use specific knowledge about your problem when scheduling. Thereby, you can reduce overhead and get higher locality of the code being executed and hence more cache hits and less main memory I/O.

The main disadvantage of manually setting the number of threads in nested parallelism is that threads in the innermost loop might wait idly at the implicit barrier while additional work could be done (example). Also, coarse-grained parallelism doesn't scale well. So if your outer loop has very different run-times within the loop, you want to schedule more flexibly than simply splitting into 4 threads.

Any other thoughts

Have you thought about doing the MxV with SIMD? Depending on the architecture, this can give a speed-up of 2-4. I quickly googled this presentation for you.

For MxV, loop tiling, register and cache blocking and related techniques can increase data locality and reduce other issues, e.g. false sharing. This book, chapter 11 (you can preview it), might give you some additional ideas on how to restructure data access.

stephan answered Sep 24 '22 04:09