OpenMP schedule(static) with no chunk size specified: chunk size and order of assignment

Tags:

openmp

I have a few questions regarding #pragma omp for schedule(static) where the chunk size is not specified.

One way to parallelize a loop in OpenMP is to do it manually like this:

#include <omp.h>   // for omp_get_num_threads() and omp_get_thread_num()

// N is the total number of iterations, defined elsewhere
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread  = omp_get_thread_num();
    const int start  = ithread*N/nthreads;       // first iteration for this thread
    const int finish = (ithread+1)*N/nthreads;   // one past the last iteration
    for(int i = start; i < finish; i++) {
        // loop body
    }
}

Is there a good reason not to manually parallelize a loop like this in OpenMP? If I compare the values with #pragma omp for schedule(static), I see that the chunk sizes for a given thread don't always agree, so OpenMP (in GCC) is implementing the chunk sizes differently than as defined by start and finish. Why is this?

The start and finish values I defined have several convenient properties.

  1. Each thread gets at most one chunk.
  2. The range of iteration values increases directly with thread number (i.e. for 100 iterations with two threads, the first thread will process iterations 1-50 and the second thread iterations 51-100, not the other way around).
  3. For two for loops over exactly the same range each thread will run over exactly the same iterations.

Edit: Originally I said exactly one chunk, but after thinking about it, it's possible for the size of a chunk to be zero if the number of threads is much larger than N (in that case ithread*N/nthreads can equal (ithread+1)*N/nthreads, giving an empty range). The property I really want is at most one chunk.

Are all these properties guaranteed when using #pragma omp for schedule(static)?

According to the OpenMP specifications:

Programs that depend on which thread executes a particular iteration under any other circumstances are non-conforming.

and

Different loop regions with the same schedule and iteration count, even if they occur in the same parallel region, can distribute iterations among threads differently. The only exception is for the static schedule.

For schedule(static) the specification says:

chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.

Additionally the specification says for schedule(static):

When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread.

Finally, the specification says for schedule(static):

A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, 3) both loop regions bind to the same parallel region.

So if I read this correctly, schedule(static) will have the same convenient properties I listed for start and finish, even though my code relies on which thread executes a particular iteration. Do I interpret this correctly? This seems to be a special case for schedule(static) when the chunk size is not specified.

It's easier to just define start and finish like I did than to try and interpret the specification for this case.

asked Sep 11 '13 by Z boson

1 Answer

Is there a good reason not to manually parallelize a loop like this in OpenMP?

First things that came to my mind:

  1. You add a dependency on the OpenMP runtime, and thus you either have to duplicate part of your code if you want to maintain a serial compilation, or you have to provide stubs for the runtime function calls.
  2. A worksharing construct requires less code and is easier to read than an explicitly parallelized loop. If you need to maintain a large code base, that matters.
  3. You miss all the scheduling opportunities apart from schedule(static).
  4. Messing with low-level details can easily backfire.

Are all these properties guaranteed when using #pragma omp for schedule(static)?

Let's see one by one:

1.) Each thread gets exactly one chunk

When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case.

At most one chunk is not the same as exactly one chunk, so property one is not fulfilled. This is also why the size of the chunks is unspecified in this case.

2.) The range of iteration values increases directly with thread number (i.e. for 100 iterations with two threads the first thread will process iterations 1-50 and the second thread iterations 51-100, not the other way around)

A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied:

  1. both loop regions have the same number of loop iterations
  2. both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified
  3. both loop regions bind to the same parallel region.

A data dependence between the same logical iterations in two such loops is guaranteed to be satisfied allowing safe use of the nowait clause (see Section A.10 on page 182 for examples).

Even though I have never seen anything different from what you describe, I dare say that even property two is not guaranteed, at least not for schedule(static) with no chunk size. It seems to me that for an iteration space of a certain cardinality, the only guarantee is that the same "logical iteration numbers" will be given to the same thread if conditions 1, 2 and 3 are respected.

It is indeed guaranteed if you specify the chunk size:

When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.

3.) For two for loops over exactly the same range each thread will run over exactly the same iterations

This is indeed guaranteed, and it is even more general: for two loops with iteration spaces of the same cardinality, the same "logical iteration numbers" will be given to each thread. Example A.10.2c of the OpenMP 3.1 standard should clarify this point.

answered Sep 22 '22 by Massimiliano