Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

All OpenMP Tasks running on the same thread

I have wrote a recursive parallel function using tasks in OpenMP. While it gives me the correct answer and runs fine I think there is an issue with the parallelism.The run-time in comparison with a serial solution does not scale in the same other parallel problem I have solved without tasks have. When printing each thread for the tasks they are all running on thread 0. I am compiling and running on Visual Studio Express 2013.

int parallelOMP(int n)
{

    int a, b, sum = 0;
    int alpha = 0, beta = 0;

    for (int k = 1; k < n; k++)
    {

        a = n - (k*(3 * k - 1) / 2);
        b = n - (k*(3 * k + 1) / 2);


        if (a < 0 && b < 0)
            break;


        if (a < 0)
            alpha = 0;

        else if (p[a] != -1)
            alpha = p[a];

        if (b < 0)
            beta = 0;

        else if (p[b] != -1)
            beta = p[b];


        if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
        {
            #pragma omp parallel
            {
                #pragma omp single
                {
                    #pragma omp task shared(p), untied
                    {
                        cout << omp_get_thread_num();
                        p[a] = parallelOMP(a);
                    }
                    #pragma omp task shared(p), untied
                    {
                        cout << omp_get_thread_num();
                        p[b] = parallelOMP(b);
                    }
                    #pragma omp taskwait
                }
            }

            alpha = p[a];
            beta = p[b];
        }

        else if (a > 0 && p[a] == -1)
        {
            #pragma omp parallel
            {
                #pragma omp single
                {
                    #pragma omp task shared(p), untied
                    {
                        cout << omp_get_thread_num();
                        p[a] = parallelOMP(a);
                    }

                    #pragma omp taskwait
                }
            }

            alpha = p[a];
        }

        else if (b > 0 && p[b] == -1)
        {
            #pragma omp parallel
            {
                #pragma omp single
                {
                    #pragma omp task shared(p), untied
                    {
                        cout << omp_get_thread_num();
                        p[b] = parallelOMP(b);
                    }

                    #pragma omp taskwait
                }
            }

            beta = p[b];
        }


        if (k % 2 == 0)
            sum += -1 * (alpha + beta);
        else
            sum += alpha + beta;


    }

    if (sum > 0)
        return sum%m;
    else
        return (m + (sum % m)) % m;
}
like image 319
Ben McAlindin Avatar asked Dec 05 '22 06:12

Ben McAlindin


2 Answers

Sometimes I wish comments on SO could be as richly formatted as the answers, but alas that's not the case. Therefore, here comes a long comment disguised as an answer.

It appears that a very common mistake in writing recursive OpenMP code is not understanding how exactly parallel regions work. Consider the following code (uses explicit tasks, therefore support for OpenMP 3.0 or newer required):

void par_rec_func (int arg)
{
   if (arg <= 0) return;

   #pragma omp parallel num_threads(2)
   {
      #pragma omp task
      par_rec_func(arg-1);

      #pragma omp task
      par_rec_func(arg-1);
   }
}

// somewhere in the main function
par_rec_func(10);

There is a problem with this code. The problem is that, except for the top-level invocation of par_rec_func(), in all other invocations the parallel region will be created in the context of an enclosing outer parallel region. This is called nested parallelism and by default is disabled, which means that all parallel regions beneath the top-level one are going to be inactive, i.e. they will execute serially. Since tasks bind to the innermost parallel region, they will also get executed in serial. What will happen with this code is that it will spawn one additional thread (for a total of two) at the top-level invocation of par_rec_func() and each thread will then execute a whole branch of the recursion tree (i.e. one half of the whole tree). If one runs that code on a machine with 64 cores, 62 of them will idle. In order for the nested parallelism to be enabled, one has to either set the environment variable OMP_NESTED to true or call omp_set_nested() and pass it a true flag:

omp_set_nested(1);

Once nested parallelism has been enabled, one faces a new problem. Every time a nested parallel region is encountered, the encountering thread will either spawn an additional one (because of num_threads(2)) or acquire an idle thread from the runtime's thread pool. At every deeper level of recursion, this program will require twice as many threads as at the previous level. Though an upper limit of the total number of threads could be set via OMP_THREAD_LIMIT (another OpenMP 3.0 feature) and with the overhead aside, this is not what one really wants in such cases.

The correct solution in that case is to use orphaned tasks in the dynamic scope of a single parallel region:

void par_rec_func (int arg)
{
   if (arg <= 0) return;

   #pragma omp task
   par_rec_func(arg-1);

   #pragma omp task
   par_rec_func(arg-1);

   // Wait for the child tasks to complete if necessary
   #pragma omp taskwait
}

// somewhere in the main function
#pragma omp parallel
{
   #pragma omp single
   par_rec_func(10);
}

The advantages of this method are many. First of all, only a single parallel region is created with as many threads as specified (e.g. by setting OMP_NUM_THREADS or by any other means). When the child tasks call recursively into par_rec_func(), that simply adds new tasks to the parallel region without spawning new threads. This greatly helps in the case where the recursion tree is not balanced, since many quality OpenMP runtimes implement task stealing, e.g. thread i could execute child tasks of a task that executes in thread j, where i != j.

Given an OpenMP 2.0 compiler like VC++, one cannot do much except to approximate the above idea by using nested parallelism and explicitly disabling it at a certain level:

void par_rec_func (int arg)
{
   if (arg <= 0) return;

   int level = omp_get_level();

   #pragma omp parallel sections num_threads(2) if(level < 4)
   {
      #pragma omp section
      par_rec_func(arg-1);

      #pragma omp section
      par_rec_func(arg-1);
   }
}

// somewhere in the main function
int saved_nested = omp_get_nested();
omp_set_nested(1);

par_rec_func(10);

omp_set_nested(saved_nested);

omp_get_level() is used to determine the level of nesting and the if clause is used to selectively deactivate parallel regions at fourth or deeper level of nesting. This solution is dumb and won't work well when the recursion tree is unbalanced.

like image 184
Hristo Iliev Avatar answered Dec 09 '22 15:12

Hristo Iliev


Actual Problem:

You are using Visual Studio 2013.

Visual Studio has never supported OMP versions beyond 2.0 (see here).

OMP Tasks are a feature of OMP 3.0 (see spec).

Ergo, using VS at all means no OMP tasks for you.

If OMP Tasks are an essential requirement, use a different compiler. If OMP is not an essential requirement, you should consider an alternative parallel task handling library. Visual Studio includes the MS Concurrency Runtime, and the Parallel Patterns Library built on top of it. I have recently moved from OMP to PPL due to the fact I'm using VS for work; it isn't quite a drop-in replacement but it is quite capable.


My second attempt at solving this, again preserved for historical reasons:

So, the problem is almost certainly that you're defining your omp tasks outside of a omp parallel region.

Here's a contrived example:

void work()
{
    #pragma omp parallel
    {
        #pragma omp single nowait
        for (int i = 0; i < 5; i++)
        {
            #pragma omp task untied
            {
                std::cout << 
                    "starting task " << i << 
                    " on thread " << omp_get_thread_num() << "\n";

                sleep(1);
            }
        }
    }
}

If you omit the parallel declaration, the job runs serially:

starting task 0 on thread 0
starting task 1 on thread 0
starting task 2 on thread 0
starting task 3 on thread 0
starting task 4 on thread 0

But if you leave it in:

starting task starting task 3 on thread 1
starting task 0 on thread 3
2 on thread 0
starting task 1 on thread 2
starting task 4 on thread 2

Success, complete with authentic misuse of shared output resources.

(for reference, if you omit the single declaration, each thread will run the loop, resulting in 20 tasks being run on my 4 cpu VM).


Original answer included below for completeness, but no longer relevant!

In every case, your omp task is a single, simple thing. It probably runs and completes immediately:

#pragma omp task shared(p), untied
cout << omp_get_thread_num();

#pragma omp task shared(p), untied
cout << omp_get_thread_num();

#pragma omp task shared(p), untied
cout << omp_get_thread_num();

#pragma omp task shared(p), untied
cout << omp_get_thread_num();

Because you never start one long-running task before firing off the next task, everything will probably run on the first allocated thread.

Perhaps you meant to do something like this?

if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
{
    #pragma omp task shared(p), untied
    {
        cout << omp_get_thread_num();
        p[a] = parallelOMP(a);
    }

    #pragma omp task shared(p), untied
    {
        cout << omp_get_thread_num();
        p[b] = parallelOMP(b);
    }

    #pragma omp taskwait

    alpha = p[a];
    beta = p[b];
}
like image 34
Rook Avatar answered Dec 09 '22 14:12

Rook