
Partially parallel loops using openmp tasks

Prerequisites:

  • parallel engine: OpenMP 3.1+ (can be OpenMP 4.0 if needed)
  • parallel constructs: OpenMP tasks
  • compiler: gcc 4.9.x (supports OpenMP 4.0)

Input:

  • C code with loops
  • the loop has a cross-iteration data dependency: iteration “i+1” needs data from iteration “i” (only this kind of dependency, nothing else)
  • the loop body can be partially dependent
  • the loop cannot be split in two loops; the loop body should remain solid
  • anything reasonable can be added to the loop or to the loop body function definition

Code sample:

(Here the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)

void loopFunc(const char* config, int* value)
{
    int conf;
    conf = prepare(config);         // independent, does not change “config”
    *value = process(conf, *value); // dependent, takes prev., produce next
    return;
}

int main()
{
    int N = 100;
    char* configData;           // never changes
    int valueData = 0;          // initial value
    …
    for (int i = 0; i < N; i++)
    {
        loopFunc(configData, &valueData);
    }
    …
}

Need to:

  • parallelise the loop using omp tasks (omp for / omp sections cannot be used)
  • “prepare” functions should be executed in parallel with other “prepare” or “process” functions
  • “process” functions should be ordered according to the data dependency

What has been proposed and implemented:

  • define an integer flag
  • assign it the number of the first iteration
  • each iteration, when it needs data, waits for the flag to equal its own iteration number
  • update the flag value when the data for the next iteration is ready

Like this:

(A reminder: the conf/config/configData variables are used for illustration purposes only; the main interest is in the value/valueData variables.)

void loopFunc(const char* config, int* value, volatile int *parSync, int iteration)
{
    int conf;
    conf = prepare(config);         // independent, does not change “config”
    while (*parSync != iteration)   // wait for previous to be ready
    {
        #pragma omp taskyield
    }
    *value = process(conf, *value); // dependent, takes prev., produce next
    *parSync = iteration + 1;       // inform next about readiness
    return;
}

int main()
{
    int N = 100;
    char* configData;           // never changes
    int valueData = 0;          // initial value
    volatile int parallelSync = 0;
    …
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < N; i++)
        {
            #pragma omp task shared(configData, valueData, parallelSync) firstprivate(i)
                loopFunc(configData, &valueData, &parallelSync, i);
        }
        #pragma omp taskwait
    }
    …
}

What happened:

It fails. :)

The reason is that an OpenMP task occupies an OpenMP thread until it completes. For example, suppose we define 5 OpenMP threads (as in the code above):

  • The “for” loop generates 100 tasks.
  • The OpenMP runtime assigns 5 arbitrary tasks to the 5 threads and starts them.

If the task with i=0 is not among the started tasks (which happens from time to time), the started tasks wait forever, occupy their threads forever, and the task with i=0 never gets started.

What's next?

I have no other ideas for how to implement the required mode of computation.

Current solution

Thanks to @parallelgeek below for the idea.

int main()
{
    int N = 10;
    char* configData;           // never changes
    int valueData = 0;          // initial value
    volatile int parallelSync = 0;
    int workers;
    volatile int workingTasks = 0;
    ...
    omp_set_num_threads(5);
    #pragma omp parallel
    #pragma omp single
    {
        workers = omp_get_num_threads()-1;  // reserve 1 thread for task generation

        for (int i = 0; i < N; i++)
        {
            while (workingTasks >= workers)
            {
                #pragma omp taskyield
            }

            #pragma omp atomic update
                workingTasks++;

            #pragma omp task shared(configData, valueData, parallelSync, workingTasks) firstprivate(i)
            {
                loopFunc(configData, &valueData, &parallelSync, i);

                #pragma omp atomic update
                    workingTasks--;
            }
        }
        #pragma omp taskwait
    }
}
Asked Sep 24 '15 by Badiboy

1 Answer

  1. AFAIK, volatile does not prevent hardware reordering; that's why you could end up with a mess in memory where the data is not yet written while the flag is already seen by the consuming thread as true.
  2. Hence a little piece of advice: use C11 atomics instead, in order to ensure visibility of the data. As far as I can see, gcc 4.9 supports C11 (see the C11 status page for GCC).
  3. You could try to divide the generated tasks into groups of K tasks, where K == ThreadNum, and start generating subsequent tasks (after the tasks in the first group are generated) only once any of the running tasks has finished. Thus you keep the invariant that at any time only K tasks are running, scheduled on K threads.
  4. Inter-task dependencies could also be met by using atomic flags from C11.
Answered Nov 04 '22 by parallelgeek