
Producer-Consumer using OpenMP-Tasks

I'm trying to implement a parallel algorithm using tasks within OpenMP. The parallel programming pattern is based on the producer-consumer idea, but since the consumer process is slower than the producer, I want to use a few producers and several consumers. The main idea is to create as many OS threads as there are producers, and then each of these will create tasks to be done in parallel (by the consumers). Every producer will be associated with a proportional number of consumers (i.e. numCheckers/numSeekers). I'm running the algorithm on a dual-chip Intel server with 6 cores per chip.

The thing is that when I use only one producer (seeker) and an increasing number of consumers (checkers), the performance decays very fast as the number of consumers grows (see the table below), even though the correct number of cores is working at 100%. On the other hand, if I increase the number of producers, the average time decreases or at least stays stable, even with a proportional number of consumers. It seems to me that all the improvement comes from dividing the input among the producers, and the tasks only add overhead. But again, I don't have any explanation for the behavior with one producer. Am I missing something in the OpenMP task logic? Am I doing something wrong?

-------------------------------------------------------------------------
|   producers   |   consumers   |   time        |
-------------------------------------------------------------------------
|       1       |       1       |   0.642935    |
|       1       |       2       |   3.004023    |
|       1       |       3       |   5.332524    |
|       1       |       4       |   7.222009    |
|       1       |       5       |   9.472093    |
|       1       |       6       |   10.372389   |
|       1       |       7       |   12.671839   |
|       1       |       8       |   14.631013   |
|       1       |       9       |   14.500603   |
|       1       |      10       |   18.034931   |
|       1       |      11       |   17.835978   |
-------------------------------------------------------------------------
|       2       |       2       |   0.357881    |
|       2       |       4       |   0.361383    |
|       2       |       6       |   0.362556    |
|       2       |       8       |   0.359722    |
|       2       |      10       |   0.358816    |
-------------------------------------------------------------------------

The main section of my code is as follows:

int main( int argc, char** argv) {

  // ... process the input (read from file, etc...)

  const char *buffer_start[numSeekers];
  int buffer_len[numSeekers];

  //populate these arrays dividing the input
  //I need to do this because I need to overlap the buffers for
  //correctness, so a simple parallel-for is not enough

  //Here is where I create the producers
  int num = 0;
  #pragma omp parallel for num_threads(numSeekers) reduction(+:num)
  for (int i = 0; i < numSeekers; i++) {
      num += seek(buffer_start[i], buffer_len[i]);
  }

  return num;
}

int seek(const char* buffer, int n){

  int num = 0;

  //assign the same number of consumers to each producer
  #pragma omp parallel num_threads(numCheckers/numSeekers) shared(num)
  {
    //only one thread per producer creates the tasks
    #pragma omp single
    {
      for (int pos = 0; pos < n; pos += STEP) {
        if (condition(buffer[pos])) {
          #pragma omp task shared(num)
          {
            //check() is a sequential function
            num += check(buffer[pos]);
          }
        }
      }
      #pragma omp taskwait
    }
  }

  return num;
}
asked Oct 23 '12 by ees_cu


1 Answer

The observed behaviour is due to the fact that you do not have nested parallel regions enabled. What happens is that in the first case you are actually experiencing the HUGE overhead of OpenMP tasks. This is most likely due to the fact that check() is not doing enough work compared to the overhead the OpenMP run-time introduces. Why does it behave like this with 1 and with 2 producers?

When running with just one producer, the outer parallel region executes with one thread only. Such parallel regions are inactive according to the OpenMP API specification and they just execute the code inside serially (the only overhead being an additional function call and accessing shared variables via pointers). In this case the inner parallel region, although nested while nested parallelism is disabled, becomes active (the nesting restriction does not apply because the enclosing region is inactive) and spawns a lot of tasks. Tasks introduce relatively high overhead and this overhead increases with the number of threads. With 1 consumer the inner parallel region is also inactive and hence runs serially without tasking overhead.

When running with two producers, the outer parallel region is active and hence the inner parallel region is rendered inactive (remember - no nested parallelism is enabled) and, as a consequence, no tasks are being created at all - seek() simply runs serially. There is no tasking overhead and the code runs almost twice as fast as the 1 producer / 1 consumer case. The run time does not depend on the number of consumers because the inner parallel region is always inactive, no matter how many threads are specified.
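One way to see this on your own system is to query the runtime from inside both regions: with nesting disabled, the inner region reports a team of one thread whenever the outer region is active. A minimal sketch (the diag() helper is only for illustration):

#include <omp.h>
#include <stdio.h>

// prints the runtime's view of the current nesting situation
static void diag(const char *where) {
   // omp_get_active_level() counts only regions executing with more than one thread
   printf("%s: level=%d, active levels=%d, team size=%d, nested enabled=%d\n",
          where, omp_get_level(), omp_get_active_level(),
          omp_get_num_threads(), omp_get_nested());
}

int main(void) {
   // with 2 outer threads the outer region is active, so the inner one stays
   // inactive (team size 1) unless nested parallelism is enabled; changing
   // num_threads(2) to num_threads(1) reproduces the one-producer case, in
   // which the inner region becomes active instead
   #pragma omp parallel num_threads(2)
   {
      #pragma omp single
      diag("outer");

      #pragma omp parallel num_threads(4)
      {
         #pragma omp single
         diag("inner");
      }
   }
   return 0;
}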

How big is the overhead that tasking and concurrent access to shared variables introduce? I've created a simple synthetic benchmark that executes the following code:

for (int i = 0; i < 10000000; i++) {
   ssum += sin(i*0.001);
}

Serially this executes for less than a second on a Westmere CPU with the default optimisation level of GCC 4.7.2. Then I've introduced tasks using simple atomic constructs to protect the access to the shared variable ssum:

#pragma omp parallel
{
   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         #pragma omp atomic
         ssum += sin(i*0.001);
      }
   }
}

(no need for taskwait here since there is a scheduling point at the implicit barrier at the end of the parallel region)

I've also created a more complex variant that performs reduction the same way that Massimiliano has proposed:

#define STRIDE 8   // pad ssumt[] so that each thread's slot falls in its own cache line

#pragma omp parallel
{
   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         const int idx = omp_get_thread_num();
         ssumt[idx*STRIDE] += sin(i*0.001);
      }
   }
   #pragma omp taskwait

   const int idx = omp_get_thread_num();
   #pragma omp atomic
   ssum += ssumt[idx*STRIDE];
}

The code was compiled with GCC 4.7.2 like this:

g++ -fopenmp -o test.exe test.cc

Running it in batch mode (so that no other processes could intervene) on a dual-socket Westmere system (12 cores in total), with different numbers of threads and different thread placements on the sockets, one obtains the following run times for both codes:

Configuration   ATOMIC       Reduction    ATOMIC slowdown
2 + 0            2.79 ±0.15   2.74 ±0.19   1.8%
1 + 1            2.72 ±0.21   2.51 ±0.22   8.4% <-----
6 + 0           10.14 ±0.01  10.12 ±0.01   0.2%
3 + 3           22.60 ±0.24  22.69 ±0.33  -0.4%
6 + 6           37.85 ±0.67  38.90 ±0.89  -2.7%

(run times are given in seconds as measured by omp_get_wtime() and averaged over 10 runs, with the standard deviation also shown; x + y in the Configuration column means x threads on the first socket and y threads on the second socket)
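For reference, a minimal timing harness in the spirit of the measurement above could look like this (run_benchmark() is just a stand-in for the code being measured; here it is the serial reference loop, and the average is taken over 10 runs as in the table):

#include <math.h>
#include <stdio.h>
#include <omp.h>

// stand-in for the code being measured; here the serial reference loop
static double run_benchmark(void) {
   double ssum = 0.0;
   for (int i = 0; i < 10000000; i++)
      ssum += sin(i*0.001);
   return ssum;
}

int main(void) {
   const int repetitions = 10;
   double total = 0.0, result = 0.0;

   for (int r = 0; r < repetitions; r++) {
      const double t0 = omp_get_wtime();
      result = run_benchmark();
      total += omp_get_wtime() - t0;
   }

   printf("result %f, average run time %f s\n", result, total / repetitions);
   return 0;
}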

As you can see, the overhead from the tasks is enormous. It is much higher than the overhead from using atomic instead of applying reduction to thread-private accumulators. Besides, the assignment part of an atomic += compiles to a locked compare-and-exchange instruction (LOCK CMPXCHG), which is not much more expensive than calling omp_get_thread_num() each time.
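If the work per task cannot be made heavier, a common way to cut this overhead down is to create one task per block of iterations instead of one per iteration, accumulating locally inside each task and doing a single atomic update per block. A sketch of the synthetic benchmark reworked along these lines (CHUNK is an illustrative grain size, not a value tuned on the machine above):

#include <math.h>
#include <stdio.h>

#define N     10000000
#define CHUNK 10000      // iterations per task; illustrative grain size

int main(void) {
   double ssum = 0.0;

   #pragma omp parallel
   {
      #pragma omp single
      for (int start = 0; start < N; start += CHUNK) {
         #pragma omp task
         {
            // accumulate a whole block locally, then do one atomic update
            double local = 0.0;
            const int end = (start + CHUNK < N) ? start + CHUNK : N;
            for (int i = start; i < end; i++)
               local += sin(i*0.001);

            #pragma omp atomic
            ssum += local;
         }
      }
      // the implicit barrier at the end of the parallel region waits for all tasks
   }

   printf("ssum = %f\n", ssum);
   return 0;
}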

One should also note that the dual-socket Westmere system is NUMA, since each CPU has its own memory and accesses to the memory of the other CPU go through the QPI link and hence incur increased latency (and possibly lower bandwidth). As the ssum variable is shared in the atomic case, threads that run on the second processor are essentially making remote requests. Still, the difference between both codes is negligible (except for the marked two-thread case, which I still have to investigate) and the atomic code even starts to outperform the one with the reduction when the number of threads gets higher.

On multiscale NUMA systems the synchronisation in the atomic approach might become more of a burden, since it adds locking overhead to the already slower remote accesses. One such system is one of our BCS-coupled nodes. BCS (Bull Coherence Switch) is a proprietary solution from Bull that uses XQPI (eXternal QPI) to link several Nehalem-EX boards together into a single system, introducing three levels of NUMA along the way (local memory; remote memory on the same board; remote memory on a remote board). When running on one such system, consisting of 4 boards with 4 octo-core Nehalem-EX CPUs each (128 cores in total), the atomic executable runs for 1036 s (!!), while the reduction approach runs for 1047 s, i.e. both still execute for about the same time (my previous statement that the atomic approach is 21.5% slower was due to OS services jitter during the measurement). Both numbers come from single runs and hence are not quite representative. Note that on this system the XQPI link introduces very high latency for inter-board QPI messages, so locking is very expensive - but apparently not that expensive.

Part of the overhead can be taken away by using reduction, but it has to be implemented properly. First, local copies of the reduction variable should also be local to the NUMA node where the thread executes and, second, one should find a way not to make calls to omp_get_thread_num(). These two goals can be achieved in many different ways, but the easiest one is simply to use threadprivate variables:

static double ssumt;
#pragma omp threadprivate(ssumt)

#pragma omp parallel
{
   ssumt = 0.0;

   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         ssumt += sin(i*0.001);
      }
   }
   #pragma omp taskwait

   #pragma omp atomic
   ssum += ssumt;
}

Access to ssumt needs no protection, as two tasks seldom execute simultaneously in the same thread (I have to further investigate whether this conforms to the OpenMP specs). This version of the code executes for 972 s. Once again this is not that far from 1036 s and comes from a single measurement only (i.e. it could simply be a statistical fluctuation), but in theory it should be faster.

Lessons to take home:

  • Read the OpenMP specification about nesting parallel regions. Nested parallelism is enabled either by setting the environment variable OMP_NESTED to true or by calling omp_set_nested(1);. If enabled, the number of active nesting levels can be controlled by OMP_MAX_ACTIVE_LEVELS, as pointed out by Massimiliano (a minimal setup sketch follows this list).
  • Look out for data races and try to prevent them using the simplest approach; using a more complex approach does not always give you better performance.
  • Special systems usually require special programming.
  • When in doubt, use a thread-checker tool if available. Intel has one (commercial) and Oracle's Solaris Studio (previously known as Sun Studio) has one too (free; it has a Linux version despite the product name).
  • Mind the overhead! Try to split the job in big enough chunks so that the overhead from having millions of tasks created does not negate the obtained parallel gain.
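To illustrate the first point, here is a minimal sketch of how nested parallelism could be enabled before entering the producer loop from the question (omp_set_nested(), omp_set_max_active_levels() and omp_set_dynamic() are standard OpenMP API calls; whether omp_set_dynamic(0) is needed depends on the runtime's defaults):

#include <omp.h>

int main(void) {
   omp_set_nested(1);             // same effect as setting OMP_NESTED=true
   omp_set_max_active_levels(2);  // one level for producers, one for consumers
   omp_set_dynamic(0);            // ask the runtime not to shrink the requested teams

   // ... the producer loop from the question would follow here:
   // #pragma omp parallel for num_threads(numSeekers) reduction(+:num)
   // ...
   return 0;
}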
answered by Hristo Iliev