Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance of concurrent code using dispatch_group_async is MUCH slower than single-threaded version

I've been doing some experimentation lately on using large numbers of random numbers to generate "normal distribution" bell curves.

The approach is simple:

  • Create an array of integers and zero it out. (I'm using 2001 integers)
  • Repeatedly calculate indexes in this array and index that entry in the array, as follows
    • Loop either 999 or 1000 times. On each iteration:
      • Seed an array index with the center value (1000)
      • Generate a random number = +1/-1. and add it to the array index
      • at the end of the loop, increment the value at the calculated array index.

Since random values 0/1 tend to occur about as frequently, the end index value from the inner loop above tend to stay close to the center value. Index values much larger/smaller than the starting value are increasingly unusual.

After a large number of repetitions, the values in the array take on the shape of a normal distribution bell curve. However, the high-quality random function arc4random_uniform() that I'm using is fairly slow, and it takes a LOT of iterations to generate a smooth curve.

I wanted to plot 1,000,000,000 (one billion) points. Running on the main thread, that takes about 16 hours.

I decided to rewrite the calculation code to use dispatch_async, and run it on my 8-core Mac Pro.

I ended up using dispatch_group_async() to submit 8 blocks, with a dispatch_group_notify() to notify the program when all the blocks have finished processing.

For simplicity on the first pass, all 8 blocks write to the same array of NSUInteger values. There is a small chance of a race condition on a read/modify write to one of the array entries, but in that case, that would simply cause one value to be lost. I was planning on adding a lock to the array increment later (or perhaps even creating separate arrays in each block, and then summing them after.)

Anyway, I refactored the code to use dispatch_group_async() and calculate 1/8 of the total values in each block, and set my code off to run. To my utter befuddlement, the concurrent code, while it maxes out all of the cores on my Mac, runs MUCH slower than the single-threaded code.

When run on a single thread, I get about 17,800 points plotted per second. When run using dispatch_group_async, the performance drops to more like 665 points/second, or about 1/26 as fast. I've varied the number of blocks I submit - 2, 4, or 8, it doesn't matter. Performance is awful. I've also tried simply submitting all 8 blocks using dispatch_async with no dispatch_group. That doesn't matter either.

There's currently no blocking/locking going on: All the blocks run at full speed. I am utterly mystified as to why the concurrent code runs slower.

The code is a little muddled now because I refactored it to work either single-threaded or concurrently so I could test.

Here's the code that runs the calculations:

randCount = 2;
#define K_USE_ASYNC 1

#if K_USE_ASYNC
    dispatch_queue_t highQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
    //dispatch_group_async

    dispatch_group_t aGroup = dispatch_group_create();
    int totalJobs = 8;
    for (int i = 0; i<totalJobs; i++)
    {
      dispatch_group_async(aGroup,
                           highQ,
                           ^{
                             [self calculateNArrayPoints: KNumberOfEntries /totalJobs
                                          usingRandCount: randCount];
                           });
    }


    dispatch_group_notify(aGroup,
                          dispatch_get_main_queue(),
                          allTasksDoneBlock
                          );
#else
    [self calculateNArrayPoints: KNumberOfEntries
                 usingRandCount: randCount];
    allTasksDoneBlock();
#endif

And the common calculation method, which is used by both the single-threaded and the concurrent version:

+ (void) calculateNArrayPoints: (NSInteger) pointCount usingRandCount: (int) randCount;
{
  int entry;
  int random_index;

  for (entry =0; entry<pointCount; entry++)
  {
    static int processed = 0;
    if (entry != 0 && entry%100000 == 0)
    {
      [self addTotTotalProcessed: processed];
      processed = 0;
    }

    //Start with a value of 1000 (center value)
    int value = 0;

    //For each entry, add +/- 1 to the value 1000 times.
    int limit  = KPinCount;
    if (randCount==2)
      if (arc4random_uniform(2) !=0)
        limit--;
    for (random_index = 0; random_index<limit; random_index++)
    {
      int random_value = arc4random_uniform(randCount);
      /*
       if 0, value--
       if 1, no change
       if 2, value++
       */
      if (random_value == 0)
        value--;
      else if (random_value == randCount-1)
        value++;
    }
    value += 1000;
    _bellCurveData[value] += 1;
    //printf("\n\nfinal value = %d\n", value);
    processed++;
  }
}

This is a quick-and-dirty learning project. It runs on both Mac and iOS, so it uses a shared utilities class. The utilities class is nothing but class methods. There is no instance of the utilities method created. It has an embarrassing number of global variables. If I end up doing anything useful with the code, I'll refactor it to create a utilities singleton, and convert all the globals to instance variables on the singleton.

For now, it works, and the hideous use of globals doesn't affect the outcome, so I'm leaving it.

The code that uses the "processed" variable is just a way of figuring out how many points have been calculated when run in concurrent mode. I added that code after I discovered the horrid performance of the concurrent version, so I'm confident it isn't a cause of the slowdown.

I'm stumped here. I've written a fair amount of concurrent code, and this task is an "embarrassingly parallel" problem, so there's no reason it shouldn't run at full tilt on all available cores.

Does anybody else see anything that might cause this, or have any other insights to offer?

like image 571
Duncan C Avatar asked Jan 11 '23 14:01

Duncan C


1 Answers

arc4random uses a critical section while modifying its state. The critical section is super-fast in the non-contended case (when changing from unlocked to locked), but in the contended case (when trying to lock a mutex that's already locked) it has to call the operating system and put the thread to sleep, which decreases performance much.

u_int32_t
arc4random()
{
    u_int32_t rnd;

    THREAD_LOCK();
    arc4_check_init();
    arc4_check_stir();
    rnd = arc4_getword(&rs);
    THREAD_UNLOCK();

    return (rnd);
}

where THREAD_LOCK() is defined as

#define THREAD_LOCK()                       \
    do {                            \
        if (__isthreaded)               \
            _pthread_mutex_lock(&arc4random_mtx);   \
    } while (0)

Source: Arc4 random number generator for OpenBSD

to make it faster

you could create a Arc4Random class that is a wrapper around the static arc4_* functions from arc4random.c . Then you have a random-number generator that is no longer threadsafe, but you could create one generator for each thread.

like image 145
Michael Avatar answered Jan 18 '23 23:01

Michael