
Is this approach of barriers right?

I have found that pthread_barrier_wait is quite slow, so in one place in my code I replaced pthread_barrier_wait with my own version of a barrier (my_barrier), which uses an atomic variable. I found it to be much faster than pthread_barrier_wait. Is there any flaw in this approach? Is it correct? Also, why is it faster than pthread_barrier_wait? Any clue?

EDIT

  • I am primarily interested in cases where there are equal number of threads as cores.

    atomic<int> thread_count(0);  // note: "atomic<int> x = 0;" does not compile before C++17
    
    void my_barrier()
    {     
      thread_count++;
    
      // spin until the number of arrived threads is a multiple of NUM_OF_THREADS
      while( thread_count % NUM_OF_THREADS )
       sched_yield();
    }
    
MetallicPriest Avatar asked Jan 26 '12 14:01




3 Answers

Your barrier implementation does not work, at least not if the barrier will be used more than once. Consider this case:

  1. NUM_OF_THREADS-1 threads are waiting at the barrier, spinning.
  2. Last thread arrives and passes through the barrier.
  3. Last thread exits barrier, continues processing, finishes its next task, and reenters the barrier wait.
  4. Only now do the other waiting threads get scheduled, and they can't exit the barrier because the counter was incremented again. Deadlock.

In addition, one often-overlooked but nasty issue with dynamically allocated barriers is destroying/freeing them. You'd like any one of the threads to be able to perform the destroy/free after the barrier wait returns, as long as you know nobody will try to wait on it again. But this requires making sure all waiters have finished touching memory in the barrier object before any waiters wake up - not an easy problem to solve. See my past questions on implementing barriers...

How can barriers be destroyable as soon as pthread_barrier_wait returns?

Can a correct fail-safe process-shared barrier be implemented on Linux?

And unless you know you have a special case where none of the difficult problems apply, don't try implementing your own for an application.
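
For reference, the standard API already designates a cleanup thread for this: POSIX guarantees that exactly one waiter gets PTHREAD_BARRIER_SERIAL_THREAD back from pthread_barrier_wait. A short sketch of that pattern (my illustration, assuming the barrier is only used for one round):

```cpp
#include <pthread.h>
#include <cstddef>

// Exactly one of the waiters receives PTHREAD_BARRIER_SERIAL_THREAD, so it
// can safely destroy the barrier - provided no thread will wait on it again.
void* worker(void* arg) {
    pthread_barrier_t* bar = static_cast<pthread_barrier_t*>(arg);
    int rc = pthread_barrier_wait(bar);
    if (rc == PTHREAD_BARRIER_SERIAL_THREAD)
        pthread_barrier_destroy(bar);  // the single designated thread cleans up
    return NULL;
}
```

Making this destroy-after-wait pattern actually safe inside a barrier implementation is exactly the hard part discussed in the linked questions.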

R.. GitHub STOP HELPING ICE Avatar answered Sep 29 '22 21:09



AFAICT it's correct, and it looks like it's faster, but in the highly contended case it'll be a lot worse. The highly contended case is when you have lots of threads, way more than CPUs.

There's a way to make fast barriers though, using eventcounts (search for the term on Google).

struct barrier {
    atomic<int>       count;   // initialized to the number of threads
    struct eventcount ec;      // eventcount API here is pseudocode
};

void my_barrier_wait(struct barrier *b)
{
    eventcount_key_t key;

    if (--b->count == 0) {
        eventcount_broadcast(&b->ec);
        return;
    }
    for (;;) {
        key = eventcount_get(&b->ec);
        if (!b->count)
            return;
        eventcount_wait(&b->ec, key);  // sleeps only if nothing changed since 'key'
    }
}

This should scale way better.

Though frankly, when you use barriers, I don't think performance matters much; a barrier is not supposed to be an operation that needs to be fast. This looks a lot like premature optimization.

Pierre Habouzit Avatar answered Sep 29 '22 21:09



Your barrier should be correct from what I can see, as long as you don't use the barrier too often or your thread number is a power of two. Theoretically your atomic will overflow somewhere (after hundreds of millions of uses for typical core counts, but still), so you might want to add some functionality to reset it somewhere.

Now to why it is faster: I'm not entirely sure, but I think pthread_barrier_wait will let the thread sleep until it is time to wake up. Yours spins on the condition, yielding in each iteration. However, if there is no other application/thread that needs the processing time, the thread will likely be scheduled again directly after the yield, so the wait time is shorter. At least that's what playing around with that kind of barrier seemed to indicate on my system.

As a side note: since you use atomic<int> I assume you use C++11. Wouldn't it make sense to use std::this_thread::yield() instead of sched_yield() in that case to remove the dependency on pthreads?

This link might also be interesting for you; it measures the performance of various barrier implementations (yours is roughly the lock xadd + while(i<NCPU) case, except for the yielding).

Grizzly Avatar answered Sep 29 '22 21:09
