I have found that pthread_barrier_wait is quite slow, so in one place in my code I replaced pthread_barrier_wait with my own version of a barrier (my_barrier), which uses an atomic variable. I found it to be much faster than pthread_barrier_wait. Is there any flaw in using this approach? Is it correct? Also, I don't know why it is faster than pthread_barrier_wait - any clue?
I am primarily interested in cases where the number of threads equals the number of cores.
atomic<int> thread_count = 0;

void my_barrier()
{
    thread_count++;
    while (thread_count % NUM_OF_THREADS)
        sched_yield();
}
Your barrier implementation does not work, at least not if the barrier will be used more than once. Consider this case:

1. NUM_OF_THREADS-1 threads are waiting at the barrier, spinning.
2. The last thread arrives, increments thread_count to a multiple of NUM_OF_THREADS, and passes straight through.
3. Before the spinning threads get scheduled and observe that the remainder is zero, that thread re-enters the barrier for the next round and increments thread_count again.
4. The remainder is now nonzero again, and the only increments that could bring it back to zero must come from the threads still stuck spinning on the previous round. Deadlock.

In addition, one often-overlooked but nasty issue to deal with when using dynamically allocated barriers is destroying/freeing them. You'd like any one of the threads to be able to perform the destroy/free after the barrier wait returns, as long as you know nobody will be trying to wait on it again, but this requires making sure all waiters have finished touching memory in the barrier object before any waiters wake up - not an easy problem to solve. See my past questions on implementing barriers...
How can barriers be destroyable as soon as pthread_barrier_wait returns?
Can a correct fail-safe process-shared barrier be implemented on Linux?
And unless you know you have a special case where none of the difficult problems apply, don't try implementing your own for an application.
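For illustration only, here is a minimal sketch (not part of the original answer) of the usual way to make a spinning barrier safely reusable: spin on a per-round generation counter instead of on the raw arrival count, so a thread that races ahead into the next round cannot confuse threads still waiting on the previous one. It assumes NUM_OF_THREADS is fixed at startup and does nothing about the destruction problem above.

#include <atomic>
#include <sched.h>

constexpr int NUM_OF_THREADS = 4;    // assumption: fixed thread count

std::atomic<int> arrived{0};         // threads that have reached the current round
std::atomic<int> generation{0};      // bumped once per completed round

void my_barrier()
{
    int gen = generation.load();
    if (arrived.fetch_add(1) + 1 == NUM_OF_THREADS) {
        arrived.store(0);            // reset for the next round first...
        generation.fetch_add(1);     // ...then release everyone waiting on this round
    } else {
        while (generation.load() == gen)
            sched_yield();           // wait for the last arriver to bump the generation
    }
}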
AFAICT it's correct, and it looks like it's faster, but in the highly contended case it'll be a lot worse - the highly contended case being when you have lots of threads, way more than CPUs.
There's a way to make fast barriers though, using eventcounts (look them up on Google).
/* Pseudocode: the eventcount_* calls stand for a hypothetical eventcount API. */
struct barrier {
    atomic<int> count;            /* initialised to the number of threads */
    struct eventcount ec;
};

void my_barrier_wait(struct barrier *b)
{
    eventcount_key_t key;

    if (--b->count == 0) {
        /* Last thread to arrive: wake everyone blocked on the eventcount. */
        eventcount_broadcast(&b->ec);
        return;
    }

    for (;;) {
        /* Grab a key, re-check the condition, then sleep on that key. */
        key = eventcount_get(&b->ec);
        if (!b->count)
            return;
        eventcount_wait(&b->ec, key);
    }
}
This should scale way better.
Though frankly, when you use barriers, I don't think performance matters much - it's not supposed to be an operation that needs to be fast. This looks a lot like premature optimization.
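The eventcount calls above are pseudocode for a hypothetical API. A rough standard-C++ analogue of the same "read a value, re-check the condition, then block until the value changes" idea, assuming C++20's std::atomic wait/notify, could look like this single-use sketch (the loaded value plays the role of the eventcount key):

#include <atomic>

struct blocking_barrier {
    std::atomic<int> count;                 // initialise to the number of threads
    explicit blocking_barrier(int n) : count(n) {}

    void wait()
    {
        if (count.fetch_sub(1) == 1) {      // last thread to arrive
            count.notify_all();             // wake every blocked waiter
            return;
        }
        for (;;) {
            int c = count.load();           // the "key": the value we observed
            if (c == 0)
                return;
            count.wait(c);                  // block until count no longer equals c
        }
    }
};

(C++20 also provides std::barrier, which handles wake-up and reuse for you; for a real application that, or pthread_barrier_t, is usually the better choice.)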
Your barrier should be correct from what I can see, as long as you don't use the barrier too often or your thread count is a power of two. Theoretically the atomic will overflow at some point (after hundreds of millions of uses for typical core counts, but still), so you might want to add some functionality to reset it somewhere.
Now to why it is faster: I'm not entirely sure, but I think pthread_barrier_wait will let the thread sleep till it is time to wake up. Yours spins on the condition, yielding in each iteration. However, if there is no other application/thread that needs the processing time, the thread will likely be scheduled again directly after the yield, so the wait time is shorter. At least that's what playing around with that kind of barrier seemed to indicate on my system.
As a side note: since you use atomic<int>, I assume you use C++11. Wouldn't it make sense to use std::this_thread::yield() instead of sched_yield() in that case, to remove the dependency on pthreads?
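For completeness, a sketch of the barrier from the question with the pthread dependency removed, assuming C++11; the semantics (and therefore the reuse and overflow caveats discussed above) are unchanged:

#include <atomic>
#include <thread>

constexpr int NUM_OF_THREADS = 4;      // assumption: fixed at startup
std::atomic<int> thread_count{0};

void my_barrier()
{
    thread_count++;
    while (thread_count % NUM_OF_THREADS)
        std::this_thread::yield();     // standard-library yield instead of sched_yield()
}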
This link might also be interesting for you; it measures the performance of various barrier implementations (yours is roughly the lock xadd + while(i < NCPU) case, except for the yielding).