I have found that pthread_barrier_wait is quite slow, so in one place in my code I replaced pthread_barrier_wait with my own version of a barrier (my_barrier), which uses an atomic variable. I found it to be much faster than pthread_barrier_wait. Is there any flaw in using this approach? Is it correct? Also, I don't know why it is faster than pthread_barrier_wait - any clue?
I am primarily interested in cases where the number of threads equals the number of cores.
atomic<int> thread_count = 0;

void my_barrier()
{
    thread_count++;
    while (thread_count % NUM_OF_THREADS)
        sched_yield();
}
Your barrier implementation does not work, at least not if the barrier will be used more than once. Consider this case:

1. NUM_OF_THREADS-1 threads are waiting at the barrier, spinning.
2. The last thread arrives, increments thread_count to a multiple of NUM_OF_THREADS, and passes straight through.
3. Before the spinning threads get scheduled and observe that the remainder is zero, that thread re-enters the barrier for the next round and increments thread_count again.
4. The remainder is now nonzero again, and the only increments that could bring it back to zero must come from the threads still stuck spinning on the previous round. Deadlock.

In addition, one often-overlooked but nasty issue to deal with when using dynamically allocated barriers is destroying/freeing them. You'd like any one of the threads to be able to perform the destroy/free after the barrier wait returns, as long as you know nobody will be trying to wait on it again, but this requires making sure all waiters have finished touching memory in the barrier object before any waiters wake up - not an easy problem to solve. See my past questions on implementing barriers...
How can barriers be destroyable as soon as pthread_barrier_wait returns?
Can a correct fail-safe process-shared barrier be implemented on Linux?
And unless you know you have a special case where none of the difficult problems apply, don't try implementing your own for an application.
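For illustration only, here is a minimal sketch (not part of the original answer) of the usual way to make a spinning barrier safely reusable: spin on a per-round generation counter instead of on the raw arrival count, so a thread that races ahead into the next round cannot confuse threads still waiting on the previous one. It assumes NUM_OF_THREADS is fixed at startup and does nothing about the destruction problem above.

#include <atomic>
#include <sched.h>

constexpr int NUM_OF_THREADS = 4;    // assumption: fixed thread count

std::atomic<int> arrived{0};         // threads that have reached the current round
std::atomic<int> generation{0};      // bumped once per completed round

void my_barrier()
{
    int gen = generation.load();
    if (arrived.fetch_add(1) + 1 == NUM_OF_THREADS) {
        arrived.store(0);            // reset for the next round first...
        generation.fetch_add(1);     // ...then release everyone waiting on this round
    } else {
        while (generation.load() == gen)
            sched_yield();           // wait for the last arriver to bump the generation
    }
}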
AFAICT it's correct, and it looks like it's faster, but in the highly contended case it'll be a lot worse - the highly contended case being when you have lots of threads, way more than CPUs.
There's a way to make fast barriers though, using eventcounts (look them up on Google).
/* Pseudocode: the eventcount_* calls stand for a hypothetical eventcount API. */
struct barrier {
    atomic<int> count;            /* initialised to the number of threads */
    struct eventcount ec;
};

void my_barrier_wait(struct barrier *b)
{
    eventcount_key_t key;

    if (--b->count == 0) {
        /* Last thread to arrive: wake everyone blocked on the eventcount. */
        eventcount_broadcast(&b->ec);
        return;
    }

    for (;;) {
        /* Grab a key, re-check the condition, then sleep on that key. */
        key = eventcount_get(&b->ec);
        if (!b->count)
            return;
        eventcount_wait(&b->ec, key);
    }
}
This should scale way better.
Though frankly, when you use barriers, I don't think performance matters much - it's not supposed to be an operation that needs to be fast. This looks a lot like premature optimization.
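The eventcount calls above are pseudocode for a hypothetical API. A rough standard-C++ analogue of the same "read a value, re-check the condition, then block until the value changes" idea, assuming C++20's std::atomic wait/notify, could look like this single-use sketch (the loaded value plays the role of the eventcount key):

#include <atomic>

struct blocking_barrier {
    std::atomic<int> count;                 // initialise to the number of threads
    explicit blocking_barrier(int n) : count(n) {}

    void wait()
    {
        if (count.fetch_sub(1) == 1) {      // last thread to arrive
            count.notify_all();             // wake every blocked waiter
            return;
        }
        for (;;) {
            int c = count.load();           // the "key": the value we observed
            if (c == 0)
                return;
            count.wait(c);                  // block until count no longer equals c
        }
    }
};

(C++20 also provides std::barrier, which handles wake-up and reuse for you; for a real application that, or pthread_barrier_t, is usually the better choice.)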
Your barrier should be correct from what I can see, as long as you don't use the barrier too often or your thread count is a power of two. Theoretically the atomic will overflow at some point (after hundreds of millions of uses for typical core counts, but still), so you might want to add some functionality to reset it somewhere.
Now to why it is faster: I'm not entirely sure, but I think pthread_barrier_wait will let the thread sleep till it is time to wake up. Yours spins on the condition, yielding in each iteration. However, if there is no other application/thread that needs the processing time, the thread will likely be scheduled again directly after the yield, so the wait time is shorter. At least that's what playing around with that kind of barrier seemed to indicate on my system.
As a side note: since you use atomic<int>, I assume you use C++11. Wouldn't it make sense to use std::this_thread::yield() instead of sched_yield() in that case, to remove the dependency on pthreads?
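For completeness, a sketch of the barrier from the question with the pthread dependency removed, assuming C++11; the semantics (and therefore the reuse and overflow caveats discussed above) are unchanged:

#include <atomic>
#include <thread>

constexpr int NUM_OF_THREADS = 4;      // assumption: fixed at startup
std::atomic<int> thread_count{0};

void my_barrier()
{
    thread_count++;
    while (thread_count % NUM_OF_THREADS)
        std::this_thread::yield();     // standard-library yield instead of sched_yield()
}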
This link might also be interesting for you; it measures the performance of various barrier implementations (yours is roughly the lock xadd + while(i < NCPU) case, except for the yielding).