Even for a simple 2-thread communication example, I am having difficulty expressing this in the C11 atomics and memory-fence style to obtain proper memory ordering:
shared data:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int flag;      /* 0: bucket is empty, 1: bucket holds data */
    volatile int bucket;

producer thread:

    while (true) {
        int value = producer_work();
        while (atomic_load_explicit(&flag, memory_order_acquire))
            ; /* busy wait until the consumer has emptied the bucket */
        bucket = value;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

consumer thread:

    while (true) {
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ; /* busy wait until the producer has filled the bucket */
        int data = bucket;
        atomic_thread_fence(/* memory_order ??? */);
        atomic_store_explicit(&flag, 0, memory_order_release);
        consumer_work(data);
    }
As far as I understand, the code above properly orders store-into-bucket -> flag-store -> flag-load -> load-from-bucket. However, I think there remains a race between the load from bucket and the producer's next write of new data into bucket. To force an order following the bucket read, I guess I would need an explicit atomic_thread_fence() between the bucket read and the following atomic_store. Unfortunately, there seems to be no memory_order argument that constrains preceding loads, not even memory_order_seq_cst.
A really dirty solution could be to re-assign bucket in the consumer thread with a dummy value, but that contradicts the consumer's read-only role.
In the older C99/GCC world I could use the traditional __sync_synchronize(), which I believe would be strong enough.
What would be the nicer C11-style solution to synchronize this so-called anti-dependency?
(Of course I am aware that I would be better off avoiding such low-level coding and using available higher-level constructs, but I would like to understand...)
To force an order following the bucket read, I guess I would need an explicit atomic_thread_fence() between the bucket read and the following atomic_store.
I do not believe the atomic_thread_fence() call is necessary: the flag update has release semantics, preventing any preceding load or store operations from being reordered across it. See the formal definition by Herb Sutter:
A write-release executes after all reads and writes by the same thread that precede it in program order.
This should prevent the read of bucket from being reordered to occur after the flag update, regardless of where the compiler chooses to store data.
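To make this concrete, here is the consumer loop from the question with the fence removed, annotated with the ordering each operation provides (a sketch; declarations as in the question):

    while (true) {
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ; /* acquire: the bucket read below cannot move before this load */
        int data = bucket;  /* ordinary read, ordered before the release below */
        atomic_store_explicit(&flag, 0, memory_order_release);
        consumer_work(data);
    }

The release store of 0 synchronizes-with the producer's acquire load that observes it, so the consumer's read of bucket happens-before the producer's next write to bucket.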
That brings me to your comment about another answer:

The volatile ensures that there are ld/st operations generated, which can subsequently be ordered with fences. However, data is a local variable, not volatile. The compiler will probably put it in a register, avoiding a store operation. That leaves the load from bucket to be ordered with the subsequent reset of flag.
It would seem that this is not an issue if the bucket read cannot be reordered past the flag write-release, so volatile should not be necessary (though it probably doesn't hurt to have it, either). It's also unnecessary because most function calls (in this case, atomic_store_explicit(&flag)) serve as compile-time memory barriers: the compiler will not reorder the read of a global variable past a non-inlined function call, because that function could modify the same variable.
I would also agree with @MaximYegorushkin that you could improve your busy-waiting with pause instructions when targeting compatible architectures. GCC and ICC both appear to have a _mm_pause(void) intrinsic (probably equivalent to __asm__("pause")).
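As a sketch, the consumer's wait loop could then look like this (assuming x86 and the declarations from the question; _mm_pause comes from <immintrin.h> on GCC/ICC):

    #include <immintrin.h>  /* _mm_pause */

    while (!atomic_load_explicit(&flag, memory_order_acquire))
        _mm_pause();        /* hint to the CPU that this is a spin-wait loop */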
I agree with what @MikeStrobel says in his comment. You don't need atomic_thread_fence() here because your critical sections start with an acquire and end with a release. Hence, reads within a critical section cannot be reordered before the acquire, nor writes after the release. This is also why volatile is unnecessary here.
In addition, I don't see a reason not to use a (pthread) spinlock here instead. A spinlock does the same busy spin for you, but it also uses the pause instruction:
The pause intrinsic is used in spin-wait loops with the processors implementing dynamic execution (especially out-of-order execution). In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain. The execution of the next instruction is delayed for an implementation-specific amount of time. The PAUSE instruction does not modify the architectural state. For dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin-loop.
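For comparison, here is a minimal sketch of the same handoff built on a pthread spinlock. The full flag and the retry loops are my own framing of the question's protocol, and producer_work()/consumer_work() are the question's helpers; error handling is omitted:

    #include <pthread.h>
    #include <stdbool.h>

    extern int  producer_work(void);   /* from the question */
    extern void consumer_work(int);    /* from the question */

    pthread_spinlock_t lock;  /* pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE) once at startup */
    int  bucket;
    bool full;                /* protected by lock: true while bucket holds unconsumed data */

    void producer_thread(void) {
        for (;;) {
            int value = producer_work();
            for (;;) {
                pthread_spin_lock(&lock);
                if (!full) {                    /* bucket free: hand over the value */
                    bucket = value;
                    full = true;
                    pthread_spin_unlock(&lock);
                    break;
                }
                pthread_spin_unlock(&lock);     /* still full: release and retry */
            }
        }
    }

    void consumer_thread(void) {
        for (;;) {
            int data;
            for (;;) {
                pthread_spin_lock(&lock);
                if (full) {                     /* bucket filled: take the value */
                    data = bucket;
                    full = false;
                    pthread_spin_unlock(&lock);
                    break;
                }
                pthread_spin_unlock(&lock);     /* still empty: release and retry */
            }
            consumer_work(data);
        }
    }

The lock/unlock pairs supply the same acquire/release ordering discussed above, and, per the quote, the spinlock's internal busy spin issues pause while it waits.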