I have been looking at a lock free single producer/single consumer circular buffer on this website when I couldn't figure out why a specific memory barrier was needed. I have carefully read a hundread time the standard rules about memory order but I don't understand what I'm missing.
With this implementation, there is only a unique thread which can call the push()
function and another unique thread which can call the pop()
function.
Here is the Producer
code:
bool push(const Element& item)
{
const auto current_tail = _tail.load(std::memory_order_relaxed); //(1)
const auto next_tail = increment(current_tail);
if(next_tail != _head.load(std::memory_order_acquire)) //(2)
{
_array[current_tail] = item; //(3)
_tail.store(next_tail, std::memory_order_release); //(4)
return true;
}
return false; // full queue
}
Here is the Consumer
code:
bool pop(Element& item)
{
const auto current_head = _head.load(std::memory_order_relaxed); //(1)
if(current_head == _tail.load(std::memory_order_acquire)) //(2)
return false; // empty queue
item = _array[current_head]; //(3)
_head.store(increment(current_head), std::memory_order_release); //(4)
return true;
}
I understand why the Producer (4)
and the Consumer (2)
statements are absolutly needed, this is because we have to make sure that all writes that happened before the (4) released store
by the Producer
will be visible side effects once the consumer
will see the stored value.
I also understand why the Consumer (4)
statement is needed, this is to make sure that the Consumer (3)
load will be performed before the Consumer (4)
store will be performed.
The question
Producer (2)
load needs to be performed with acquire semantic (instead of relaxed)? Is it to prevent Producer (3) or (4)
to be reodered before the condition (at compile time or at runtime)?we need prove that
_array[current_tail] = item; // push(3)
excecuted after conforming (current_head == current_tail
)
item = _array[current_head]; // pop(3)
is completed. we can overwrite cell, only after data from it already copied to item
the
_head.load(std::memory_order_acquire) // push(2)
synchronized with
_head.store(increment(current_head), std::memory_order_release); //pop(4)
via Release-Acquire ordering:
all memory writes ( pop(3) ) that happened-before the atomic store-release ( pop(4) ) on _head
become visible side-effects once the atomic load-acquire ( push(2) ) is completed on _head
.
so Producer code after push(2) is completed, guaranteed to see result of pop(3). this mean that data from _array[current_head]
is copied to item and result of this operation is visible for Producer code after push(2), so _array[current_head]
already free.
from another side from memory_order_acquire
load description - no reads or writes ( push(3) ) in the current thread can be reordered before this load. so push(3) will be execute already after push(2) load complete, but at this point pop(3) already completed
item = _array[current_head]; //pop(3)
_head.store(increment(current_head), std::memory_order_release); //pop(4)
-----
_head.load(std::memory_order_acquire); //push(2)
_array[current_tail] = item; //push(3)
The memory barriers prevent the CPU from reordering accesses to the Element
object, which are not using interlocks, across access to the queue structure (here implemented using indexes, but pointers are equally viable).
Using your numbering, it is essential that (3) is performed between (2) and (4), and a memory barrier provides that.
The exact case you ask about of (2)-vs-(3) in the Producer prevents speculatively overwriting valid data when the queue is full (the proposed site overlaps with valid data). Without the barrier, even though when the condition failed the original data would be restored from the perspective of the Producer thread, intermediate values might be briefly visible to the Consumer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With