Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

std::memory_order_relaxed atomicity with respect to the same atomic variable

The cppreference documentation about memory orders says

Typical use for relaxed memory ordering is incrementing counters, such as the reference counters of std::shared_ptr, since this only requires atomicity, but not ordering or synchronization (note that decrementing the shared_ptr counters requires acquire-release synchronization with the destructor)

Does this mean that relaxed memory ordering don't actually result in atomicity with respect to the same variable? But rather just results in eventual consistency with respect to other relaxed memory loads and/or compare_exchanges? Using std::memory_order_seq_cst would be the only way to see consistent results when paired with std::memory_order_relaxed?

I was under the assumption that std::memory_order_relaxed is still atomic with respect to the same variable but does not provide any other constraints about loads and stores with respect to other data.

like image 587
Curious Avatar asked Jan 06 '18 03:01

Curious


1 Answers

You are asking a few things, but I'll focus on the ordering constraints used by a typical shared_ptr implementation because I think that covers the key part of your question.

An atomic operation is always atomic with respect to the variable (or POD) it applies to; modifications to a single variable will become visible to all threads in a consistent order.
The way relaxed atomic operations work is described in your question:

std::memory_order_relaxed is still atomic with respect to the same variable but does not provide any other constraints about loads and stores with respect to other data

The following are 2 typical scenario's whereby ordering constraints on an atomic operation can be omitted (i.e. by using std::memory_order_relaxed):

  1. Memory ordering is not necessary because there are no dependencies on other operations, or as a commenter puts it, (..) is not part of an invariant involving other memory locations.

    A common example is an atomic counter, incremented by multiple threads to keep track of the number of times a particular event has occurred. The increment operation (fetch_add) can be relaxed if the counter represents a value that has no dependency on other operations.
    I find the example given by cppreference not very convincing because the shared_ptr reference count does have a dependency; i.e. memory is deleted once its value becomes zero. A better example is a web server keeping track of the number of incoming requests only for reporting purposes.

  2. Memory ordering is necessary, but there is no need to use ordering constraints because the required synchronization has already passed (IMO this explains better why shared_ptr's reference count increment can be relaxed, see example below).
    The shared_ptr copy/move constructor can only be called while it has a synchronized view of the (reference to the) copied/moved-from instance (or it would be undefined behavior) and as such, no additional ordering is necessary.

The following example illustrates how memory ordering is typically used by a shared_ptr implementation to modify its reference count. Assume that all threads run in parallel after sp_main has been released (the shared_ptr reference count is then 10).

int main()
{
    std::vector<std::thread> v;
    auto sp_main = std::make_shared<int>(0);

    for (int i = 1; i <= 10; ++i)
    {
        // sp_main is passed by value
        v.push_back(thread{thread_func, sp_main, i});
    }

    sp_main.reset();

    for (auto &t : v)  t.join();
}

void thread_func(std::shared_ptr<int> sp, int n)
{
    // 10 threads are created

    if (n == 7)
    {
        // Only thread #7 modifies the integer
        *sp = 42;
    }

    // The only thead with a synchronized view of the managed integer is #7
    // All other threads cannot read/write access the integer without causing a race

    // 'sp' going out of scope -> destructor called
}

Thread creation guarantees an (inter-thread) happens-before relationship between make_shared (in main) and sp's copy/move-constructor (inside each thread). Therefore, shared_ptr's constructor has a synchronized view of memory and can safely increment ref_count with no additional ordering:

ctrlblk->ref_count.fetch_add(1, std::memory_order_relaxed);

For the destruction part, since only thread #7 writes to the shared integer, the others 9 threads are not allowed to access the same memory location without causing a race. This creates a problem because all threads are destructed at about the same time (assume reset in main has been called earlier) and only one thread is going to delete the shared integer (the one decrementing ref_count from 1 to 0).
It is imperative that the last thread has a synchronized memory view before it deletes the integer, but since 9 out of 10 threads do not have a synchronized view, additional ordering is necessary.

The destructor may contain something like:

if (ctrlblk->ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
{
    // delete managed memory
}

The atomic ref_count has a single modification order and therefore all atomic modifications occur in some order. Let's say the threads (in this example) that perform the last 3 decrements on ref_count are thread #7 (3 → 2), #5 (2 → 1) and #3 (1 → 0). Both decrements performed by threads #7 and #5 come earlier in the modification order than the one performed by #3.
The release sequence becomes:

#7 (store release) → #5 (read-modify-write, no ordering required) → #3 (load acquire)

The end result is that the release operation performed by thread #7 has synchronized with the acquire operation performed by #3 and the integer modification (by #7) is guaranteed to have happened before the integer destruction (by #3).

Technically, only the threads that have accessed the managed memory location have to perform a release operation, but since a library implementer does not know about thread actions, all threads perform a release operation upon destruction.

For the final destruction of shared memory, technically only the last thread needs to perform an acquire operation and therefore a shared_ptr library implementer could optimize by setting a standalone fence that is only called by the last thread.

if (ctrlblk->ref_count.fetch_sub(1, std::memory_order_release) == 1)
{
    std::atomic_thread_fence(std::memory_order_acquire);

    // delete managed memory
}
like image 197
LWimsey Avatar answered Nov 14 '22 21:11

LWimsey