Given the following sample, which intends to wait until another thread stores 42 in a shared variable shared, without locks and without waiting for thread termination: why would volatile T or std::atomic<T> be required or recommended to guarantee concurrency correctness?
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
int main()
{
    int64_t shared = 0;
    std::thread thread([&shared]() {
        shared = 42;
    });
    while (shared != 42) {
    }
    assert(shared == 42);
    thread.join();
    return 0;
}
With GCC 4.8.5 and default options, the sample works as expected.
So no, don't use volatile; use std::atomic with std::memory_order_acquire and std::memory_order_release.
Firstly, volatile does not imply atomic access. It is designed for things like memory-mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.
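As an aside, here is the kind of use volatile is designed for: a minimal sketch polling a memory-mapped status register that changes outside the program's control. The address and register layout are made up for illustration.

#include <cstdint>

// Hypothetical memory-mapped device status register; the address is
// invented for this sketch and must not be dereferenced on a real host.
volatile std::uint32_t* const status_reg =
    reinterpret_cast<volatile std::uint32_t*>(0x40000000);

void wait_for_device()
{
    // volatile forces an actual read on every iteration; the compiler
    // may not cache *status_reg in a register.
    while ((*status_reg & 0x1) == 0) {
    }
}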
The test seems to indicate that the sample is correct but it is not. Similar code could easily end up in production and might even run flawlessly for years.
We can start off by compiling the sample with -O3. Now, the sample hangs indefinitely. (The default is -O0: no optimization / debug-consistency, which is somewhat similar to making every variable volatile. That is why the test didn't reveal the code as unsafe.)
To get to the root cause, we have to inspect the generated assembly. First, the GCC 4.8.5 -O0 x86_64 assembly corresponding to the unoptimized, working binary:
// Thread B:
// shared = 42;
movq -8(%rbp), %rax
movq (%rax), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
.L11:
movq -32(%rbp), %rax # Check shared every iteration
cmpq $42, %rax
jne .L11
Thread B executes a simple store of the value 42 to shared. Thread A reads shared on each loop iteration until the comparison indicates equality.
Now, we compare that to the -O3 outcome:
// Thread B:
// shared = 42;
movq 8(%rdi), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
cmpq $42, (%rsp) # check shared once
je .L87 # and skip the infinite loop or not
.L88:
jmp .L88 # infinite loop
.L87:
Optimizations associated with -O3 replaced the loop with a single comparison and, if not equal, an infinite loop to match the expected behavior. With GCC 10.2, the loop is optimized out. (Unlike in C, infinite loops with no side effects or volatile accesses are undefined behaviour in C++.)
The problem is that the compiler and its optimizer are not aware of the program's concurrency implications. Consequently, the optimizer concludes that shared cannot change in thread A: the loop is equivalent to dead code. To put it another way: data races are UB, and the optimizer is allowed to assume that the program doesn't encounter UB. If you're reading a non-atomic variable, that must mean nobody else is writing it. This is what allows compilers to hoist loads out of loops, and similarly sink stores, which are very valuable optimizations for the normal case of non-shared variables.
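To make the hoisting concrete, here is a sketch (not actual compiler output) of the transformation the optimizer is allowed to perform on the waiting loop:

// Original source:
//     while (shared != 42) {
//     }
// What the optimizer may turn it into, since nothing in the loop body
// can legally change the non-atomic shared:
if (shared != 42) {  // load shared once, hoisted out of the loop
    for (;;) {       // spin forever without touching memory again
    }
}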
The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that may be volatile. While the actual meaning of volatile varies across compilers, and guarantees, if any, are compiler-specific, the general consensus is that volatile prevents the compiler from caching volatile accesses in registers: every read and write in the source must actually be performed. This is essential for low-level code that interacts with hardware, and it has its place in concurrent programming, albeit with a downward trend due to the introduction of std::atomic.
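Concretely, the only change to the sample is the declaration of shared:

volatile int64_t shared = 0;  // every access must now actually touch memory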
With volatile int64_t shared, the generated instructions change as follows:
// Thread B:
// shared = 42;
movq 24(%rdi), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
.L87:
movq 8(%rsp), %rax
cmpq $42, %rax
jne .L87
The loop cannot be eliminated anymore, as the compiler must assume that shared may have changed even though there is no evidence of that in the code it can see. As a result, the sample now works with -O3.
If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant for lock-free code make std::atomic essential: memory operation atomicity and memory order.
To build the case for load/store atomicity, we review the assembly generated with GCC 4.8.5 -O3 -m32 (the 32-bit version) for volatile int64_t shared:
// Thread B:
// shared = 42;
movl 4(%esp), %eax
movl 12(%eax), %eax
movl $42, (%eax)
movl $0, 4(%eax)
// Thread A:
// while (shared != 42) {
// }
.L88: # do {
movl 40(%esp), %eax
movl 44(%esp), %edx
xorl $42, %eax
movl %eax, %ecx
orl %edx, %ecx
jne .L88 # } while(shared ^ 42 != 0);
For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, it means that another thread can see a partial result of the 64-bit memory operation, leaving room for unexpected inconsistencies that might not cause problems every time, but can occur at random, with a probability of occurrence heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn't affect other compilers and might not hold true for all supported platforms.
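A sketch of what such a partial result (a "torn" read) can look like. The value is chosen so that both 32-bit halves change; whether the effect reproduces is entirely timing-dependent:

volatile int64_t shared = 0;

// Thread B executes (compiled as two 32-bit stores with -m32):
//     shared = 0x0000002A0000002A;
// (which half is written first is an implementation detail)
//
// Thread A, reading concurrently, may observe any of:
//     0x0000000000000000   old value
//     0x000000000000002A   new low half, old high half: a torn read
//     0x0000002A0000002A   new value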
To guard against partial loads/stores in all circumstances, across all compilers and supported platforms, std::atomic can be employed. Let's review how std::atomic affects the generated assembly. The updated sample:
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
int main()
{
    std::atomic<int64_t> shared{0};  // explicitly initialized; before C++20 a
                                     // default-constructed std::atomic holds an
                                     // indeterminate value
    std::thread thread([&shared]() {
        shared.store(42, std::memory_order_relaxed);
    });
    while (shared.load(std::memory_order_relaxed) != 42) {
    }
    assert(shared.load(std::memory_order_relaxed) == 42);
    thread.join();
    return 0;
}
The 32-bit assembly generated by GCC 10.2 with -O3 (https://godbolt.org/z/8sPs55nzT):
// Thread B:
// shared.store(42, std::memory_order_relaxed);
movl $42, %ecx
xorl %ebx, %ebx
subl $8, %esp
movl 16(%esp), %eax
movl 4(%eax), %eax # function arg: pointer to shared
movl %ecx, (%esp)
movl %ebx, 4(%esp)
movq (%esp), %xmm0 # 8-byte reload
movq %xmm0, (%eax) # 8-byte store to shared
addl $8, %esp
// Thread A:
// while (shared.load(std::memory_order_relaxed) != 42) {
// }
.L9: # do {
movq -16(%ebp), %xmm1 # 8-byte load from shared
movq %xmm1, -32(%ebp) # copy to a dummy temporary
movl -32(%ebp), %edx
movl -28(%ebp), %ecx # and scalar reload
movl %edx, %eax
movl %ecx, %edx
xorl $42, %eax
orl %eax, %edx
jne .L9 # } while(shared.load() ^ 42 != 0);
To guarantee atomicity for loads and stores, the compiler emits an 8-byte SSE2 movq instruction (to/from the bottom half of a 128-bit SSE register). Additionally, the assembly shows that the loop remains intact even though volatile was removed.
By using std::atomic in the sample, it is guaranteed that each load and store is atomic, and that the store eventually becomes visible to the loads in the loop.
The C++ standard doesn't talk about registers at all, but it does say:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
While that leaves room for interpretation, caching std::atomic loads across iterations, as triggered in our sample (without volatile or atomic), would clearly be a violation: the store might never become visible. Current compilers don't even optimize atomics within one block, e.g. two accesses in the same iteration.
On x86, naturally aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes without special instructions. That's why GCC is able to use movq.
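A related detail: with gcc -m32, a plain int64_t may only be 4-byte aligned (an i386 ABI detail), while std::atomic<int64_t> is aligned to its full size, which is what makes the single movq atomic. A quick sketch; the static_assert reflects mainstream x86 implementations, not a standard guarantee:

#include <atomic>
#include <cstdint>

// Holds on GCC/Clang for x86, including -m32, where plain int64_t may
// only have 4-byte alignment; not portable to every conceivable ABI.
static_assert(alignof(std::atomic<std::int64_t>) == 8,
              "expected natural alignment for atomic 64-bit access");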
atomic<T> with a large T may not be supported directly by hardware, in which case the compiler can fall back to using a mutex.
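Whether a given std::atomic<T> uses such a fallback can be queried with is_lock_free(). A small sketch; the type Big is made up for illustration:

#include <atomic>
#include <cstdint>
#include <cstdio>

struct Big { char data[64]; };  // hypothetical type, larger than any
                                // hardware-supported atomic access size

int main()
{
    std::atomic<std::int64_t> small{0};
    std::atomic<Big> big{};  // still compiles, but...

    std::printf("int64_t lock-free: %d\n", small.is_lock_free());  // typically 1
    std::printf("Big lock-free: %d\n", big.is_lock_free());        // typically 0 (mutex-based)
    return 0;
}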
A large T (e.g. the size of 2 registers) on some platforms might require an atomic RMW operation (if the compiler doesn't simply fall back to locking); such operations are sometimes provided for a larger size than the largest efficient pure-load / pure-store that's guaranteed atomic (e.g. lock cmpxchg16b on x86-64, or an ldrexd/strexd retry loop on ARM). Single-instruction atomic RMWs (like x86 uses) internally involve a cache-line lock or a bus lock. For example, older versions of clang -m32 for x86 will use lock cmpxchg8b instead of movq for an 8-byte pure-load or pure-store.
What's the second aspect mentioned above, and what does std::memory_order_relaxed mean?
Both the compiler and the CPU can reorder memory operations to optimize efficiency. The primary constraint on reordering is that all loads and stores must appear to have been executed in the order given by the code (program order), but only as observed by the thread executing them. Therefore, in the case of inter-thread communication, the memory order must be taken into account to establish the required ordering despite such reordering attempts. The required memory order can be specified for std::atomic loads and stores. std::memory_order_relaxed does not impose any particular order.
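To illustrate why a stronger order is sometimes needed, consider the classic case of publishing non-atomic data through an atomic flag. A sketch (the names payload and ready are made up):

#include <atomic>
#include <cstdint>

std::int64_t payload = 0;        // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                  // 1: write the data
    ready.store(true, std::memory_order_release);  // 2: publish; the payload
                                                   //    write may not sink below this store
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) {  // reads after this load
    }                                                 // may not hoist above it
    // payload is guaranteed to be 42 here; with memory_order_relaxed on both
    // sides, reading payload would be a data race with no such guarantee
}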
Mutual exclusion primitives enforce a specific memory order (acquire-release order) so that memory operations cannot move out of the lock scope, and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, using locks, all the aspects raised here are addressed simply by using the locking facility. As soon as you break out of the comfort locks provide, you have to be mindful of the consequences and of the factors that affect concurrency correctness.
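For comparison, here is a lock-based version of the original sample; correctness follows from the acquire/release semantics of lock()/unlock(), at the cost of taking the lock on every poll (a sketch, still busy-waiting for brevity):

#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>

int main()
{
    int64_t shared = 0;
    std::mutex m;
    std::thread thread([&]() {
        std::lock_guard<std::mutex> lock(m);  // release semantics on unlock
        shared = 42;
    });
    for (;;) {
        std::lock_guard<std::mutex> lock(m);  // acquire semantics on lock
        if (shared == 42) {
            break;
        }
    }
    assert(shared == 42);
    thread.join();
    return 0;
}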
Being as explicit as possible about inter-thread communication is a good starting point, so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) to volatile T (and, of course, to plain T). Also, whenever possible, prefer not to roll your own lock-free code; that reduces code complexity and maximizes the probability of correctness.