
Reading shared variables with relaxed ordering: is it possible in theory? Is it possible in C++?

Consider the following pseudocode:

expected = null;
if (variable == expected)
{
    atomic_compare_exchange_strong(
        &variable, expected, desired(), memory_order_acq_rel, memory_order_acq);
}
return variable;

Observe that there are no "acquire" semantics when the variable == expected check is performed.

It seems to me that desired will be called at least once in total, and at most once per thread.
Furthermore, if desired never returns null, then this code will never return null.
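For concreteness, here is one way the pseudocode might be rendered in C++ with a pointer payload. This is only a sketch: the names widget, desired, and get are illustrative and not taken from the question, and the initial check is deliberately relaxed to mirror the pseudocode.

```cpp
#include <atomic>

// Illustrative payload and factory; these names are assumptions,
// not part of the original question.
struct widget { int value = 42; };

std::atomic<widget*> variable{nullptr};

widget* desired() { return new widget{}; }

widget* get() {
    widget* expected = nullptr;
    // Mirror the pseudocode: the initial check carries no acquire semantics.
    if (variable.load(std::memory_order_relaxed) == expected) {
        variable.compare_exchange_strong(
            expected, desired(),
            std::memory_order_acq_rel, std::memory_order_acquire);
        // Note: like the pseudocode, this leaks the result of desired()
        // when the exchange loses the race to another thread.
    }
    return variable.load(std::memory_order_relaxed);
}
```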

Now, I have three questions:

  1. Is the above necessarily true? i.e., can we really have well-ordered reads of shared variables even in the absence of fences on every read?

  2. Is it possible to implement this in C++? If so, how? If not, why?
    (Hopefully with a rationale, not just "because the standard says so".)

  3. If the answer to (2) is yes, then is it also possible to implement this in C++ without requiring variable == expected to perform an atomic read of variable?

Basically, my goal is to understand whether it is possible to perform lazy initialization of a shared variable in a manner whose performance is identical to that of a non-shared variable once the code has been executed at least once by each thread.

(This is somewhat of a "language-lawyer" question. So that implies the question isn't about whether this is a good or useful idea, but rather about whether it's technically possible to do this correctly.)

asked May 01 '14 by user541686


2 Answers

Regarding the question of whether it is possible to perform lazy initialisation of a shared variable in C++ with performance (almost) identical to that of a non-shared variable:

The answer is that it depends on the hardware architecture and on the implementation of the compiler and run-time environment. At least it is possible in some environments, in particular on x86 with GCC and Clang.

On x86, atomic reads can be implemented without memory fences. Basically, an atomic read is identical to a non-atomic read. Take a look at the following compilation unit:

#include <atomic>

std::atomic<int> global_value;

int load_global_value() {
    return global_value.load(std::memory_order_seq_cst);
}

Although I used an atomic operation with sequential consistency (the default), there is nothing special in the generated code. The assembler code generated by GCC and Clang looks as follows:

load_global_value():
    movl global_value(%rip), %eax
    retq

I said almost identical, because there are other reasons that might impact the performance. For example:

  • although there is no fence, the atomic operations still prevent some compiler optimisations, e.g. reordering of instructions and elimination of stores and loads
  • if at least one thread writes to a different memory location on the same cache line, it will have a huge impact on performance (known as false sharing)
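The false-sharing point can be illustrated with cache-line padding. The 64-byte line size below is an assumption (it is platform-dependent), and the names are made up for this sketch:

```cpp
#include <atomic>

// Assume a 64-byte cache line; where available, prefer querying
// std::hardware_destructive_interference_size (C++17, <new>)
// instead of hard-coding the size.
struct alignas(64) padded_counter {
    std::atomic<long> value{0};
};

// Each counter occupies its own cache line, so threads incrementing
// different counters do not invalidate each other's lines.
padded_counter counters[4];
```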

Having said that, the recommended way to implement lazy initialisation is to use std::call_once. That should give you the best result for all compilers, environments and target architectures.

#include <memory>
#include <mutex>

// _init and _gadget must be members of an enclosing class,
// so that the lambda below can legally capture this.
class gadget_holder
{
    std::once_flag _init;
    std::unique_ptr<gadget> _gadget;

public:
    auto get_gadget() -> gadget&
    {
        std::call_once(_init, [this] { _gadget.reset(new gadget{...}); });
        return *_gadget;
    }
};
answered Oct 25 '22 by nosid

This is undefined behavior. You're modifying variable, at least in some threads, which means that all accesses to variable must be protected. In particular, when you're executing the atomic_compare_exchange_strong in one thread, there is nothing to prevent another thread from seeing the new value of variable before it sees the writes that occurred in desired(). (atomic_compare_exchange_strong only guarantees ordering in the thread that executes it.)
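For contrast, a conventional double-checked-locking sketch that avoids this problem pairs a release store with an acquire load. The names here are illustrative, not from the question:

```cpp
#include <atomic>
#include <mutex>

struct gadget { int ready = 1; };  // illustrative payload

std::atomic<gadget*> instance{nullptr};
std::mutex init_mutex;

gadget* get_instance() {
    // Fast path: the acquire load ensures that if we see a non-null
    // pointer, we also see the writes made by the constructing thread.
    gadget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        p = instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new gadget{};
            // The release store pairs with the acquire load above.
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```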

answered Oct 25 '22 by James Kanze