
Reading shared variables with relaxed ordering: is it possible in theory? Is it possible in C++?

Consider the following pseudocode:

expected = null;
if (variable == expected)
{
    atomic_compare_exchange_strong(
        &variable, expected, desired(), memory_order_acq_rel, memory_order_acq);
}
return variable;

Observe that there are no "acquire" semantics when the variable == expected check is performed.

It seems to me that desired will be called at least once in total, and at most once per thread.
Furthermore, if desired never returns null, then this code will never return null.
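For concreteness, here is one way the pseudocode might be rendered in C++ with a pointer payload. This is only a sketch: the names widget, desired, and get are illustrative and not taken from the question, and the initial check is deliberately relaxed to mirror the pseudocode.

```cpp
#include <atomic>

// Illustrative payload and factory; these names are assumptions,
// not part of the original question.
struct widget { int value = 42; };

std::atomic<widget*> variable{nullptr};

widget* desired() { return new widget{}; }

widget* get() {
    widget* expected = nullptr;
    // Mirror the pseudocode: the initial check carries no acquire semantics.
    if (variable.load(std::memory_order_relaxed) == expected) {
        variable.compare_exchange_strong(
            expected, desired(),
            std::memory_order_acq_rel, std::memory_order_acquire);
        // Note: like the pseudocode, this leaks the result of desired()
        // when the exchange loses the race to another thread.
    }
    return variable.load(std::memory_order_relaxed);
}
```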

Now, I have three questions:

  1. Is the above necessarily true? i.e., can we really have well-ordered reads of shared variables even in the absence of fences on every read?

  2. Is it possible to implement this in C++? If so, how? If not, why?
    (Hopefully with a rationale, not just "because the standard says so".)

  3. If the answer to (2) is yes, then is it also possible to implement this in C++ without requiring variable == expected to perform an atomic read of variable?

Basically, my goal is to understand whether it is possible to perform lazy initialization of a shared variable in a manner whose performance is identical to that of a non-shared variable once the code has been executed at least once by each thread.

(This is somewhat of a "language-lawyer" question. So that implies the question isn't about whether this is a good or useful idea, but rather about whether it's technically possible to do this correctly.)

asked May 01 '14 by user541686


2 Answers

Regarding the question of whether it is possible to perform lazy initialisation of a shared variable in C++ with performance (almost) identical to that of a non-shared variable:

The answer is that it depends on the hardware architecture and on the implementation of the compiler and run-time environment. At least it is possible in some environments, in particular on x86 with GCC and Clang.

On x86, atomic reads can be implemented without memory fences. Basically, an atomic read is identical to a non-atomic read. Take a look at the following compilation unit:

#include <atomic>

std::atomic<int> global_value;

int load_global_value() {
    return global_value.load(std::memory_order_seq_cst);
}

Although I used an atomic operation with sequential consistency (the default), there is nothing special in the generated code. The assembler code generated by GCC and Clang looks as follows:

load_global_value():
    movl global_value(%rip), %eax
    retq

I said almost identical, because there are other reasons that might impact the performance. For example:

  • although there is no fence, the atomic operations still prevent some compiler optimisations, e.g. reordering of instructions and elimination of stores and loads
  • if at least one thread writes to a different memory location on the same cache line, it will have a huge impact on performance (known as false sharing)
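The false-sharing point can be illustrated with cache-line padding. The 64-byte line size below is an assumption (it is platform-dependent), and the names are made up for this sketch:

```cpp
#include <atomic>

// Assume a 64-byte cache line; where available, prefer querying
// std::hardware_destructive_interference_size (C++17, <new>)
// instead of hard-coding the size.
struct alignas(64) padded_counter {
    std::atomic<long> value{0};
};

// Each counter occupies its own cache line, so threads incrementing
// different counters do not invalidate each other's lines.
padded_counter counters[4];
```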

Having said that, the recommended way to implement lazy initialisation is to use std::call_once. That should give you the best result for all compilers, environments and target architectures.

#include <memory>
#include <mutex>

// _init and _gadget must be members of an enclosing class,
// so that the lambda below can legally capture this.
class gadget_holder
{
    std::once_flag _init;
    std::unique_ptr<gadget> _gadget;

public:
    auto get_gadget() -> gadget&
    {
        std::call_once(_init, [this] { _gadget.reset(new gadget{...}); });
        return *_gadget;
    }
};
answered Oct 25 '22 by nosid

This is undefined behavior. You're modifying variable, at least in some threads, which means that all accesses to variable must be protected. In particular, when you're executing the atomic_compare_exchange_strong in one thread, there is nothing to prevent another thread from seeing the new value of variable before it sees the writes that occurred in desired(). (atomic_compare_exchange_strong only guarantees ordering in the thread that executes it.)
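For contrast, a conventional double-checked-locking sketch that avoids this problem pairs a release store with an acquire load. The names here are illustrative, not from the question:

```cpp
#include <atomic>
#include <mutex>

struct gadget { int ready = 1; };  // illustrative payload

std::atomic<gadget*> instance{nullptr};
std::mutex init_mutex;

gadget* get_instance() {
    // Fast path: the acquire load ensures that if we see a non-null
    // pointer, we also see the writes made by the constructing thread.
    gadget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        p = instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new gadget{};
            // The release store pairs with the acquire load above.
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```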

answered Oct 25 '22 by James Kanze