I'm trying to figure out the most relaxed (and correct) memory order for a shared-pointer destructor. What I have in mind for now is as follows:
~shared_ptr() {
    if (p) {
        if (p->cnt.fetch_sub(1, std::memory_order_release) == 1) {
            p->cnt.load(std::memory_order_acquire);
            delete p;
        }
    }
}
Basically, I'm thinking that all previous fetch_sub() calls should happen-before delete p;, and that by issuing p->cnt.load(std::memory_order_acquire); I construct a release sequence that ensures this.
I'm new to the C++ memory model and not quite confident. Is my reasoning above correct, and is the memory order I specified correct and the most relaxed?
In theory, you may have the most efficient code, since there is no more synchronization than necessary.
But in practice, almost no CPU provides instructions that map perfectly to the acquire/release memory orders (maybe the future ARMv8.3-A will). So you will have to check the generated code for each target.
For example, on x86_64, fetch_sub(std::memory_order_acq_rel) and fetch_sub(std::memory_order_release) result in exactly the same instruction.
So while in theory your code looks optimal, in practice you get code that is less optimal than if you had opted for a simpler approach:
std::atomic<int> cnt;
int* p;

void optimal_in_theory() {
    if (cnt.fetch_sub(1, std::memory_order_release) == 1) {
        cnt.load(std::memory_order_acquire);
        delete p;
    }
}

void optimal_in_practice_on_x86_64() {
    if (cnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete p;
    }
}
Assembly:
optimal_in_theory():
    lock sub DWORD PTR cnt[rip], 1
    je      .L4
    rep ret
.L4:
    mov     eax, DWORD PTR cnt[rip]   ; unnecessary extra load
    mov     rdi, QWORD PTR p[rip]
    mov     esi, 4
    jmp     operator delete(void*, unsigned long)

optimal_in_practice_on_x86_64():
    lock sub DWORD PTR cnt[rip], 1
    je      .L7
    rep ret
.L7:
    mov     rdi, QWORD PTR p[rip]
    mov     esi, 4
    jmp     operator delete(void*, unsigned long)
"One day I will live in Theory, because in Theory everything goes well." - Pierre Desproges
Why does the compiler keep this extra load?
According to the standard, optimizers are allowed to elide redundant loads performed on non-volatile atomics. For example, if you added three extra loads to your code:
cnt.load(std::memory_order_acquire);
cnt.load(std::memory_order_acquire);
cnt.load(std::memory_order_acquire);
With GCC or Clang, all three loads appear in the assembly:
    mov eax, DWORD PTR cnt[rip]
    mov eax, DWORD PTR cnt[rip]
    mov eax, DWORD PTR cnt[rip]
This is a really bad pessimization. My opinion is that it is kept as-is because of a historical confusion between "volatility" and "atomicity". While almost all programmers know that a volatile does not have the properties of an atomic variable, much code is still written with the idea that an atomic has the properties of a volatile: "an atomic access is an observable behaviour". According to the standard it is not (there is an explicit example note about this fact in the standard). This is a recurring question on SO.
So your code really is the optimal code in theory, and it is pessimized because compilers optimize code as if atomics were also volatile.
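For instance, the standard would even permit a compiler to fuse adjacent atomic loads, since the individual loads are not observable behaviour on their own. A minimal sketch of my own (no mainstream compiler performs this optimization today):

#include <atomic>

int sum_two_reads(std::atomic<int>& a) {
    int x = a.load(std::memory_order_relaxed);
    int y = a.load(std::memory_order_relaxed);
    // A conforming compiler could read `a` only once and return 2 * x:
    // the standard does not make each individual atomic load observable.
    // In practice GCC and Clang emit two separate mov instructions here.
    return x + y;
}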
The workaround could be to replace the load with an atomic_thread_fence, as proposed by Kieth in his comment. I am not a hardware expert, but I imagine that such a fence could cause more memory "synchronization" than necessary (or at least in theory ;)).
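A sketch of that fence-based workaround, reusing the cnt and p globals from the snippet above (the function name is mine):

void optimal_with_fence() {
    if (cnt.fetch_sub(1, std::memory_order_release) == 1) {
        // The acquire fence synchronizes with the other owners' release
        // fetch_subs without emitting the extra load; on x86_64 it
        // compiles to no instruction at all, only a compiler barrier.
        std::atomic_thread_fence(std::memory_order_acquire);
        delete p;
    }
}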
Why do I believe your code is optimal in theory?
The last shared_ptr of a single object must call the destructor of that object without itself causing a data race. The destructor may access the value of the object, so the destructor call must happen after the "invalidation" of the pointer to the object.
So delete p; must "happen after" the destructor calls of all the other shared pointers that were sharing the same pointed-to object.
In the standard, happens before is defined by the following paragraphs:
[intro.races]/9:
An evaluation A inter-thread happens before an evaluation B if:
- A synchronizes with B, or [...]
[intro.races]/10:
An evaluation A happens before an evaluation B (or, equivalently, B happens after A) if:
- A is sequenced before B, or
- A inter-thread happens before B.
So there must be a "synchronizes with" relation between the fetch_sub that is sequenced before delete p and the other fetch_subs.
According to [atomics.order]/2:
An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.
So delete p must be sequenced after an acquire operation that loads a value belonging to the release sequence headed by every other fetch_sub.
According to [intro.races]/5, the last fetch_sub (in the modification order of cnt) belongs to the release sequence headed by every other release fetch_sub, because a fetch_sub is a read-modify-write operation, as is fetch_add (supposing no other operations happen on cnt).
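To make the release sequence concrete, here is one possible modification order of cnt with three owners A, B, and C (a worked example under that same assumption):

// Modification order of cnt, initially 3:
//
//   A: fetch_sub(release)  3 -> 2   heads a release sequence
//   B: fetch_sub(release)  2 -> 1   RMW, so it stays in A's release sequence
//   C: fetch_sub(release)  1 -> 0   RMW, so it stays in A's and B's sequences
//
// C's acquire load then reads 0, a side effect in the release sequences
// headed by A's and B's fetch_subs, so A and B both synchronize with
// C's load, and C's `delete p` happens after A's and B's prior work.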
So delete p will happen after all the other fetch_subs, and it is only just before delete p is called that any "synchronization" is produced. Precisely no more than what is necessary.
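Putting the argument back onto your destructor (the annotations are mine):

~shared_ptr() {
    if (p) {
        // Release: each owner's prior writes to *p are published to
        // whichever owner ends up performing the final decrement.
        if (p->cnt.fetch_sub(1, std::memory_order_release) == 1) {
            // Acquire: reads a value from the release sequences headed
            // by all the other fetch_subs, so each of them synchronizes
            // with this load...
            p->cnt.load(std::memory_order_acquire);
            delete p;  // ...and the delete happens after all of them.
        }
    }
}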