On Visual C++ 2013, when I compile the following code
#include <atomic>

int main()
{
    std::atomic<int> v(2);
    return v.fetch_add(1, std::memory_order_relaxed);
}
I get back the following assembly on x86:
51 push ecx
B8 02 00 00 00 mov eax,2
8D 0C 24 lea ecx,[esp]
87 01 xchg eax,dword ptr [ecx]
B8 01 00 00 00 mov eax,1
F0 0F C1 01 lock xadd dword ptr [ecx],eax
59 pop ecx
C3 ret
and similarly on x64:
B8 02 00 00 00 mov eax,2
87 44 24 08 xchg eax,dword ptr [rsp+8]
B8 01 00 00 00 mov eax,1
F0 0F C1 44 24 08 lock xadd dword ptr [rsp+8],eax
C3 ret
I simply don't understand: why does a relaxed increment of an int variable require a lock prefix?
Is there a reason for this, or did they simply not include the optimization of removing it?
* I used /O2 with /NoDefaultLib to trim the output down and get rid of unnecessary C runtime code, but that's irrelevant to the question.
Because the lock prefix is still required for the operation to be atomic; even with memory_order_relaxed, an increment/decrement is a read-modify-write, and x86 only makes a memory read-modify-write indivisible when it carries the lock prefix.
Imagine the same thing with no locks.
v = 0;
And then we spawn 100 threads, each with this command:
v++;
And then you wait for all threads to finish. What would you expect v to be? Unfortunately, it may not be 100. Say one thread loads the value 23; before it writes back 24, another thread also loads 23 and writes out 24. The two increments collapse into one, so the threads effectively cancel each other out. This happens because the increment itself is not atomic: the load, the add, and the store may each be atomic on their own, but the increment is multiple steps, so as a whole it is not.
But with std::atomic, all operations are atomic regardless of the std::memory_order setting. The only question is what order they happen in: memory_order_relaxed still guarantees atomicity; it just gives no ordering guarantees with respect to the other memory operations around it.
Atomic operations, even with relaxed ordering, still have to be atomic.
Even if a read-modify-write on current CPUs were atomic without a lock prefix (hint: it is not; another core can modify the cache line between the read and the write), that wouldn't be guaranteed for future CPUs.
It would be shortsighted to have all your binaries fail horribly on a newer architecture just because you wanted to shave a byte off your binary by relying on behavior that is not part of the x86-64 specification, and thus not guaranteed to be preserved in future processors.
Of course, multi-core systems are widespread, so in practice you do need a lock prefix for it to work on current CPUs. See Can num++ be atomic for 'int num'?