Why is the CompareAndSwap instruction considered expensive?
I read in a book:
"Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction."
Thanks!
"CAS isn't appreciably different than a normal store. Some of the misinformation regarding CAS probably arises from the original implementation of lock:cmpxchg (CAS) on Intel processors. The lock: prefix caused the LOCK# signal to be asserted, acquiring exclusive access to the bus. This didn't scale of course. Subsequent implementations of lock:cmpxchg leverage cache coherency protocol -- typically snoop-based MESI -- and don't assert LOCK#." - David Dice, Biased locking in HotSpot
"Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction."
This is quite true.
E.g. on x86, a proper CAS on a multi-processor system has a lock prefix.
The lock prefix results in a full memory barrier:
"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.
A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JIT on x86/x64.
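For example (assuming Java 9+), VarHandle.fullFence() gives direct access to such a barrier from Java code; on x86, HotSpot compiles it to exactly this kind of dummy locked instruction (typically a lock add of zero to a stack location). A minimal sketch:

```java
import java.lang.invoke.VarHandle;

public class Fence {
    static int data;
    static volatile boolean ready;

    public static void main(String[] args) {
        data = 42;
        VarHandle.fullFence(); // full barrier: on x86 typically a dummy "lock add $0" to the stack
        ready = true;          // a thread that observes ready == true may now safely read data
        System.out.println(ready && data == 42); // prints true
    }
}
```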
On x86, CAS results in a full memory barrier.
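This is the instruction behind java.util.concurrent.atomic: each compareAndSet on x86 boils down to a lock cmpxchg. A typical CAS retry loop looks like this (a minimal sketch; the helper method is illustrative, not part of the JDK):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasLoop {
    // Atomically adds delta to counter using a CAS retry loop.
    static int addViaCas(AtomicInteger counter, int delta) {
        int old;
        do {
            old = counter.get();                            // read the current value
        } while (!counter.compareAndSet(old, old + delta)); // retry if another thread raced us
        return old + delta;
    }

    public static void main(String[] args) {
        AtomicInteger c = new AtomicInteger(40);
        System.out.println(addViaCas(c, 2)); // prints 42
    }
}
```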
On PPC, it is different. An LL/SC pair - lwarx & stwcx - can be used to load the memory operand into a register, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted.
It also does not mean an automatic full fence.
Performance characteristics and behaviour can be very different on different architectures.
But then again - a weak LL/SC is not CAS.
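Java actually exposes this difference: AtomicInteger.weakCompareAndSet is allowed to fail spuriously (just as a store-conditional can) and creates no happens-before ordering, so it must always sit inside a retry loop. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WeakCas {
    // Increment using the weak variant; spurious failure is legal,
    // so the loop simply retries until the update sticks.
    static void weakIncrement(AtomicInteger counter) {
        int old;
        do {
            old = counter.get();
        } while (!counter.weakCompareAndSet(old, old + 1)); // may fail even without contention
    }

    public static void main(String[] args) {
        AtomicInteger c = new AtomicInteger(0);
        for (int i = 0; i < 5; i++) weakIncrement(c);
        System.out.println(c.get()); // prints 5
    }
}
```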
That's because they introduce extra overhead to make the operation atomic. The underlying platform has to suppress optimizations (such as reordering loads and stores, or serving them from store buffers) and wait for outstanding memory operations to drain before the barrier completes. While that drain is in progress the core makes no forward progress, so the overall program incurs a time delay.
"expensive" is very relative here. It's absolutely insignificant compared with, say, a harddisk access. But RAM bus speed has not kept up with the speed of modern CPUs, and compared with arithmetic operations inside the CPU, accessing the RAM directly (i.e. non-cached) is quite expensive. It can easily take 50 times as long to fetch an int
from RAM than to add two registers.
So, since memory barriers basically force direct RAM access (possibly for multiple CPUs), they are relatively expensive.
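A crude way to get a feel for this on your own machine is to compare a plain register increment with an atomic one (which on x86 carries a full barrier). This is an illustration only, not a rigorous benchmark - the JIT can optimize the plain loop away entirely, and real measurements should use a harness like JMH:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierCost {
    static final int N = 10_000_000;

    public static void main(String[] args) {
        int plain = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) plain++;                  // register arithmetic only
        long plainNs = System.nanoTime() - t0;

        AtomicInteger atomic = new AtomicInteger();
        t0 = System.nanoTime();
        for (int i = 0; i < N; i++) atomic.incrementAndGet(); // barrier-bearing atomic on x86
        long atomicNs = System.nanoTime() - t0;

        System.out.printf("plain: %d ms, atomic: %d ms (plain result: %d, atomic result: %d)%n",
                plainNs / 1_000_000, atomicNs / 1_000_000, plain, atomic.get());
    }
}
```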