I understand that the MESI protocol successfully guarantees the same view of memory (caches) for different cores. My question comes from the fact that during writing MESI guarantees that the cache is exclusively owned by a CPU and then atomic CMPXCHG just compares and exchanges values atomically. So why do we need to use the LOCK instruction and thus lock the cache line when we already have that guarantee from the MESI protocol?
On a single-CPU system, cmpxchg is atomic with respect to other threads, or any other code running on the same CPU core. (But not to "system" observers like a memory-mapped I/O device, or a device doing DMA reads of normal memory, so lock cmpxchg was relevant even on uniprocessor CPU designs).
The MESI protocol is a formal mechanism for controlling cache coherency using snooping techniques. Its acronym stands for modified, exclusive, shared, invalid and refers to the states that cached data can take. The MESI protocol is a formal mechanism for controlling cache coherency using snooping techniques.
Yes, implementing MESI(f) or any coherence protocol will have an impact on performance, but that impact is actually positive when compared to the case where no coherence protocol exists. In that case, every read/write would need to go to main memory (ie, your application will be 100's of times SLOWER).
In that sense the Exclusive state is an opportunistic optimization: If the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.
atomic CMPXCHG just compares and exchanges values atomically
No, the cache-access hardware doesn't implement CMPXCHG as a single-cycle inherently-atomic operation. It's synthesized out of multiple uops that load and separately store.
If that's how regular CMPXCHG worked, your reasoning would be correct. But regular CMPXCHG is not atomic (for observers on other cores).
lock cmpxchg
decodes to multiple uops that keep the cache-line "locked" from the load to the store, turning it into a single atomic transaction as far as any other observers in the system can see. (i.e. delay responding to MESI invalidate or share requests for that line until after the store commits). It also makes it a full memory barrier.
Without lock
, CMPXCHG decodes to multiple uops that load, check for equality stuff, and then either store a new value or not according to the compare result. As far as atomicity, it's the same as add [mem], edx
, which uses the ALU for addition in between load and store uops. i.e. it's not atomic, except on the same core with respect to interrupts (because interrupts can only happen at an instruction boundary).
The load and store are each separately atomic, but they aren't a single atomic RMW transaction. If another core invalidates our copy of the cache line and stores a new value between our load and our store, our store will step on the other store. And that other store will appear in the global order of operations on that cache line between our load and store, violating the definition of "atomic" = indivisible.
add [mem], edx
isn't atomic, and how lock
works to make it atomic.cmpxchg
without lock
: uniprocessor machines.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With