Relative performance of swap vs compare-and-swap locks on x86

Two common locking idioms are:

if (!atomic_swap(lockaddr, 1)) /* got the lock */

and:

if (!atomic_compare_and_swap(lockaddr, 0, val)) /* got the lock */

where val could simply be a constant or an identifier for the new prospective owner of the lock.

What I'd like to know is whether there tends to be any significant performance difference between the two on x86 (and x86_64) machines. I know this is a fairly broad question since the answer might vary a lot between individual cpu models, but that's part of the reason I'm asking SO rather than just doing benchmarks on a few cpus I have access to.

Why is compare and swap better than test and set?

test-and-set modifies the contents of a memory location and returns its old value as a single atomic operation. compare-and-swap atomically compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value.

How does compare and swap work?

Compare and swap is a technique used when designing concurrent algorithms. The approach is to compare the actual value of the variable to the expected value of the variable and if the actual value matches the expected value, then swap the actual value of the variable for the new value passed in.

What is Cmpxchg?

1. cmpxchg is a hardware locked operation rather than a software locked operation.

How CAS is implemented?

The simplest CAS implementations (and the easiest mental model) will simply freeze the local cache coherence protocol state machine after the load part of the CAS brings the relevant cache line into the nearest (e.g. L1) cache in exclusive mode, and will unfreeze it after the (optional) store completes.

I assume atomic_swap(lockaddr, 1) gets translated to a xchg reg,mem instruction and atomic_compare_and_swap(lockaddr, 0, val) gets translated to a cmpxchg[8b|16b].

Some linux kernel developers think cmpxchg ist faster, because the lock prefix isn't implied as with xchg. So if you are on a uniprocessor, multithread or can otherwise make sure the lock isn't needed, you are probably better of with cmpxchg.

But chances are your compiler will translate it to a "lock cmpxchg" and in that case it doesn't really matter. Also note that while latencies for this instructions are low (1 cycle without lock and about 20 with lock), if you happen to use are common sync variable between two threads, which is quite usual, some additional bus cycles will be enforced, which last forever compared to the instruction latencies. These will most likely completly be hidden by a 200 or 500 cpu cycles long cache snoop/sync/mem access/bus lock/whatever.

I found this Intel document, stating that there is no difference in practice:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/

One common myth is that the lock utilizing a cmpxchg instruction is cheaper than a lock utilizing an xchg instruction. This is used because cmpxchg will not attempt to get the lock in exclusive mode since the cmp will go through first. Figure 9 shows that the cmpxchg is just as expensive as the xchg instruction.

Relative performance of swap vs compare-and-swap locks on x86

Tags:

c

x86

assembly

atomic

locking

R.. GitHub STOP HELPING ICE

People also ask

2 Answers

Gunther Piez

Bo Persson

Recent Activity

Donate For Us

Relative performance of swap vs compare-and-swap locks on x86

Tags:

c

x86

assembly

atomic

locking

R.. GitHub STOP HELPING ICE

People also ask

2 Answers

Gunther Piez

Bo Persson

Related questions

Recent Activity

Donate For Us