Average latency of atomic cmpxchg instructions on Intel CPUs

Question

I am looking for a reference on the average latency of the lock cmpxchg instruction on various Intel processors. I have not been able to locate a good reference on the topic, and any pointer would help greatly.

Thanks.

Zooba · Accepted Answer

There are few, if any, good references on this because there is so much variation. It depends on basically everything including bus speed, memory speed, processor speed, processor count, surrounding instructions, memory fencing and quite possibly the angle between the moon and Mt Everest...

If you have a very specific application, as in, known (fixed) hardware, operating environment, a real-time operating system and exclusive control, then maybe it will matter. In this case, benchmark. If you don't have this level of control over where your software is running, any measurements are effectively meaningless.
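As a rough illustration of what "benchmark" could look like, here is a minimal sketch that times an uncontended compare-and-swap loop on a single thread. The function name average_cas_ns is made up for this example. It assumes that std::atomic<std::uint64_t>::compare_exchange_strong compiles to lock cmpxchg on your x86 target, which is typical but worth confirming in the disassembly. The number it returns describes only the machine, compiler flags and load it ran under, and it says nothing about the contended case.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical helper: average nanoseconds per uncontended CAS on this machine.
// Returns 0.0 for an empty run and -1.0 if the loop did not behave as expected.
double average_cas_ns(std::uint64_t iterations) {
    if (iterations == 0) return 0.0;

    std::atomic<std::uint64_t> value{0};
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        std::uint64_t expected = i;
        // Single-threaded, so this succeeds every time: value goes from i to i + 1.
        value.compare_exchange_strong(expected, i + 1);
    }
    auto stop = std::chrono::steady_clock::now();

    // Sanity check that every CAS actually succeeded.
    if (value.load() != iterations) return -1.0;

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling average_cas_ns(10000000) a few times and comparing the results gives a feel for run-to-run noise. The figure includes loop overhead and is wall-clock time rather than cycles, so treat it as an upper bound on the best case, not as the instruction's latency.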

As discussed in these answers, locks are themselves implemented using CAS, so if you can get away with a single CAS instead of a lock (which needs at least two operations, one to acquire and one to release), it will be faster. Whether it will be noticeably faster is another matter: only maybe.
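To make the "at least two operations" point concrete, here is a small sketch that puts a CAS-based increment next to the same increment guarded by a toy spinlock. SpinLock, cas_increment and locked_increment are names invented for this example, and the spinlock is deliberately naive (no backoff, no fairness); real mutexes do more work than this.

```cpp
#include <atomic>
#include <cstdint>

// Lock-free increment: one atomic read-modify-write per attempt,
// retried only if another thread changed the counter in between.
void cas_increment(std::atomic<std::uint64_t>& counter) {
    std::uint64_t seen = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(seen, seen + 1)) {
        // On failure `seen` now holds the current value; just try again.
    }
}

// Hypothetical minimal spinlock, for illustration only.
struct SpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        bool expected = false;
        // Operation 1: a CAS to take the lock.
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire)) {
            expected = false;  // CAS overwrote it with the observed value
        }
    }

    void unlock() {
        // Operation 2: a store to release the lock.
        locked.store(false, std::memory_order_release);
    }
};

// Same increment, but paying for an acquire and a release around it.
void locked_increment(SpinLock& lock, std::uint64_t& counter) {
    lock.lock();
    ++counter;
    lock.unlock();
}
```

On x86 the release store is usually a plain mov, so in the uncontended case the difference between the two versions may be small. That fits the "only maybe" above.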

The best references you will find are the Intel Software Developer's Manuals, though since there is so much variation they won't give you an actual number. They will, however, describe how to get the best performance possible. Possibly a processor datasheet (such as those here for the i7 Extreme Edition, under "Technical Documents") will give you actual numbers (or at least a range).

Arto Bendiken · Answer

The best x86 instruction latency reference is probably the one in Agner Fog's optimization manuals, which is based on empirical measurements on various Intel/AMD/VIA chips and is frequently updated for the latest CPUs on the market.

Unfortunately, I don't see the CMPXCHG instruction listed in the instruction latency tables, but page 4 does state:

Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

Tags:

x86

multithreading

atomic

lock-free

Sandeep
