
How expensive are atomic operations?

I'm diving into multi-threaded programming and thinking about lock-free reference counting using atomic operations.

Obviously, an atomic operation can be slower than its non-atomic counterpart, at least by a constant factor. My concern is the extra synchronization with other CPUs that may be needed to perform an atomic operation.

I wonder whether, and by how much, executing an atomic operation on core A affects the performance of other cores that:

  1. have nothing to do with core A
  2. are executing different threads of the same process as core A
  3. are executing an atomic operation
  4. are executing an atomic operation and are executing different threads of the same process as core A
  5. are executing any memory-related operation, e.g. load, store, ...
  6. are executing any memory-related operation in the same memory region (cache line, page?) as core A
kravemir asked Sep 15 '15 16:09



1 Answer

I'm comparing an atomic read-modify-write operation to the corresponding non-atomic operation on modern x86 CPUs.
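To make the comparison concrete, here is a minimal C++ sketch of the two operations being compared. The function names are made up for illustration. On x86, compilers typically emit an ordinary `add` or `inc` for the first and a `lock`-prefixed instruction (such as `lock add` or `lock xadd`) for the second, though the exact instruction depends on the compiler and flags.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Non-atomic read-modify-write: load, add, store.
// Not safe if another thread touches `counter` at the same time.
inline void plain_increment(std::uint64_t& counter) {
    ++counter;
}

// Atomic read-modify-write: the load, add and store happen as one
// indivisible step. Returns the value held before the increment.
inline std::uint64_t atomic_increment(std::atomic<std::uint64_t>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical helper: run both paths n times on a single thread.
// With no concurrency the two must agree; only the cost differs.
inline bool same_result_single_threaded(std::uint64_t n) {
    std::uint64_t plain = 0;
    std::atomic<std::uint64_t> atomic{0};
    for (std::uint64_t i = 0; i < n; ++i) {
        plain_increment(plain);
        const std::uint64_t previous = atomic_increment(atomic);
        assert(previous == i);  // fetch_add returns the old value
        (void)previous;         // silence unused warning when NDEBUG is set
    }
    return plain == atomic.load(std::memory_order_relaxed);
}
```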

have nothing to do with core A

No effect.

are executing different threads of the same process as core A

No effect.

are executing an atomic operation

No effect.

are executing an atomic operation and are executing different threads of the same process as core A

No effect.

are executing any memory-related operation, e.g. load, store, ...

No effect.

are executing any memory-related operation in the same memory region (cache line, page?) as core A

The core performing the atomic operation must acquire the cache line exclusively, stealing it from any other core(s) that have it in their caches. Another core cannot access that line until the atomic operation has completed to cache and inter-cache traffic has synchronized the line so that it is either shared or exclusive in that other core.
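The cache-line effect can be sketched in C++ as follows. This is an illustration under assumptions, not a benchmark: the 64-byte line size is an assumption (typical for current x86 parts, but not guaranteed), and all type and function names are hypothetical. In the first layout two unrelated counters very likely share a cache line, so each core's atomic increment has to take the line away from the other. In the second, `alignas` puts each counter on its own line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Assumption: 64-byte cache lines.
constexpr std::size_t kAssumedCacheLine = 64;

// Two counters packed next to each other: very likely the same cache line.
struct SharedLineCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter aligned to its own (assumed) cache line.
struct PaddedCounters {
    alignas(kAssumedCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kAssumedCacheLine) std::atomic<std::uint64_t> b{0};
};

// One thread increments `a`, another increments `b`. The threads never
// touch the same variable, only (possibly) the same cache line.
template <typename Counters>
std::uint64_t hammer(Counters& c, std::uint64_t iterations) {
    std::thread t1([&c, iterations] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&c, iterations] {
        for (std::uint64_t i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}

// Hypothetical driver: both layouts produce the same totals;
// any difference would show up in run time, not in the result.
inline bool run_both_layouts(std::uint64_t iterations) {
    SharedLineCounters shared;
    PaddedCounters padded;
    const std::uint64_t shared_total = hammer(shared, iterations);
    const std::uint64_t padded_total = hammer(padded, iterations);
    assert(shared_total == 2 * iterations);
    assert(padded_total == 2 * iterations);
    return shared_total == padded_total;
}
```

Timing `hammer` on each layout (for example with `std::chrono::steady_clock`) is one way to see whether the shared-line case is slower on a given machine. The result is identical either way.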

The main cost of an atomic operation falls on the pipeline of the core executing the atomic instruction. Because the atomic operation must take place all at once, at a well-defined point, it (mostly) cannot overlap with other operations. That is a significant penalty for a superscalar CPU, which gains performance by keeping many instructions in various stages of processing.
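For the reference-counting case from the question, this means every acquire and release of a reference pays that per-instruction cost on the core doing it. Below is a minimal sketch of an intrusive lock-free reference count, assuming the common pattern of a relaxed increment and a release decrement followed by an acquire fence. The class and its method names are hypothetical and not taken from any particular library.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical intrusive reference count. Starts at 1 for the creator.
class RefCounted {
public:
    // Taking a new reference: one atomic read-modify-write.
    // Relaxed is enough because the caller already holds a reference.
    void acquire() noexcept {
        refs_.fetch_add(1, std::memory_order_relaxed);
    }

    // Dropping a reference: one atomic read-modify-write.
    // Returns true if the caller dropped the last reference and is
    // therefore responsible for destroying the object.
    bool release() noexcept {
        const std::size_t previous =
            refs_.fetch_sub(1, std::memory_order_release);
        assert(previous != 0 && "release() without a matching reference");
        if (previous == 1) {
            // Make writes from other threads' earlier releases visible
            // before the object is destroyed.
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }

    std::size_t use_count() const noexcept {
        return refs_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::size_t> refs_{1};
};
```

Since each `acquire`/`release` pair costs two atomic instructions, passing such an object by reference where ownership does not change avoids the cost altogether.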

David Schwartz answered Sep 19 '22 01:09
