ARM atomics performance

Question

I am running the same code on an Intel CPU and an ARM CPU (Mac/iOS, compiler: Clang). While profiling the application, I noticed that on iOS/ARM the atomic operations are the top 3 items, while on Intel they are not even in the top 10. Is it true that atomic operations on ARM are that much slower (relatively, of course)?

Notlikethat · Accepted Answer

One point to note is that, thanks to implementation details, you're not necessarily seeing the whole story.

Under the load-linked/store-conditional paradigm of ARM, any atomic operation is at least 4 instructions: load-exclusive, <operation>¹, store-exclusive, and a conditional branch to retry if necessary. Every other core is entirely oblivious to this and carries on doing its own thing.
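To make the shape of that sequence concrete, here is a minimal C++ sketch that spells an atomic add out as an explicit retry loop. The names `fetch_add_retry` and `add_n_times` are hypothetical, chosen for illustration only. It is a source-level analogy under the assumption of an LL/SC target, not the exact code Clang emits: the compiler is free to lower `compare_exchange_weak` (or a plain `fetch_add`) differently depending on the target and its flags.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: behaves like counter.fetch_add(delta), but written as
// a loop so the load / operate / conditional-store / retry steps are visible.
// compare_exchange_weak is allowed to fail spuriously, which on an LL/SC
// machine corresponds to the store-exclusive losing its reservation.
int fetch_add_retry(std::atomic<int>& counter, int delta) {
    int observed = counter.load(std::memory_order_relaxed);   // "load"
    while (!counter.compare_exchange_weak(
               observed,                  // refreshed with the current value on failure
               observed + delta,          // "<operation>", then the conditional "store"
               std::memory_order_relaxed)) {
        // Store did not go through: loop again with the refreshed value ("retry").
    }
    return observed;  // value seen just before the addition, as fetch_add returns
}

// Hypothetical driver: applies the helper n times and returns the final value.
int add_n_times(int start, int delta, int n) {
    assert(n >= 0);
    std::atomic<int> counter{start};
    for (int i = 0; i < n; ++i) {
        fetch_add_retry(counter, delta);
    }
    return counter.load(std::memory_order_relaxed);
}
```

Every pass through the loop is an ordinary, interruptible stretch of instructions, so a sampling profiler has several places where a sample can land inside the atomic operation.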

On x86, however, where instructions can operate directly on memory, atomics are typically accomplished by putting the LOCK prefix on a single instruction. This means two things. First, you can never be interrupted inside your atomic 'routine', since it is a single instruction. Second, no other core can access memory while the bus is locked, so it effectively pauses execution of everything else until it completes². Together, these mean that a sampling profiler will rarely, if ever, catch the atomic operation 'in progress', regardless of how long it actually takes.

¹ OK, so that makes an atomic swap only 3 instructions, but anything else has one or more instructions in the middle here.

² This is slightly less true of modern cores which will only lock their own cache, rather than everything, to avoid affecting other cores accessing unrelated areas, but the hardware cache-coherency will still prevent anyone else interfering.
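Since the sampling profile may attribute the cost differently on the two architectures, one way to cross-check is to time the operation directly on each machine. The following is a rough sketch only: `nanos_per_op`, `AtomicCost` and `measure_atomic_cost` are hypothetical names, the loop is single-threaded and uncontended, and loop overhead is included in the figures. The two memory orders are measured separately on the assumption that the ordering argument is one thing that can change the generated code on ARM.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Average wall-clock nanoseconds per call of `op` over `iterations` calls.
template <typename Op>
double nanos_per_op(std::uint64_t iterations, Op op) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Hypothetical result record for the two measurements.
struct AtomicCost {
    double relaxed_add_ns;      // fetch_add with memory_order_relaxed
    double seq_cst_add_ns;      // fetch_add with the default memory_order_seq_cst
    std::uint64_t final_value;  // sanity check: every increment was applied
};

AtomicCost measure_atomic_cost(std::uint64_t iterations) {
    std::atomic<std::uint64_t> counter{0};
    AtomicCost result{};
    result.relaxed_add_ns = nanos_per_op(iterations, [&counter] {
        counter.fetch_add(1, std::memory_order_relaxed);
    });
    result.seq_cst_add_ns = nanos_per_op(iterations, [&counter] {
        counter.fetch_add(1);
    });
    result.final_value = counter.load();
    return result;
}
```

Calling `measure_atomic_cost(10'000'000)` on both devices and comparing the per-operation times gives a figure that does not depend on where the profiler's samples happen to land. The absolute numbers will vary with the core, clock speed and compiler flags, so only the ratio between runs of the same binary configuration is meaningful.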

Tags: c++, multithreading, atomic, intel, arm

István Csanády
