 

ARM atomics performance

I am running the same code on an Intel CPU and an ARM CPU (Mac/iOS, compiler: Clang). While profiling the application, I noticed that on iOS/ARM the atomic operations are the top 3 items, while on Intel they are not even in the top 10. Is it true that atomic operations are that much slower on ARM (relatively speaking, of course)?

István Csanády asked Sep 30 '22 22:09



1 Answer

One point to note is that, thanks to implementation details, you're not necessarily seeing the whole story.

Under the load-linked/store-conditional paradigm of ARM, any atomic operation is at least 4 instructions: load-exclusive, <operation>[1], store-exclusive, and a conditional branch to retry if necessary. Every other core is entirely oblivious to this and carries on doing its own thing.
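As a rough C++-level analogy of that load / operate / conditional-store / retry shape, here is a minimal sketch of an atomic add written as an explicit retry loop. The helper name `fetch_add_via_retry_loop` is hypothetical and not from the original answer, and this is not the instruction sequence a compiler emits for `fetch_add`; it only mirrors the structure.

```cpp
#include <atomic>

// Hypothetical helper (illustration only): an atomic add spelled out as a
// retry loop, mirroring the load / operate / conditional-store / branch shape.
int fetch_add_via_retry_loop(std::atomic<int>& value, int delta) {
    // Step 1: read the current value (the analogue of the load-exclusive).
    int observed = value.load(std::memory_order_relaxed);

    // Steps 2-4: compute observed + delta, try to store it, retry on failure.
    // compare_exchange_weak is allowed to fail spuriously, and on failure it
    // writes the freshly observed value back into `observed`.
    while (!value.compare_exchange_weak(observed, observed + delta,
                                        std::memory_order_seq_cst,
                                        std::memory_order_relaxed)) {
        // Another thread changed `value` (or the attempt failed spuriously);
        // loop again with the updated `observed`.
    }
    return observed;  // the value before the add, like std::atomic::fetch_add
}
```

Each pass through the loop is several ordinary instructions, so a sampling profiler has plenty of places to land inside it.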

On x86, however, where instructions can operate directly on memory, atomics are typically accomplished by sticking the LOCK prefix on a single instruction. This means two things. Firstly, you can never be interrupted inside your atomic 'routine', since it's a single instruction. Secondly, no other core can access memory while the bus is locked, so it effectively pauses execution of everything until it completes[2]. Together, these mean that a sampling profiler will rarely, if ever, catch the atomic operation 'in progress', regardless of how long it actually takes.
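Because samples may be attributed differently on the two architectures, one way to compare them is to time the atomic operation directly instead of reading it off the profile. The following is only a sketch under that assumption: the helper name `ns_per_atomic_increment` is hypothetical, it measures the uncontended single-threaded case only, and the numbers it produces depend on the compiler, flags, and hardware.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical helper (illustration only): average wall-clock nanoseconds per
// uncontended atomic increment, measured on the calling thread.
double ns_per_atomic_increment(std::atomic<std::uint64_t>& counter,
                               std::uint64_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // On x86 this typically compiles to a single LOCK-prefixed instruction;
        // on ARM targets that rely on load-exclusive/store-exclusive it
        // becomes a short retry loop.
        counter.fetch_add(1, std::memory_order_seq_cst);
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return iterations == 0
               ? 0.0
               : elapsed.count() / static_cast<double>(iterations);
}
```

Running the same function with the same iteration count on both machines gives a per-operation figure that does not depend on where the profiler happens to take its samples.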

[1] OK, so that makes an atomic swap only 3 instructions, but anything else has one or more instructions in the middle here.

[2] This is slightly less true of modern cores which will only lock their own cache, rather than everything, to avoid affecting other cores accessing unrelated areas, but the hardware cache-coherency will still prevent anyone else interfering.

Notlikethat answered Oct 03 '22 06:10
