And how much faster or slower is it compared to an operation on an uncontested atomic variable (such as C++'s std::atomic<T>)?
Also, how much slower are contested atomic variables relative to the uncontested lock?
The architecture I'm working on is x86-64.
I happen to have a lot of low-level speed tests lying around. However, what exactly "speed" means is very uncertain, because it depends a lot on what exactly you are doing (even things unrelated to the operation itself).
Here are some numbers from an AMD 64-bit Phenom II X6 at 3.2 GHz. I've also run this on Intel chips and the times vary a lot (again, depending on exactly what is being done).
A GCC __sync_fetch_and_add, which would be a fully-fenced atomic addition, has an average of 16ns, with a minimum time of 4ns. The minimum time is probably closer to the truth (though even there I have a bit of overhead).
An uncontested pthread mutex (through Boost) is 14ns (which is also its minimum). Note this is also a bit too low, since the time will actually increase if another thread had locked the mutex earlier, even though it is uncontested now (since that causes a cache sync).
A failed try_lock is 9ns.
I don't have a figure for a plain old atomic inc, since on x86_64 this is just a normal exchange operation. It is likely close to the minimum possible time, so 1-2ns.
Calling notify without a waiter on a condition variable is 25ns (if something is waiting, about 304ns).
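For reference, the timing loops behind numbers like these look roughly as follows. This is only a rough sketch of the idea, not my actual harness: the iteration count, the use of std::chrono, and the lack of affinity pinning are all simplifications.

    #include <pthread.h>
    #include <chrono>
    #include <cstdio>

    // Average cost in nanoseconds of calling op() `iters` times.
    template <typename Op>
    double avg_ns(long iters, Op op) {
        auto start = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i) op();
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(stop - start).count() / iters;
    }

    int main() {
        constexpr long N = 10'000'000;
        long counter = 0;
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

        // Fully-fenced atomic addition (the __sync_fetch_and_add case).
        double add = avg_ns(N, [&] { __sync_fetch_and_add(&counter, 1); });

        // Uncontested mutex lock/unlock pair.
        double lock = avg_ns(N, [&] {
            pthread_mutex_lock(&m);
            pthread_mutex_unlock(&m);
        });

        // Failed try_lock: hold the mutex so every trylock returns EBUSY.
        pthread_mutex_lock(&m);
        double trylock = avg_ns(N, [&] { pthread_mutex_trylock(&m); });
        pthread_mutex_unlock(&m);

        // Condition-variable signal with nobody waiting.
        double signal = avg_ns(N, [&] { pthread_cond_signal(&cv); });

        std::printf("fetch_and_add %.1fns, lock/unlock %.1fns, "
                    "failed trylock %.1fns, signal (no waiter) %.1fns\n",
                    add, lock, trylock, signal);
    }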
Since all locks impose certain CPU ordering guarantees, however, the amount of memory you have modified (whatever fits in the store buffer) will alter how long such operations take. And obviously, if you ever have contention on a mutex, that is your worst time. Any return to the Linux kernel can cost hundreds of nanoseconds even if no thread switch actually occurs. This is usually where atomic operations out-perform locks, since they never involve kernel calls: your average-case performance is also your worst case. Mutex unlocking also incurs an overhead if there are waiting threads, whereas an atomic would not.
NOTE: Doing such measurements is fraught with problems, so the results are always somewhat questionable. My tests try to minimize variation by fixing the CPU speed, setting CPU affinity for threads, running no other processes, and averaging over large result sets.
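For example, pinning the measuring thread to a single core can be done along these lines (Linux-specific; the core index is arbitrary and the helper name is just for illustration):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to the given core to reduce scheduling noise.
    // Returns true on success. Linux-specific (pthread_setaffinity_np).
    bool pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }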
There's a project on GitHub with the purpose of measuring this on different platforms. Unfortunately, after my master's thesis I never really had the time to follow up on this, but at least the rudimentary code is there.
It measures pthreads and OpenMP locks, compared to the __sync_fetch_and_add intrinsic.
From what I remember, we were expecting a pretty big difference between locks and atomic operations (~ an order of magnitude) but the real difference turned out to be very small.
However, measuring now on my system yields results which reflect my original guess, namely that (regardless of whether pthreads or OpenMP is used) atomic operations are about five times faster, and a single locked increment operation takes about 35ns (this includes acquiring the lock, performing the increment, and releasing the lock).
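To illustrate what is being compared, a lock-protected increment versus an atomic increment (here with an OpenMP lock against __sync_fetch_and_add) looks roughly like this. The iteration count and timing code are simplified stand-ins, not the project's actual harness (compile with -fopenmp):

    #include <omp.h>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr long iterations = 10'000'000;
        long counter = 0;
        omp_lock_t lock;
        omp_init_lock(&lock);

        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iterations; ++i) {
            omp_set_lock(&lock);     // acquire
            ++counter;               // increment under the lock
            omp_unset_lock(&lock);   // release
        }
        auto t1 = std::chrono::steady_clock::now();

        for (long i = 0; i < iterations; ++i)
            __sync_fetch_and_add(&counter, 1);   // fully-fenced atomic increment
        auto t2 = std::chrono::steady_clock::now();

        auto per_op = [&](auto a, auto b) {
            return std::chrono::duration<double, std::nano>(b - a).count() / iterations;
        };
        std::printf("locked increment: %.1fns, atomic increment: %.1fns\n",
                    per_op(t0, t1), per_op(t1, t2));
        omp_destroy_lock(&lock);
    }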
It depends on the lock implementation, and on the system too. Atomic variables can't really be contested in the same way as a lock (not even if you are using acquire-release semantics); that is the whole point of atomicity. It locks the bus to propagate the store (depending on the memory-barrier mode), but that's an implementation detail.
However, most user-mode locks are just wrapped atomic ops. See this article by Intel for some figures on high-performance, scalable locks using atomic ops under x86 and x64, compared against Windows' CriticalSection locks. Unfortunately, no stats are to be found for the SWR locks, but one should always profile for one's own system/environment.
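To make "wrapped atomic ops" concrete: the uncontested fast path of a user-mode lock boils down to a single atomic read-modify-write. The following is a deliberately naive spinlock sketch built on std::atomic_flag, just to illustrate the idea; a real mutex adds backoff and a kernel wait path for the contended case.

    #include <atomic>

    class SpinLock {
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // test_and_set is one atomic read-modify-write; spin until the
            // flag was previously clear (i.e. the lock was free).
            while (flag_.test_and_set(std::memory_order_acquire)) {
                // busy-wait; a production lock would pause/yield here or
                // fall back to a futex-style kernel wait under contention
            }
        }
        void unlock() {
            flag_.clear(std::memory_order_release);
        }
    };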