Why is std::mutex so slow on OSX?

I have the following benchmark: https://gist.github.com/leifwalsh/10010580

Essentially it spins up k threads and then each thread does about 16 million / k lock/increment/unlock cycles, using a spinlock and a std::mutex. On OSX, the std::mutex is devastatingly slower than the spinlock when contended, whereas on Linux it's competitive or a bit faster.
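
For context, the linked benchmark boils down to something like the sketch below (a reconstruction, not the exact gist: the test-and-set spinlock, iteration count, and timing details are assumptions made for illustration):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Stand-in for the gist's spinlock: a simple test-and-set lock.
    struct spinlock {
        std::atomic<bool> locked{false};
        void lock()   { while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ } }
        void unlock() { locked.store(false, std::memory_order_release); }
    };

    // Run k threads that together perform ~16 million lock/increment/unlock cycles.
    template <typename Lock>
    long long run(int k, long long total_iters) {
        Lock lock;
        long long val = 0;
        auto worker = [&] {
            for (long long i = 0; i < total_iters / k; ++i) {
                lock.lock();
                ++val;
                lock.unlock();
            }
        };
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int i = 0; i < k; ++i) threads.emplace_back(worker);
        for (auto &t : threads) t.join();
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    }

    int main() {
        const long long iters = 16000000;  // about 16 million total cycles
        for (int k = 1; k <= 4; ++k)
            std::printf("spinlock %d:\t%lldms\n", k, run<spinlock>(k, iters));
        for (int k = 1; k <= 4; ++k)
            std::printf("std::mutex %d:\t%lldms\n", k, run<std::mutex>(k, iters));
    }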

OSX:

spinlock 1:     334ms
spinlock 2:     3537ms
spinlock 3:     4815ms
spinlock 4:     5653ms
std::mutex 1:   813ms
std::mutex 2:   38464ms
std::mutex 3:   44254ms
std::mutex 4:   47418ms

Linux:

spinlock 1:     305ms
spinlock 2:     1590ms
spinlock 3:     1820ms
spinlock 4:     2300ms
std::mutex 1:   377ms
std::mutex 2:   1124ms
std::mutex 3:   1739ms
std::mutex 4:   2668ms

The processors are different, but not that different (OSX is an Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz, Linux is an Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz), so this seems like a library or kernel problem. Does anyone know the source of the slowness?

To clarify my question, I understand that "there are different mutex implementations that optimize for different things and this isn't a problem, it's expected". The question is: what are the actual differences in implementation that cause this? Or, if it's a hardware issue (maybe the cache is just a lot slower on the MacBook), that's acceptable too.

asked Apr 06 '14 by leif

3 Answers

You're just measuring the library's choice of trading off throughput for fairness. The benchmark is heavily artificial and penalizes any attempt to provide any fairness at all.

The implementation can do two things. It can let the same thread get the mutex twice in a row, or it can change which thread gets the mutex. This benchmark heavily penalizes a change in threads because the context switch takes time and because ping-ponging the mutex and val from cache to cache takes time.

Most likely, this is just showing the different trade-offs that implementations have to make. It heavily rewards implementations that prefer to give the mutex back to the thread that last held it. The benchmark even rewards implementations that waste CPU to do that! It even rewards implementations that waste CPU to avoid context switches, even when there's other useful work the CPU could do! It also doesn't penalize the implementation for inter-core traffic, which can slow down other, unrelated threads.

Also, people who implement mutexes generally presume that performance in the uncontended case is more important than performance in the contended case. There are numerous tradeoffs you can make between these cases, such as presuming that there might be a thread waiting or specifically checking if there is. The benchmark tests only (or at least, almost only) the case that is typically traded off in favor of the case presumed more common.

Bluntly, this is a senseless benchmark that is incapable of identifying a problem.

The specific explanation is almost certainly that the Linux implementation is a spinlock/futex hybrid while the OSX implementation is conventional, equivalent to locking a kernel object. The spinlock portion of the Linux implementation favors allowing the same thread that just released the mutex to lock it again, which your benchmark heavily rewards.
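
For illustration, the kind of lock being described looks roughly like this on Linux (a sketch in the spirit of the classic futex-based mutex from Ulrich Drepper's "Futexes Are Tricky", with a short spin loop in front; the spin count is arbitrary and glibc's real pthread_mutex_t is considerably more involved):

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Illustrative spin-then-futex lock: 0 = unlocked, 1 = locked, 2 = locked with waiters.
    class hybrid_mutex {
        std::atomic<int> state{0};

        static long futex(std::atomic<int> *addr, int op, int val) {
            // Assumes std::atomic<int> has the same layout as int (true on Linux/x86).
            return syscall(SYS_futex, reinterpret_cast<int *>(addr), op, val, nullptr, nullptr, 0);
        }

    public:
        void lock() {
            // Fast path: an uncontended acquire is a single compare-and-swap, no syscall.
            int expected = 0;
            if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
                return;
            // Spin briefly in user space. A thread that just released the lock and
            // immediately relocks it usually wins here, which is exactly what the
            // benchmark rewards.
            for (int i = 0; i < 100; ++i) {
                expected = 0;
                if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
                    return;
            }
            // Slow path: mark the lock contended and sleep in the kernel until woken.
            while (state.exchange(2, std::memory_order_acquire) != 0)
                futex(&state, FUTEX_WAIT_PRIVATE, 2);
        }

        void unlock() {
            // If there may be sleepers, wake one of them.
            if (state.exchange(0, std::memory_order_release) == 2)
                futex(&state, FUTEX_WAKE_PRIVATE, 1);
        }
    };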

answered by David Schwartz


You need to use the same STL implementation on both systems. This could either be an issue in libc++ or in pthread_mutex_*().

However, what the other posters are saying about mutex locks being conventional on OS X is simply wrong. Yes, Mach locksets and semaphores require system calls for every operation. But unless you explicitly use the Mach lockset or semaphore APIs, these ARE NOT USED in your application.

OS X's libpthread uses __psynch_* BSD system calls, which roughly correspond to Linux futexes. In the uncontended case, libpthread makes no system call at all to acquire the mutex; only an atomic instruction such as cmpxchg is used.

Source: libpthread source code and my own knowledge (I'm the developer of Darling).
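
As a minimal illustration of that uncontended fast path (not libpthread's actual code; a plain spin stands in for the contended fallback, which in the real library would be a __psynch_*-style system call):

    #include <atomic>

    std::atomic<unsigned> lock_word{0};

    void lock() {
        unsigned expected = 0;
        // Uncontended case: one compare-and-swap (a cmpxchg on x86), no kernel entry.
        if (lock_word.compare_exchange_strong(expected, 1, std::memory_order_acquire))
            return;
        // Contended case only: this is where libpthread would enter the kernel via a
        // __psynch_*-style call to block. A spin stands in here to keep the sketch runnable.
        while (lock_word.exchange(1, std::memory_order_acquire) != 0) { /* spin */ }
    }

    void unlock() {
        lock_word.store(0, std::memory_order_release);
        // With real waiters recorded, a __psynch_*-style wake would follow here.
    }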

answered by LubosD


David Schwartz is basically correct, except for the throughput/adaptiveness comment. It is actually much faster on Linux because it uses a futex and the overhead of a contended call is much smaller. What this means is that in the uncontended case, it simply does a function call and an atomic operation and returns. If most of your locks are uncontended (which is the typical behaviour you'll see in many real-world programs), acquiring a lock is basically free. Even in the contended case, it's basically a function call, a syscall, an atomic operation, and adding one thread to a list (the syscall being the expensive part of the operation). If the mutex is released during the syscall, the function returns right away without enqueuing onto a wait-list.

On OSX, there is no futex. Acquiring a mutex always requires talking to the kernel. Moreover, OSX is a hybrid micro-kernel, which means that talking to the kernel involves sending it a message: you marshal the data, copy it into a separate buffer, and make a syscall. Then at some point the kernel comes along, unmarshals the data, acquires the lock, and sends you back a message. Thus, in the uncontended case it's much heavier-weight. In the contended case, it depends on how long you're blocked waiting for the lock: the longer you wait, the cheaper your lock operation becomes when amortized across total runtime.

On OSX, there is a much faster mechanism called dispatch queues, but it requires re-thinking how your program works. In addition to using lock-free synchronization (i.e. uncontended cases never jump to the kernel), dispatch queues also do thread pooling and job scheduling. Additionally, they provide asynchronous dispatch, which lets you schedule a job without needing to wait for a lock.
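
For a rough idea of what that looks like in practice, here is a hedged sketch using libdispatch's plain C API from C++ (the queue label is made up, and dispatch_async_f/dispatch_sync_f are used instead of blocks so it compiles as ordinary C++):

    #include <dispatch/dispatch.h>
    #include <cstdio>

    static long long val = 0;

    // Runs on the serial queue, so increments are never concurrent and need no lock.
    static void increment(void *) { ++val; }
    static void noop(void *) {}

    int main() {
        // A serial queue serializes its work items, playing the role of the mutex.
        dispatch_queue_t queue =
            dispatch_queue_create("com.example.counter", DISPATCH_QUEUE_SERIAL);

        for (int i = 0; i < 1000000; ++i) {
            // Asynchronous dispatch: the caller never blocks waiting for a lock;
            // the increment is queued and executed later by the thread pool.
            dispatch_async_f(queue, nullptr, increment);
        }

        // Drain the queue: a synchronous no-op returns only after everything
        // queued before it has run.
        dispatch_sync_f(queue, nullptr, noop);
        std::printf("val = %lld\n", val);

        dispatch_release(queue);
    }

This compiles on OS X with clang++ without extra link flags, since libdispatch is part of libSystem.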

answered by Vitali