Why is std::mutex so slow on OSX?

I have the following benchmark: https://gist.github.com/leifwalsh/10010580

Essentially it spins up k threads and then each thread does about 16 million / k lock/increment/unlock cycles, using a spinlock and a std::mutex. On OSX, the std::mutex is devastatingly slower than the spinlock when contended, whereas on Linux it's competitive or a bit faster.
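
For context, the linked benchmark boils down to something like the sketch below (a reconstruction, not the exact gist: the test-and-set spinlock, iteration count, and timing details are assumptions made for illustration):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Stand-in for the gist's spinlock: a simple test-and-set lock.
    struct spinlock {
        std::atomic<bool> locked{false};
        void lock()   { while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ } }
        void unlock() { locked.store(false, std::memory_order_release); }
    };

    // Run k threads that together perform ~16 million lock/increment/unlock cycles.
    template <typename Lock>
    long long run(int k, long long total_iters) {
        Lock lock;
        long long val = 0;
        auto worker = [&] {
            for (long long i = 0; i < total_iters / k; ++i) {
                lock.lock();
                ++val;
                lock.unlock();
            }
        };
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int i = 0; i < k; ++i) threads.emplace_back(worker);
        for (auto &t : threads) t.join();
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    }

    int main() {
        const long long iters = 16000000;  // about 16 million total cycles
        for (int k = 1; k <= 4; ++k)
            std::printf("spinlock %d:\t%lldms\n", k, run<spinlock>(k, iters));
        for (int k = 1; k <= 4; ++k)
            std::printf("std::mutex %d:\t%lldms\n", k, run<std::mutex>(k, iters));
    }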

OSX:

spinlock 1:     334ms
spinlock 2:     3537ms
spinlock 3:     4815ms
spinlock 4:     5653ms
std::mutex 1:   813ms
std::mutex 2:   38464ms
std::mutex 3:   44254ms
std::mutex 4:   47418ms

Linux:

spinlock 1:     305ms
spinlock 2:     1590ms
spinlock 3:     1820ms
spinlock 4:     2300ms
std::mutex 1:   377ms
std::mutex 2:   1124ms
std::mutex 3:   1739ms
std::mutex 4:   2668ms

The processors are different, but not that different (OSX is an Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz, Linux is an Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz), so this seems like a library or kernel problem. Does anyone know the source of the slowness?

To clarify my question, I understand that "there are different mutex implementations that optimize for different things and this isn't a problem, it's expected". The question is: what are the actual differences in implementation that cause this? Or, if it's a hardware issue (maybe the cache is just a lot slower on the MacBook), that's acceptable too.

asked Apr 06 '14 by leif

3 Answers

You're just measuring the library's choice of trading off throughput for fairness. The benchmark is heavily artificial and penalizes any attempt to provide any fairness at all.

The implementation can do two things. It can let the same thread get the mutex twice in a row, or it can change which thread gets the mutex. This benchmark heavily penalizes a change in threads because the context switch takes time and because ping-ponging the mutex and val from cache to cache takes time.

Most likely, this is just showing the different trade-offs that implementations have to make. It heavily rewards implementations that prefer to give the mutex back to the thread that last held it. The benchmark even rewards implementations that waste CPU to do that! It even rewards implementations that waste CPU to avoid context switches, even when there's other useful work the CPU could do! It also doesn't penalize the implementation for inter-core traffic, which can slow down other, unrelated threads.

Also, people who implement mutexes generally presume that performance in the uncontended case is more important than performance in the contended case. There are numerous tradeoffs you can make between these cases, such as presuming that there might be a thread waiting or specifically checking if there is. The benchmark tests only (or at least, almost only) the case that is typically traded off in favor of the case presumed more common.

Bluntly, this is a senseless benchmark that is incapable of identifying a problem.

The specific explanation is almost certainly that the Linux implementation is a spinlock/futex hybrid while the OSX implementation is conventional, equivalent to locking a kernel object. The spinlock portion of the Linux implementation favors allowing the same thread that just released the mutex to lock it again, which your benchmark heavily rewards.
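
For illustration, the kind of lock being described looks roughly like this on Linux (a sketch in the spirit of the classic futex-based mutex from Ulrich Drepper's "Futexes Are Tricky", with a short spin loop in front; the spin count is arbitrary and glibc's real pthread_mutex_t is considerably more involved):

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Illustrative spin-then-futex lock: 0 = unlocked, 1 = locked, 2 = locked with waiters.
    class hybrid_mutex {
        std::atomic<int> state{0};

        static long futex(std::atomic<int> *addr, int op, int val) {
            // Assumes std::atomic<int> has the same layout as int (true on Linux/x86).
            return syscall(SYS_futex, reinterpret_cast<int *>(addr), op, val, nullptr, nullptr, 0);
        }

    public:
        void lock() {
            // Fast path: an uncontended acquire is a single compare-and-swap, no syscall.
            int expected = 0;
            if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
                return;
            // Spin briefly in user space. A thread that just released the lock and
            // immediately relocks it usually wins here, which is exactly what the
            // benchmark rewards.
            for (int i = 0; i < 100; ++i) {
                expected = 0;
                if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
                    return;
            }
            // Slow path: mark the lock contended and sleep in the kernel until woken.
            while (state.exchange(2, std::memory_order_acquire) != 0)
                futex(&state, FUTEX_WAIT_PRIVATE, 2);
        }

        void unlock() {
            // If there may be sleepers, wake one of them.
            if (state.exchange(0, std::memory_order_release) == 2)
                futex(&state, FUTEX_WAKE_PRIVATE, 1);
        }
    };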

answered by David Schwartz


You need to use the same STL implementation on both systems. This could either be an issue in libc++ or in pthread_mutex_*().

However, what the other posters are saying about mutex locks being conventional on OS X is simply wrong. Yes, Mach locksets and semaphores require system calls for every operation. But unless you explicitly use the Mach lockset or semaphore APIs, these ARE NOT USED in your application.

OS X's libpthread uses __psynch_* BSD system calls, which roughly correspond to Linux futexes. In the uncontended case, libpthread makes no system call at all to acquire the mutex; only an atomic instruction such as cmpxchg is used.

Source: libpthread source code and my own knowledge (I'm the developer of Darling).
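
As a minimal illustration of that uncontended fast path (not libpthread's actual code; a plain spin stands in for the contended fallback, which in the real library would be a __psynch_*-style system call):

    #include <atomic>

    std::atomic<unsigned> lock_word{0};

    void lock() {
        unsigned expected = 0;
        // Uncontended case: one compare-and-swap (a cmpxchg on x86), no kernel entry.
        if (lock_word.compare_exchange_strong(expected, 1, std::memory_order_acquire))
            return;
        // Contended case only: this is where libpthread would enter the kernel via a
        // __psynch_*-style call to block. A spin stands in here to keep the sketch runnable.
        while (lock_word.exchange(1, std::memory_order_acquire) != 0) { /* spin */ }
    }

    void unlock() {
        lock_word.store(0, std::memory_order_release);
        // With real waiters recorded, a __psynch_*-style wake would follow here.
    }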

answered by LubosD


David Schwartz is basically correct, except for the throughput/adaptiveness comment. It is actually much faster on Linux because it uses a futex and the overhead of a contended call is much smaller. What this means is that in the uncontended case, it simply does a function call and an atomic operation and returns. If most of your locks are uncontended (which is the typical behaviour you'll see in many real-world programs), acquiring a lock is basically free. Even in the contended case, it's basically a function call, a syscall, an atomic operation, and adding one thread to a list (the syscall being the expensive part of the operation). If the mutex is released during the syscall, the function returns right away without enqueuing onto a wait-list.

On OSX, there is no futex. Acquiring a mutex always requires talking to the kernel. Moreover, OSX is a hybrid micro-kernel, which means that talking to the kernel involves sending it a message: you marshal the data, copy it into a separate buffer, and make a syscall. Then at some point the kernel comes along, unmarshals the data, acquires the lock, and sends you back a message. Thus, in the uncontended case it's much heavier-weight. In the contended case, it depends on how long you're blocked waiting for the lock: the longer you wait, the cheaper your lock operation becomes when amortized across total runtime.

On OSX, there is a much faster mechanism called dispatch queues, but it requires re-thinking how your program works. In addition to using lock-free synchronization (i.e. uncontended cases never jump to the kernel), dispatch queues also do thread pooling and job scheduling. Additionally, they provide asynchronous dispatch, which lets you schedule a job without needing to wait for a lock.
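
For a rough idea of what that looks like in practice, here is a hedged sketch using libdispatch's plain C API from C++ (the queue label is made up, and dispatch_async_f/dispatch_sync_f are used instead of blocks so it compiles as ordinary C++):

    #include <dispatch/dispatch.h>
    #include <cstdio>

    static long long val = 0;

    // Runs on the serial queue, so increments are never concurrent and need no lock.
    static void increment(void *) { ++val; }
    static void noop(void *) {}

    int main() {
        // A serial queue serializes its work items, playing the role of the mutex.
        dispatch_queue_t queue =
            dispatch_queue_create("com.example.counter", DISPATCH_QUEUE_SERIAL);

        for (int i = 0; i < 1000000; ++i) {
            // Asynchronous dispatch: the caller never blocks waiting for a lock;
            // the increment is queued and executed later by the thread pool.
            dispatch_async_f(queue, nullptr, increment);
        }

        // Drain the queue: a synchronous no-op returns only after everything
        // queued before it has run.
        dispatch_sync_f(queue, nullptr, noop);
        std::printf("val = %lld\n", val);

        dispatch_release(queue);
    }

This compiles on OS X with clang++ without extra link flags, since libdispatch is part of libSystem.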

answered by Vitali