I am looking for a lock implementation that degrades gracefully in the situation where you have two threads that constantly try to release and re-acquire the same lock, at a very high frequency.
Of course it is clear that in this case the two threads won't significantly progress in parallel. Theoretically, the best result would be achieved by running the whole thread 1, and then the whole thread 2, without any switching---because switching just creates massive overhead here. So I am looking for a lock implementation that would handle this situation gracefully by keeping the same thread running for a while before switching, instead of constantly switching.
As I would myself be tempted to answer this question by "your program is broken, don't do that", here is some justification about why we end up in this kind of situation.
The lock is a "single global lock", i.e. a very coarse lock. (It is the Global Interpreter Lock (GIL) inside PyPy, but the question is about how to do it in general, say if you have a C program.)
We have the following situation:
There is constantly contention. That's expected in this case: the lock is a global lock that needs to be acquired for most threads to progress. So we expect that a large fraction of them are waiting for the lock. Only one of these threads can progress.
The thread that holds the lock might do sometimes bursts of short releases. A typical example would be if this thread does repeated calls to "something external", e.g. many short writes to a file. Each of these writes is usually completed very quickly. The lock still has to be released just in case this external thing turns out to take longer than expected (e.g. if the write actually needs to wait for disk I/O), so that another thread can acquire the lock in this case.
If we use some standard mutex for the lock, then the lock will often switch to another thread as soon as the owner releases the lock. But the problem is what if the program runs several threads that each wants to do a long burst of short releases. The program ends up spending most of its time switching the lock between CPUs.
It is much faster to run the same thread for a while before switching, at least as long as the lock is released for very short periods of time. (E.g. on Linux/pthread a release immediately followed by an acquire will sometimes re-acquire the lock instantly even if there are other waiting threads; but we'd like this result in a large majority of cases, not just sometimes.)
Of course, as soon as the lock is released for a longer period of time, then it becomes a good idea to transfer ownership of the lock to a different thread.
So I'm looking for general ideas about how to do that. I guess it should exist already somewhere---in a paper, or in some multithreading library?
For reference, PyPy tries to implement something like this by polling: the lock is just a global variable, with synchronized compare-and-swap but no OS calls; one of the waiting threads is given the role of "stealer"; that "stealer" thread wakes up every 100 microseconds to check the variable. This is not horribly bad (it costs maybe 1-2% of CPU time in addition to the 100% consumed by the running thread). This actually implements what I'm asking for here, but the problem is that this is a hack that doesn't cleanly support more traditional cases of locks: for example, if thread 1 tries to send a message to thread 2 and wait for the answer, the two thread switches will take in average 100 microseconds each---which is far too much if the message is processed quickly.
lock contention: this occurs whenever one process or thread attempts to acquire a lock held by another process or thread. The more fine-grained the available locks, the less likely one process/thread will request a lock held by the other.
A highly contended lock can become a bottleneck in the system, quickly limiting its performance. Of course, the locks are also required to prevent the system from tearing itself to shreds, so a solution to high contention must continue to provide the necessary concurrency protection.
Contention. Attempting to lock an already locked mutex is called contention. In a well-planned program, contention should be quite low. You should design your code so that most attempts to lock the mutex will not block.
Using std::mutex takes about 74 machine cycles, while using a native Win32 CRITICAL_SECTION takes about 53 machine cycles. So unless 100 machine cycles is a significant amount of time compared to the code itself, the mutexes aren't going to be the source of a performance problem.
For reference, let me describe how we finally implemented it. I was unsure about it as it still feels like a hack, but it seems to work for PyPy's use case in practice.
We did it as described in the last paragraph of the question, with one addition: the "stealer" thread, which checks some global variable every 100 microseconds, does this by calling pthread_cond_timedwait
or WaitForSingleObject
with a regular, system-provided mutex, with a timeout of 100 microseconds. This gives a "composite lock" with both the global variable and the regular mutex. The "stealer" will succeed in stealing the "lock" if either it notices a value 0 is the global variable (every 100 microseconds), or immediately if the regular mutex is released by another thread.
It's then a matter of choosing how to release the composite lock in a case-by-case basis. Most external functions (writes to files, etc.) are expected to generally complete quickly, and so we release and re-acquire the composite lock by writing to the global variable. Only in a few specific function cases---like sleep() or lock_acquire()---we expect the calling thread to often block; around these functions, we release the composite lock by actually releasing the mutex instead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With