Lock that handles a high-contention, high-frequency situation

Tags:

c

multithreading

I am looking for a lock implementation that degrades gracefully in the situation where you have two threads that constantly try to release and re-acquire the same lock, at a very high frequency.

Of course, in this case the two threads clearly won't make significant progress in parallel. Theoretically, the best result would be achieved by running all of thread 1 and then all of thread 2, without any switching, because switching just creates massive overhead here. So I am looking for a lock implementation that handles this situation gracefully by keeping the same thread running for a while before switching, instead of switching constantly.

Long version of the question

As I would myself be tempted to answer this question with "your program is broken, don't do that", here is some justification for why we end up in this kind of situation.

The lock is a "single global lock", i.e. a very coarse lock. (It is the Global Interpreter Lock (GIL) inside PyPy, but the question is about how to do it in general, say if you have a C program.)

We have the following situation:

There is constant contention. That's expected in this case: the lock is a global lock that most threads need to acquire in order to progress. So we expect a large fraction of them to be waiting for the lock. Only one of these threads can progress.
The thread that holds the lock might sometimes do bursts of short releases. A typical example would be if this thread makes repeated calls to "something external", e.g. many short writes to a file. Each of these writes usually completes very quickly. The lock still has to be released just in case this external thing turns out to take longer than expected (e.g. if the write actually needs to wait for disk I/O), so that another thread can acquire the lock in this case.
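
To make the "burst of short releases" pattern concrete, here is a minimal sketch with a plain pthread mutex. The names `global_lock`, `external_call` and `burst_of_short_releases` are made up for illustration, and `external_call` is only a stand-in for something like a short write to a file.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical single global lock (illustrative name). */
pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for a short external call such as a file write.
   It returns the number of units of work it completed. */
size_t external_call(size_t n)
{
    return n;
}

/* Must be called with global_lock held; returns with it held again.
   The lock is dropped around every external call, just in case that
   call blocks, and re-acquired right after it. */
size_t burst_of_short_releases(size_t count)
{
    size_t done = 0;
    for (size_t i = 0; i < count; i++) {
        pthread_mutex_unlock(&global_lock);  /* a waiter may take over here */
        done += external_call(1);
        pthread_mutex_lock(&global_lock);    /* may have to wait for a waiter */
    }
    return done;
}
```

With several threads running such a loop at the same time, every iteration is a chance for the lock to move to another CPU, which is the overhead described here.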

If we use some standard mutex for the lock, then the lock will often switch to another thread as soon as the owner releases it. The problem arises when the program runs several threads that each want to do a long burst of short releases: the program ends up spending most of its time switching the lock between CPUs.

It is much faster to run the same thread for a while before switching, at least as long as the lock is released for very short periods of time. (E.g. on Linux/pthread a release immediately followed by an acquire will sometimes re-acquire the lock instantly even if there are other waiting threads; but we'd like this result in a large majority of cases, not just sometimes.)

Of course, as soon as the lock is released for a longer period of time, then it becomes a good idea to transfer ownership of the lock to a different thread.

So I'm looking for general ideas about how to do that. I guess it should exist already somewhere---in a paper, or in some multithreading library?

For reference, PyPy tries to implement something like this by polling: the lock is just a global variable, with synchronized compare-and-swap but no OS calls; one of the waiting threads is given the role of "stealer"; that "stealer" thread wakes up every 100 microseconds to check the variable. This is not horribly bad (it costs maybe 1-2% of CPU time in addition to the 100% consumed by the running thread). This actually implements what I'm asking for here, but the problem is that it is a hack that doesn't cleanly support more traditional uses of locks: for example, if thread 1 tries to send a message to thread 2 and wait for the answer, the two thread switches will take on average 100 microseconds each, which is far too much if the message is processed quickly.
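
A rough sketch of that polling scheme, assuming C11 atomics and POSIX `nanosleep`. This is not PyPy's actual source: the names `gil_held`, `gil_try_acquire`, `gil_release` and `gil_steal` are invented for illustration, and the part that picks a single "stealer" among several waiting threads is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical global lock variable: 0 = free, 1 = held. */
atomic_int gil_held = 0;

/* One compare-and-swap attempt, no OS call. Returns 1 on success. */
int gil_try_acquire(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&gil_held, &expected, 1);
}

/* Release by writing to the variable; again no OS call, so a release
   immediately followed by gil_try_acquire() is very cheap. */
void gil_release(void)
{
    int previous = atomic_exchange(&gil_held, 0);
    assert(previous == 1);   /* releasing a lock that is not held is a bug */
    (void)previous;
}

/* The "stealer": check the variable, sleep 100 microseconds, repeat.
   Assumes only one thread at a time plays this role. */
void gil_steal(void)
{
    const struct timespec poll_interval = { .tv_sec = 0, .tv_nsec = 100 * 1000 };
    while (!gil_try_acquire())
        nanosleep(&poll_interval, NULL);
}
```

The owner would call `gil_release()` before a short external call and `gil_try_acquire()` right after it, falling back to `gil_steal()` only if the lock was taken in between. The weakness mentioned above is visible in `gil_steal()`: a waiter never notices a release sooner than its next wake-up.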


asked Jul 12 '16 17:07

Armin Rigo

1 Answer

For reference, let me describe how we finally implemented it. I was unsure about it as it still feels like a hack, but it seems to work for PyPy's use case in practice.

We did it as described in the last paragraph of the question, with one addition: the "stealer" thread, which checks some global variable every 100 microseconds, does this by calling pthread_cond_timedwait or WaitForSingleObject on a regular, system-provided mutex, with a timeout of 100 microseconds. This gives a "composite lock" made of both the global variable and the regular mutex. The "stealer" succeeds in stealing the "lock" either when it notices the value 0 in the global variable (checked every 100 microseconds), or immediately when the regular mutex is released by another thread.

It's then a matter of choosing how to release the composite lock on a case-by-case basis. Most external functions (writes to files, etc.) are expected to generally complete quickly, and so we release and re-acquire the composite lock by writing to the global variable. Only for a few specific functions, such as sleep() or lock_acquire(), do we expect the calling thread to block often; around these functions, we release the composite lock by actually releasing the mutex instead.
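
For illustration, here is a simplified sketch of such a composite lock on POSIX. It is not the actual PyPy code: the type `composite_lock` and all function names are invented, the "regular mutex" part is modelled as a flag protected by a pthread mutex and condition variable so that `pthread_cond_timedwait` can be used, and the queue that lets only one waiter act as "stealer" is omitted. The lock counts as held when both parts are 1; a release sets exactly one of them to 0.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

typedef struct {
    atomic_int fast_held;    /* the global variable; 0 = free to steal */
    int slow_held;           /* the "regular mutex" part; 0 = released */
    pthread_mutex_t mutex;   /* protects slow_held */
    pthread_cond_t cond;     /* signalled when slow_held becomes 0 */
} composite_lock;

/* Initially free through the fast part. */
#define COMPOSITE_LOCK_INIT \
    { 0, 1, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }

/* Compare-and-swap on the global variable. Returns 1 on success. */
int composite_try_fast(composite_lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->fast_held, &expected, 1);
}

/* Acquire (or re-acquire) the lock. Each round checks the variable, then
   waits up to 100 microseconds for a slow release, which wakes it at once. */
void composite_acquire(composite_lock *l)
{
    for (;;) {
        if (composite_try_fast(l))
            return;
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 100 * 1000;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }
        pthread_mutex_lock(&l->mutex);
        int rc = 0;
        while (l->slow_held && rc != ETIMEDOUT)
            rc = pthread_cond_timedwait(&l->cond, &l->mutex, &deadline);
        int got_it = !l->slow_held;
        if (got_it)
            l->slow_held = 1;
        pthread_mutex_unlock(&l->mutex);
        if (got_it)
            return;
    }
}

/* Short release, e.g. around a file write: no OS call. */
void composite_release_fast(composite_lock *l)
{
    atomic_store(&l->fast_held, 0);
}

/* Long release, e.g. around sleep(): wakes a waiter immediately. */
void composite_release_slow(composite_lock *l)
{
    pthread_mutex_lock(&l->mutex);
    assert(l->slow_held);
    l->slow_held = 0;
    pthread_cond_signal(&l->cond);
    pthread_mutex_unlock(&l->mutex);
}
```

A thread would wrap a quick external call in `composite_release_fast()` / `composite_acquire()`, and a call that is likely to block in `composite_release_slow()` / `composite_acquire()`. In the first case a waiter needs up to 100 microseconds to notice the release, so the owner usually gets the lock back; in the second case the waiter is woken right away.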

answered Nov 09 '22 05:11

Armin Rigo


