
Releasing multiple locks without causing priority inversion

Short version: How do I release multiple locks from a single thread, without being preempted halfway through?

I have a program which is designed to run on an N-core machine. It consists of one main thread and N worker threads. Each thread (including the main thread) has a semaphore it can block on. Normally, each worker thread is blocked on decrementing its semaphore, and the main thread is running. Every now and then, though, the main thread should wake up the worker threads to do their thing for a certain amount of time, then block on its own semaphore waiting for them all to go back to sleep. Like so:

def main_thread(n):
    main_semaphore = semaphore(0) # create before any worker can use it
    for i = 1 to n:
        worker_semaphore[i] = semaphore(0)
        spawn_thread(worker_thread, i)

    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_semaphore[i].increment() # wake up worker i
        for i in workers_to_wake:
            main_semaphore.decrement() # wait for all workers

def worker_thread(i):
    while True:
        worker_semaphore[i].decrement() # wait to be woken
        ...do some work...
        main_semaphore.increment() # report done with step
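
For anyone who wants to experiment with the scheme above, here is a minimal runnable sketch of it in Python using `threading.Semaphore`. The `pick_workers` helper is a hypothetical stand-in for `foo()`, and the `stop` event and the fixed number of rounds are my additions so that the program terminates; none of them are part of the original design. Under CPython this only illustrates the control flow, not the scheduling behaviour on a real N-core workload.

```python
import threading

def pick_workers(round_no, n_workers):
    # Hypothetical stand-in for foo(): alternate between even and odd workers.
    return [i for i in range(n_workers) if (i + round_no) % 2 == 0]

def run_rounds(n_workers, n_rounds):
    worker_semaphore = [threading.Semaphore(0) for _ in range(n_workers)]
    main_semaphore = threading.Semaphore(0)
    stop = threading.Event()          # added so the sketch can terminate
    completed = [0] * n_workers       # steps finished by each worker

    def worker_thread(i):
        while True:
            worker_semaphore[i].acquire()   # wait to be woken
            if stop.is_set():
                return
            completed[i] += 1               # stand-in for "do some work"
            main_semaphore.release()        # report done with step

    threads = [threading.Thread(target=worker_thread, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()

    for round_no in range(n_rounds):
        workers_to_wake = pick_workers(round_no, n_workers)
        for i in workers_to_wake:
            # A woken worker may get to run before this loop has finished.
            worker_semaphore[i].release()
        for _ in workers_to_wake:
            main_semaphore.acquire()        # wait for all woken workers

    stop.set()
    for sem in worker_semaphore:
        sem.release()                       # wake everyone so they can exit
    for t in threads:
        t.join()
    return completed
```

Calling `run_rounds(4, 3)` wakes workers 0 and 2, then 1 and 3, then 0 and 2 again, and returns the per-worker step counts.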

All well and good. The problem is that one of the woken workers may end up preempting the main thread halfway through waking the workers. This can happen, for instance, when the Windows scheduler decides to boost that worker's priority. It doesn't lead to deadlock, but it is inefficient, because the remaining threads stay asleep until the preempting worker finishes its work. It's basically priority inversion, with the main thread waiting on one of the workers, and some of the worker threads waiting on the main thread.

I can probably figure out OS- and scheduler-specific hacks for this, such as disabling priority boosting under Windows, and fiddling about with thread priorities and processor affinities, but I'd like something cross-platform-ish and robust and clean. So: How can I wake up a bunch of threads atomically?

Sneftel asked Jun 16 '16 18:06


1 Answer

TL;DR

If you really have to get as much as you can out of your workers, just use an event semaphore, a control block and a barrier instead of your semaphores. Note, however, that this is a more fragile solution, so you need to weigh any potential gains against that downside.

Context

First I need to summarize the broader context in our discussion...

You have a Windows graphical application. It has a desired frame rate and so you need the main thread to run at that rate, scheduling all your workers at precisely timed intervals so that they have completed their work within the refresh interval. This means you have very tight constraints on the start and execution times for each thread. In addition, your worker threads are not all identical, so you can't just use a single work queue.

The problem

Like any modern operating system, Windows has a variety of synchronization primitives. However, none of these directly provides a mechanism for notifying multiple primitives at once. Looking through other operating systems, I see a similar pattern: they all provide ways of waiting on multiple primitives, but none provides an atomic way of triggering them.

So what can we do instead? The problems you need to solve are:

  1. Precisely timing the start of all required workers.
  2. Prodding the workers that actually need to run in the next frame.

Options

The most obvious solution for issue 1 is just to use a single event semaphore, but you could also use a read/write lock (by acquiring the write lock after the workers have finished and getting the workers to use a read lock). All other options are no longer atomic and so will need further synchronization to force the threads to do what you want - like lossleader's suggestion for locks inside your semaphores.
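
To illustrate the single-event idea for issue 1, here is a small Python sketch using `threading.Event`, which I am treating as a rough analogue of a manual-reset event; that mapping is my assumption, not something from the question. The `ready` semaphore and the `release_all` name exist only for this demonstration. The point is that one `set()` call makes every waiter runnable, so there is no loop in the main thread that can be preempted halfway through.

```python
import threading

def release_all(n_workers):
    start = threading.Event()         # one event shared by every worker
    ready = threading.Semaphore(0)    # demo only: lets main know workers are up
    woken = []
    woken_lock = threading.Lock()

    def worker(i):
        ready.release()               # announce that this worker has started
        start.wait()                  # block until the main thread sets the event
        with woken_lock:
            woken.append(i)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for _ in range(n_workers):
        ready.acquire()               # wait until every worker is running

    start.set()                       # a single call releases all waiters
    for t in threads:
        t.join()
    return sorted(woken)
```

The order in which the workers actually resume is still up to the scheduler; the sketch only shows that none of them depends on a second wake-up call.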

But we want a solution that keeps context switches to a minimum, given the tight time constraints on your application, so let's see if either of these can be used to solve problem 2... How can the main thread pick which worker threads should run if we just have an event semaphore or a read/write lock?

Well... A read/write lock is a great way for one thread to write some critical data to a control block and for many others to read from it. Why not just have a simple array of boolean flags (one for each worker thread) that your main thread updates each frame? Sadly you still need to stop execution of the workers until the timer pops. In short we're back at the semaphore and lock solution again.

However, owing to the nature of your application, you can make one more step. You can rely on the fact that you know your workers are not running outside of your time slicing and use an event semaphore as a crude form of lock instead.

A final optimization (if your environment supports it) is to use a barrier instead of the main semaphore. You know that all n threads need to be idle before you can continue, so just insist on it.

A solution

Applying the above, your pseudo-code would then look something like this:

def main_thread(n):
    main_event = event()
    main_barrier = barrier(n+1) # n workers plus the main thread; create before spawning
    for i = 1 to n:
        worker_scheduled[i] = False
        spawn_thread(worker_thread, i)

    while True:
        ...do some work...
        workers_to_wake = foo()
        for i in workers_to_wake:
            worker_scheduled[i] = True
        main_event.set()
        main_barrier.enter() # wait for all workers
        main_event.reset()

def worker_thread(i):
    while True:
        main_event.wait()
        if worker_scheduled[i]:
            worker_scheduled[i] = False
            ...do some work...
        main_barrier.enter() # report finished for this frame.
        main_event.reset() # to catch the case that a worker is scheduled before the main thread

Since there is no explicit policing of the worker_scheduled array, this is a much more fragile solution.
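
If you want to try the shape of this in something runnable, here is a hedged Python sketch. Note that it is not a literal translation: it swaps the manual-reset event for a `threading.Condition` plus a frame counter, so no thread ever has to reset anything and a slow worker cannot clear a wake-up meant for the next frame. That swap, the `schedule` argument (a stand-in for what `foo()` returns each frame) and the `done` flag are my assumptions. It keeps the two ideas from above: a single broadcast to wake everyone, and a barrier of n+1 parties to end the frame.

```python
import threading

def run_frames(n_workers, schedule):
    """schedule: one collection of worker indices per frame (assumed input)."""
    frame_cond = threading.Condition()
    frame_no = 0                          # guarded by frame_cond
    done = False                          # guarded by frame_cond
    worker_scheduled = [False] * n_workers
    barrier = threading.Barrier(n_workers + 1)   # workers plus main thread
    completed = [0] * n_workers

    def worker_thread(i):
        seen = 0                          # last frame this worker took part in
        while True:
            with frame_cond:
                frame_cond.wait_for(lambda: done or frame_no > seen)
                if done:
                    return
                seen = frame_no
            if worker_scheduled[i]:
                worker_scheduled[i] = False
                completed[i] += 1         # stand-in for "do some work"
            barrier.wait()                # report finished for this frame

    threads = [threading.Thread(target=worker_thread, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()

    for wanted in schedule:
        for i in wanted:
            worker_scheduled[i] = True    # workers are idle here, so no lock needed
        with frame_cond:
            frame_no += 1
            frame_cond.notify_all()       # one broadcast wakes every worker
        barrier.wait()                    # wait for all workers

    with frame_cond:
        done = True
        frame_cond.notify_all()           # let the workers exit
    for t in threads:
        t.join()
    return completed
```

For example, `run_frames(4, [{0}, {0, 1}, {0, 1, 2}])` runs three frames and returns how many frames each worker did work in. The condition variable costs a lock acquisition per wake-up, so it trades a little of the raw speed discussed above for not having to reason about reset ordering.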

I would therefore personally only use it if I had to squeeze every last ounce of processing out of my CPU, but it sounds like that is exactly what you are looking for.

Peter Brittain answered Oct 23 '22 03:10