Say we have a single-producer, single-consumer lockless queue, and the producer may go long periods without producing any data. It would be beneficial to let the consumer thread sleep when there is nothing in the queue (for the power savings and to free up the CPU for other processes/threads). If the queue were not lockless, the straightforward way to solve this would be: the producing thread locks a mutex, does its work, signals a condition variable, and unlocks; the reading thread locks the mutex, waits on the condition variable, does its reading, then unlocks. But if we're using a lockless queue, using a mutex in exactly the same way would throw away the performance we gained by choosing a lockless queue in the first place.
The naive solution is to have the producer, after each insertion into the queue, lock the mutex, signal the condition variable, then unlock the mutex, keeping the actual work (the insertion into the queue) completely outside the lock. The consumer does the same: lock the mutex, wait on the condition variable, unlock it, pull everything off the queue, then repeat, keeping the reading of the queue outside the lock. There's a race condition here, though: between the reader pulling off the queue and going to sleep, the producer may have inserted an item into the queue. The reader then goes to sleep anyway, and may stay asleep indefinitely until the producer inserts another item and signals the condition variable again. This means particular items can occasionally seem to take a very long time to travel through the queue. If your queue is constantly active this may not be a problem, but if it were always active you could probably drop the condition variable entirely.
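To make the race concrete, here is a minimal sketch of that naive consumer loop. The LockFreeQueue type and its push/try_pop methods are placeholders for whatever non-blocking insert/pop your queue provides, not any specific library's API:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical lock-free SPSC queue interface (placeholder only);
// try_pop returns false when the queue is empty.
template <typename T>
struct LockFreeQueue {
    void push(const T&);
    bool try_pop(T& out);
    // ...
};

template <typename T, typename Fn>
void naive_consumer(LockFreeQueue<T>& q, std::mutex& m,
                    std::condition_variable& cv, Fn process)
{
    for (;;) {
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock);            // sleep until the producer signals
        }
        T item;
        while (q.try_pop(item))       // drain entirely outside the lock
            process(item);
        // RACE: if the producer inserts and signals right here, before the
        // next wait(), that item sits unprocessed until the *next* signal.
    }
}
```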
AFAICT the solution is for the producer to behave the same as if it were working with a regular needs-locking queue: lock the mutex, insert into the lockless queue, signal the condition variable, unlock. The consumer, however, should behave differently. When it wakes, it should unlock the mutex immediately instead of waiting until it has read the queue. Then it should pull as much off the queue as it can and process it. Finally, only when the consumer is thinking of going to sleep should it lock the mutex, check whether there's any data, and if so unlock and process it, or if not wait on the condition variable. This way the mutex is contended less often than it would be with a lockfull queue, but there's no risk of going to sleep with data still left on the queue.
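A minimal sketch of that arrangement, reusing the placeholder LockFreeQueue/push/try_pop names from the sketch above:

```cpp
// Producer: insert while holding the mutex, exactly as with a locking queue.
template <typename T>
void produce(LockFreeQueue<T>& q, std::mutex& m,
             std::condition_variable& cv, const T& value)
{
    std::lock_guard<std::mutex> lock(m);
    q.push(value);                       // the lock-free insert itself
    cv.notify_one();
}

// Consumer: drain without the mutex; take it only when about to sleep.
template <typename T, typename Fn>
void consume_loop(LockFreeQueue<T>& q, std::mutex& m,
                  std::condition_variable& cv, Fn process)
{
    for (;;) {
        T item;
        while (q.try_pop(item))          // fast path: no mutex touched
            process(item);

        std::unique_lock<std::mutex> lock(m);
        if (q.try_pop(item)) {           // re-check under the lock
            lock.unlock();               // data arrived: don't sleep
            process(item);
            continue;
        }
        cv.wait(lock);                   // truly empty: safe to sleep, since
                                         // the producer can't insert until
                                         // we release the mutex in wait()
    }                                    // lock released; drain again on wake
}
```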
Is this the best way to do it? Are there alternatives?
Note: By 'fastest' I really mean 'fastest without dedicating a core to checking the queue over and over,' but that wouldn't fit in the title ;p
One alternative: Go with the naive solution, but have the consumer wait on the condition variable with a timeout corresponding to the maximum latency you are willing to tolerate for an item traveling through the queue. If the desired timeout is fairly short though, it may be below the minimum wait time for your OS or still consume too much CPU.
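With std::condition_variable this is just wait_for in place of wait; a sketch, using the same placeholder queue names as the examples above:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Naive consumer plus a bounded wait: a missed signal delays an item by at
// most max_latency rather than indefinitely.
template <typename T, typename Fn>
void consume_with_timeout(LockFreeQueue<T>& q, std::mutex& m,
                          std::condition_variable& cv, Fn process,
                          std::chrono::milliseconds max_latency)
{
    for (;;) {
        T item;
        while (q.try_pop(item))
            process(item);

        std::unique_lock<std::mutex> lock(m);
        cv.wait_for(lock, max_latency);   // wakes on signal *or* timeout
    }
}
```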
I'm assuming you're using the lock-free single-producer single-consumer queue from the Dr Dobbs article - or something similar - so I'll use the terminology from there.
In that case, your suggested answer in the paragraph that starts "AFAICT" is good, but I think it can be optimised slightly:
- In the producer, take a copy of last, call it saved_last.
- Insert new_item as usual, then take a copy of the divider pointer, call it saved_divider.
- If saved_divider is equal to new_item, the object you just inserted, then your object has already been consumed, and your work is done.
- If saved_divider is not equal to saved_last, then you don't need to wake up the consumer. This is because:
  - at the time you inserted your object, the consumer (divider) had not yet reached either new_item or saved_last;
  - since then, last has only had those two values;
  - the consumer only goes to sleep when divider is equal to last.
- Otherwise, the consumer may be asleep or about to sleep, so lock the mutex, signal the condition variable, and unlock, as in your original scheme.
This ensures that, in the case where the consumer tends to remain busy, you avoid locking the mutex when you know the consumer is still awake (and not about to sleep). It also minimises the time when the mutex is held, to further reduce the possibility for contention.
The above explanation is quite wordy (because I wanted to include the explanation of why it works, rather than just what the algorithm is), but the code resulting from it should be quite simple.
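For instance, the producer side might look like the sketch below. It assumes a Sutter-style SPSC queue whose divider and last node pointers can be read atomically, and an insert that returns the newly appended node; those accessor names are illustrative assumptions, not the article's actual interface:

```cpp
#include <condition_variable>
#include <mutex>

// Assumed accessors on a Sutter-style lock-free SPSC queue (names are
// placeholders for illustration only):
//   q.last()    -> atomic read of the tail node pointer
//   q.divider() -> atomic read of the consumer's current position
//   q.insert(v) -> lock-free append; returns the newly created node
template <typename Queue, typename T>
void produce_optimised(Queue& q, std::mutex& m,
                       std::condition_variable& cv, const T& value)
{
    auto* saved_last = q.last();            // 1. snapshot the tail

    auto* new_item = q.insert(value);       // 2. ordinary lock-free insert,
    auto* saved_divider = q.divider();      //    then snapshot the consumer

    if (saved_divider == new_item)          // 3. already consumed: done
        return;

    if (saved_divider != saved_last)        // 4. consumer hadn't caught up
        return;                             //    with the old tail, so it is
                                            //    still awake and will see
                                            //    new_item before sleeping

    // 5. Consumer may be asleep or about to sleep: briefly take the mutex
    //    (serialising with its check-then-wait) and signal.
    std::lock_guard<std::mutex> lock(m);
    cv.notify_one();
}
```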
Of course whether it's actually worth doing will depend on a lot of things, and I'd encourage you to measure if performance is critical for you. Good implementations of the mutex/condvar primitives internally use atomic operations for most cases, so they generally only make a kernel call (the most expensive bit!) if there's a need to - i.e. if there's a need to block, or there are definitely some threads waiting - so the time saved by not calling the mutex functions may only amount to the overhead of the library calls.