The POSIX documentation (IEEE 1003.1, 2013) for the pthread_cond_timedwait
function says:
It is important to note that when pthread_cond_wait() and pthread_cond_timedwait() return without error, the associated predicate may still be false. Similarly, when pthread_cond_timedwait() returns with the timeout error, the associated predicate may be true due to an unavoidable race between the expiration of the timeout and the predicate state change.
(emphasis mine)
We all know the story that the predicates controlled by condition variables should be checked in a while loop, and there may be spurious wake ups. But my question is about the unavoidable word -- it's a strong word. Why is such a race not avoidable?
Note that if such a race didn't exist, we could just check if pthread_cond_timedwait timed out; instead of checking the predicate again, and only then handle the timeout condition. (Assuming, of course, that we get signalled only 1) with the mutex held and 2) when the predicate actually changes.)
Wouldn't it suffice to atomically check, with the "user mutex" held, if we got woken by timeout or were signalled?
For instance, let's consider an implementation of condition variables built on top of POSIX ones. (Error handling and initialization is omitted, you can fill in the obvious gaps).
class CV
{
pthread_mutex_t mtx;
pthread_cond_t cv;
int waiters; // how many threads are sleeping
int wakeups; // how many times this cv got signalled
public:
CV();
~CV();
// returns false if it timed out, true otherwise
bool wait(Mutex *userMutex, struct timespec *timeout)
{
pthread_mutex_lock(&mtx);
waiters++;
const int oldWakeups = wakeups;
userMutex->unlock();
int ret; // 0 on success, non-0 on timeout
for (;;) {
ret = pthread_cond_timedwait(&mtx, &cv, timeout);
if (!(ret == 0 && wakeups == 0))
break; // not spurious
}
if (ret == 0) // not timed out
wakeups--;
pthread_mutex_unlock(&mtx);
userMutex->lock();
pthread_mutex_lock(&mtx);
waiters--;
if (ret != 0 && wakeups > oldWakeups) {
// got a wakeup after a timeout: report the wake instead
ret = 0;
wakeups--;
}
pthread_mutex_unlock(&mtx);
return (ret == 0);
}
void wake()
{
pthread_mutex_lock(&mtx);
wakeups = min(wakeups + 1, waiters);
pthread_cond_signal(&cv);
pthread_mutex_unlock(&mtx);
}
};
It is possible to show that
CV::wait
reports a timeout, then we did not get signalled and thus the predicate didn't change; and thatDoes the code above contain some serious mistake? If not, is the standard wrong at saying the race is unavoidable, or must it do some other assumption that I'm missing?
First off, note that this has a generally dangerous part:
pthread_mutex_unlock(&mtx);
// Trouble is here
userMutex->lock();
pthread_mutex_lock(&mtx);
At the commented point, anything can happen. You have no locks held. The power of a condition variable is that they are always either holding the lock, or waiting.
Then there is the issue at hand, unavoidable races
if (ret != 0 && wakeups > oldWakeups) {
// got a wakeup after a timeout: report the wake instead
ret = 0;
wakeups--;
}
There is no guarantee of what order a bunch of pthread_cond_t's waiting will be woken up, which wreaks havoc on your counting
Thread1 Thread2 Thread3
{lock userMtx in calling code}
{lock mtx}
waiters++ (=1)
oldWakeups = 0
{unlock userMtx }
wait {unlock mtx}
{lock userMtx in calling code}
{lock mtx}
signal_all
wakeups = 1
{unlock mtx}
{unlock userMtx in calling code}
timeout(unavoid. racecase) {lock mtx}
{unlock mtx}
{lock userMtx in calling code}
{lock mtx}
waiters++ (=2)
oldWawkupes = 1
{unlock userMtx }
wait {unlock mtx}
timeout {lock mtx}
{unlock mtx}
{lock userMtx}
{lock mtx}
waiters-- (=1)
wakeups-- (=0)*
{unlock mtx}
{unlock userMtx in calling code}
{lock userMtx}
{lock mtx}
waiters--(=0)
wakeups == oldWakeups (=0)
{unlock mtx}
{unlock userMtx in calling code}
At this point, on thread 1, oldWakeups = wakeups, so the check for the unavoidable race case fails to notice the race case, recreating the unavoidable race case. This is caused by thread 3 getting to steal the signal intended for thread1, making thread 3 (a true timeout) look like a signal, and thread1 (a race signal/timeout) look like a timeout
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With