Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Destruction of condition variable randomly loses notification

Given a condition_variable as a member of a class, my understanding is that:

  1. The condition variable is destroyed after the class destructor completes.
  2. Destruction of a condition variable does not need to wait for notifications to have been received.

In light of these expectations, my question is: why does the example code below randomly fail to notify a waiting thread?

#include <mutex>
#include <condition_variable>
#define NOTIFY_IN_DESTRUCTOR 

struct notify_on_delete {
    std::condition_variable cv;

    ~notify_on_delete() {
#ifdef NOTIFY_IN_DESTRUCTOR
        cv.notify_all();
#endif
    }
};

int main () {
    for (int trial = 0; trial < 10000; ++trial) {
        notify_on_delete* nod = new notify_on_delete();
        std::mutex flag;
        bool kill = false;

        std::thread run([nod, &flag, &kill] () {
            std::unique_lock<std::mutex> lock(flag);
            kill = true;
            nod->cv.wait(lock);
        });

        while(true) {
            std::unique_lock<std::mutex> lock(flag);
            if (!kill) continue;
#ifdef NOTIFY_IN_DESTRUCTOR
            delete nod;
#else
            nod->cv.notify_all();
#endif
            break;
        }
        run.join();
#ifndef NOTIFY_IN_DESTRUCTOR
        delete nod;
#endif
    }
    return 0;
}

In the code above, if NOTIFY_IN_DESTRUCTOR is not defined then the test will run to completion reliably. However, when NOTIFY_IN_DESTRUCTOR is defined the test will randomly hang (usually after a few thousand trials).

I am compiling using Apple Clang: Apple LLVM version 9.0.0 (clang-900.0.39.2) Target: x86_64-apple-darwin17.3.0 Thread model: posix C++14 specified, compiled with DEBUG flags set.

EDIT:

To clarify: this question is about the semantics of the specified behavior of instances of condition_variable. The second point above appears to be reenforced in the following quote:

Blockquote Requires: There shall be no thread blocked on *this. [ Note: That is, all threads shall have been notified; they may subsequently block on the lock specified in the wait. This relaxes the usual rules, which would have required all wait calls to happen before destruction. Only the notification to unblock the wait needs to happen before destruction. The user should take care to ensure that no threads wait on *this once the destructor has been started, especially when the waiting threads are calling the wait functions in a loop or using the overloads of wait, wait_­for, or wait_­until that take a predicate. — end note ]

The core semantic question seems to be what "blocked on" means. My present interpretation of the quote above would be that after the line

cv.notify_all(); // defined NOTIFY_IN_DESTRUCTOR

in ~notify_on_delete() the thread test is not "blocked on" nod - which is to say that I presently understand that after this call "the notification to unblock the wait" has occurred, so according to the quote the requirement has been met to proceed with the destruction of the condition_variable instance.

Can someone provide a clarification of "blocked on" or "notification to unblock" to the effect that in the code above, the call to notify_all() does not satisfy the requirements of ~condition_variable()?

like image 958
AngelGabriel Avatar asked Jan 04 '18 10:01

AngelGabriel


2 Answers

When NOTIFY_IN_DESTRUCTOR is defined:
Calling notify_one()/notify_all() doesn't mean that the waiting thread is immediately woken up and the current thread will wait for the other thread. It just means that if the waiting thread wakes up at some point after the current thread has called notify, it should proceed. So in essence, you might be deleting the condition variable before the waiting thread wakes up (depending on how the threads are scheduled).

The explanation for why it hangs, even if the condition variable is deleted while the other thread is waiting on it lies on the fact the wait/notify operations are implemented using queues associated with the condition variables. These queues hold the threads waiting on the condition variables. Freeing the condition variable would mean getting rid of these thread queues.

like image 132
nishantsingh Avatar answered Nov 18 '22 06:11

nishantsingh


I am pretty sure your vendors implementation is broken. Your program looks almost OK from the perspective of obeying the contract with the cv/mutex classes. I couldn’t 100% verify, I am behind one version.

The notion of “blocking” is confusing in the condition_variable (CV) class because there are multiple things to be blocking on. The contract requires the implementation to be more complex than a veneer on pthread_cond* (for example). My reading of it indicates that a single CV would require at least 2 pthread_cond_t’s to implement.

The crux is the destructor having a definition while threads are waiting upon a CV; and its ruin is in a race between CV.wait and ~CV. The naive implementation simply has ~CV broadcast the condvar then eliminate it, and has CV.wait remember the lock in a local variable, so that when it awakens from the runtime notion of blocking it no longer has to reference the object. In that implementation, ~CV becomes a “fire and forget” mechanism.

Sadly, a racing CV.wait could meet the preconditions, yet not be finished interacting with the object yet, when ~CV sneaks in and destroys it. To resolve the race CV.wait and ~CV need to exclude each other, thus the CV requires at least a private mutex to resolve races.

We aren’t finished yet. There usually isn’t an underlying support [ eg. kernel ] for an operation like “wait on cv controlled by lock and release this other lock once I am blocked”. I think that even the posix folks found that too funny to require. Thus, burying a mutex in my CV isn’t enough, I actually require a mechanism that permits me to process events within it; thus a private condvar is required inside the implementation of CV. Obligatory David Parnas meme.

Almost OK, because as Marek R points out, you are relying on referencing a class after its destruction has begun; not the cv/mutex class, your notify_on_delete class. The conflict is a bit academic. I doubt clang would depend upon nod remaining valid after control had transferred to nod->cv.wait(); but the real customer of most compiler vendors are benchmarks, not programmers.

As as general note, multi-threaded programming is difficult, and having now peaked at the c++ threading model, it might be best to give it a decade or two to settle down. It’s contracts are astonishing. When I first looked at your program, I thought ‘duh, there is no way you can destroy a cv that can be accessed because RAII’. Silly me.

Pthreads is another awful API for threading. At least it doesn’t attempt over-reach, and is mature enough that robust test suites keep vendors in line.

like image 5
mevets Avatar answered Nov 18 '22 05:11

mevets