I have a large program that needs to be made as resilient as possible, and has a large number of threads.
I need to catch the SIGBUS and SIGSEGV signals, and re-initialize the problem thread if necessary, or disable the thread to continue with reduced functionality.
My first thought is to do a setjmp, and then set signal handlers that can log the problem and then do a longjmp back to a recovery point in the thread. The issue is that the signal handler would need to determine which thread the signal came from in order to use the appropriate jump buffer, as jumping back to the wrong thread would be useless.
Does anyone have any idea how to determine the offending thread in the signal handler?
A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.
A signal mask is associated with each thread. The list of actions associated with each signal number is shared among all threads in the process. If the signal action specifies termination, stop, or continue, the entire process, thus including all its threads, is respectively terminated, stopped, or continued.
The pthread_kill() function sends the signal sig to thread, a thread in the same process as the caller. The signal is asynchronously directed to thread. If sig is 0, then no signal is sent, but error checking is still performed; this can be used to check for the existence of a thread ID.
Each thread can have its own set of signals that will be blocked from delivery. The sigthreadmask subroutine must be used to get and set the calling thread's signal mask. The sigprocmask subroutine must not be used in multi-threaded programs; otherwise, unexpected behavior may result.
It is generally unsafe to provide slots in your QThread subclass, unless you protect the member variables with a mutex. On the other hand, you can safely emit signals from your QThread::run() implementation, because signal emission is thread-safe.
I'm going to assume you've already thought this through and have an extremely good reason to believe that your program will be more resilient by attempting to retry after a SIGSEGV - bearing in mind segfaults highlight issues with dangling pointers and other abuses that might also be corrupting unpredictable locations in your process address space without segfaulting.
Since you've thought this through extremely carefully, and you've determined (somehow) that the particular way your application segfaults cannot possibly disguise the corruption of the accounting data used for canceling and restarting threads, and that you have perfect cancellation logic for those threads (also extraordinarily rare), let's go ahead and tackle the problem.
The SIGSEGV handler on Linux is executed in the thread of the failing instruction (man 7 signal). We avoid calling pthread_self() in the handler, as older POSIX editions do not list it as async-signal-safe, but the internet widely seems to agree that syscall (man 2 syscall) is safe, so we can get the thread ID via syscall SYS_gettid. So we'll need to maintain a mapping of pthread_t's (pthread_self) to pid's (gettid()). Since write() is also safe, we can trap SEGV, write the current thread ID down a pipe, and then pause until pthread_cancel terminates us.
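Since the handler runs on the faulting thread, the setjmp/longjmp idea from the question does not actually need a thread lookup: a thread-local jump buffer is enough. The following is only a sketch under assumptions, not a recommendation: it is Linux-specific, uses the GCC/Clang __thread extension, and jumping out of the handler is tolerable here only because the fault happens in our own code rather than inside a non-reentrant library call. All names (t_recovery, on_fault, handler_runs_on_faulting_thread) are hypothetical.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* One recovery point per thread (__thread is a compiler extension). */
static __thread sigjmp_buf t_recovery;
static __thread volatile sig_atomic_t t_handler_tid;

static void on_fault(int signum)
{
    (void) signum;
    /* We are on the faulting thread, so these thread-locals are its own. */
    t_handler_tid = (sig_atomic_t) syscall(SYS_gettid);
    siglongjmp(t_recovery, 1);
}

/* Thread body: faults on purpose, then reports whether the handler ran
   on the same kernel thread. */
static void* faulting_thread(void* arg)
{
    pid_t my_tid = (pid_t) syscall(SYS_gettid);
    (void) arg;
    if (sigsetjmp(t_recovery, 1) == 0) {      /* 1: save and restore the signal mask */
        volatile char* bad = (volatile char*) 42;
        bad[0] = 'q';                          /* deliberate fault */
        return NULL;                           /* not reached */
    }
    /* Back at the recovery point via siglongjmp(). */
    return (void*) (long) (t_handler_tid == my_tid);
}

/* Returns 1 if the handler ran on the faulting thread, 0 if not, -1 on setup failure. */
int handler_runs_on_faulting_thread(void)
{
    struct sigaction sa;
    pthread_t th;
    void* result = NULL;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_fault;
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    if (pthread_create(&th, NULL, faulting_thread, NULL) != 0)
        return -1;
    pthread_join(th, &result);
    return result != NULL;
}
```

This does nothing about whatever memory the thread trampled before it faulted; it only shows that the recovery point can be found without any thread bookkeeping.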
We also need a monitor thread to keep an eye on when things go pear-shaped. The monitor thread monitors the read end of the pipe for information on the terminated thread, and may restart it.
Because I think pretending to handle SIGSEGV is daft, I'm going to call the structures here which do so daft_thread_t, etc. someone_please_fix_me represents your broken code. The monitor thread is main(). When a thread segfaults, it is trapped by the signal handler, writes its ID down a pipe; the monitor reads the pipe, cancels the thread with pthread_cancel and pthread_join, and restarts it.
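#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>       // pipe, read, write, pause, sleep, syscall
#include <sys/syscall.h>

#define MAX_DAFT_THREADS (1024) // arbitrary

// Check a call that reports failure by returning -1 and setting errno.
#define CHECK_OSCALL(call, onfail) do { \
    if ((call) == -1) { \
        char buf[512]; \
        strerror_r(errno, buf, sizeof(buf)); \
        fprintf(stderr, "%s@%d failed: %s\n", __FILE__, __LINE__, buf); \
        onfail; \
    } \
} while (0)

pid_t gettid(void); // defined further down, with the signal handling code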
/*********************** daft thread accounting *****************/
typedef void* (*threadproc_t)(void* arg);

struct daft_thread_t {
    threadproc_t start_routine;
    void* start_routine_arg;
    pthread_t pthread;
    pid_t tid;
};

struct daft_thread_accounting_info_t {
    int monitor_pipe[2];
    pthread_mutex_t info_lock;
    size_t daft_thread_count;
    struct daft_thread_t daft_threads[MAX_DAFT_THREADS];
};

static struct daft_thread_accounting_info_t g_thread_accounting;

void daft_thread_accounting_info_init(struct daft_thread_accounting_info_t* inf)
{
    memset(inf, 0, sizeof(*inf));
    pthread_mutex_init(&inf->info_lock, NULL);
    CHECK_OSCALL(pipe(inf->monitor_pipe), abort());
}

struct daft_thread_wrapper_data_t {
    struct daft_thread_t* thread_info;
};

static void* daft_thread_wrapper(void* arg)
{
    struct daft_thread_t* wrapper = arg;
    wrapper->tid = gettid();
    return (*wrapper->start_routine)(wrapper->start_routine_arg);
}
static void start_daft_thread(threadproc_t proc, void* arg)
{
    struct daft_thread_t* info;
    int rc;

    pthread_mutex_lock(&g_thread_accounting.info_lock);
    assert (g_thread_accounting.daft_thread_count < MAX_DAFT_THREADS);
    info = &g_thread_accounting.daft_threads[g_thread_accounting.daft_thread_count++];
    pthread_mutex_unlock(&g_thread_accounting.info_lock);
    info->start_routine = proc;
    info->start_routine_arg = arg;
    // pthread_* functions return an error number rather than -1 with errno set.
    rc = pthread_create(&info->pthread, NULL, daft_thread_wrapper, info);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
        abort();
    }
}
static struct daft_thread_t* find_thread_by_tid(pid_t thread_id)
{
    size_t k; // same type as daft_thread_count
    struct daft_thread_t* info = NULL;

    pthread_mutex_lock(&g_thread_accounting.info_lock);
    for (k = 0; k < g_thread_accounting.daft_thread_count; ++k) {
        if (g_thread_accounting.daft_threads[k].tid == thread_id) {
            info = &g_thread_accounting.daft_threads[k];
            break;
        }
    }
    pthread_mutex_unlock(&g_thread_accounting.info_lock);
    return info;
}
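static void restart_daft_thread(struct daft_thread_t* info)
{
    void* unused;
    int rc;
    // pthread_* functions return an error number rather than -1 with errno set.
    if ((rc = pthread_cancel(info->pthread)) != 0 ||
        (rc = pthread_join(info->pthread, &unused)) != 0) {
        fprintf(stderr, "cancel/join failed: %s\n", strerror(rc));
        abort();
    }
    info->tid = 0;
    if ((rc = pthread_create(&info->pthread, NULL, daft_thread_wrapper, info)) != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
        abort();
    }
}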
/************* signal handling stuff **************/
pid_t gettid(void) { return (pid_t) syscall(SYS_gettid); }

struct sigdeath_notify_info {
    int signum;
    pid_t tid;
};

static void sigdeath_handler(int signum, siginfo_t* info, void* ctx)
{
    ssize_t z;
    struct sigdeath_notify_info inf = {
        .signum = signum,
        .tid = gettid()
    };
    (void) info;
    (void) ctx;
    z = write(g_thread_accounting.monitor_pipe[1], &inf, sizeof(inf));
    if (z != (ssize_t) sizeof(inf))
        abort(); // SIGABRT. Are we handling that too? Hope not.
    for (;;)
        pause(); // returning doesn't do us any good; wait here to be cancelled.
}

static void register_signal_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = sigdeath_handler;
    sa.sa_flags = SA_SIGINFO;
    CHECK_OSCALL(sigaction(SIGSEGV, &sa, NULL), abort());
    CHECK_OSCALL(sigaction(SIGBUS, &sa, NULL), abort());
}
/** This is the code that segfaults randomly. Kwality with a 'k'. */
static void* someone_please_fix_me(void* arg)
{
    char* i_think_this_address_looks_nice = (char*) 42;
    (void) arg;
    sleep(1 + rand() % 200);
    i_think_this_address_looks_nice[0] = 'q'; // ugh
    return NULL;
}
// main() will serve as the monitor thread here
int main(void)
{
    int k;
    struct sigdeath_notify_info death;

    daft_thread_accounting_info_init(&g_thread_accounting);
    register_signal_handlers();
    for (k = 0; k < 200; ++k) {
        // Smuggle the index through the void* argument.
        start_daft_thread(someone_please_fix_me, (void*) (long) k);
    }
    while (read(g_thread_accounting.monitor_pipe[0], &death, sizeof(death)) == (ssize_t) sizeof(death)) {
        struct daft_thread_t* info = find_thread_by_tid(death.tid);
        if (info == NULL) {
            fprintf(stderr, "*** thread_id %d not found\n", (int) death.tid);
            continue;
        }
        fprintf(stderr, "Thread %d (%d) died of %d, restarting.\n",
                (int) death.tid, (int) (long) info->start_routine_arg, death.signum);
        restart_daft_thread(info);
    }
    fprintf(stderr, "Shouldn't get here.\n");
    return 0;
}
If you haven't thought about it: Attempting to recover from SIGSEGV is extraordinarily risky - I strongly advise against it. Threads share an address space. The thread that segfaulted might also have corrupted other threads' data or global accounting data, such as malloc()'s accounting. A far safer approach - assuming the failing code is irreparably broken but must be used - is to quarantine the failing code behind a process boundary, for instance by fork()ing before invoking the broken code. You then must trap SIGCHLD and deal with the process crashing or terminating normally, alongside a number of other pitfalls, but at least you don't have to worry about random corruption. Of course, the best option is to fix the bloody code so you're not observing segfaults.
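A rough sketch of the process-boundary idea, assuming a POSIX system. For brevity it blocks in waitpid() instead of trapping SIGCHLD, and it assumes the caller is effectively single-threaded at the time of the fork(), since only the calling thread exists in the child. The names run_quarantined, well_behaved and broken are invented for this example.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*risky_fn_t)(void);

/* Runs fn in a child process. Returns 0 if the child exited normally with
   status 0, the terminating signal number if a signal killed it, -1 otherwise. */
int run_quarantined(risky_fn_t fn)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: undo any crash handlers inherited from the parent. */
        signal(SIGSEGV, SIG_DFL);
        signal(SIGBUS, SIG_DFL);
        fn();
        _exit(0);
    }
    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}

/* Two stand-ins for the code being quarantined. */
static void well_behaved(void) { }

static void broken(void)
{
    volatile char* bad = (volatile char*) 42;
    bad[0] = 'q'; /* crashes the child only */
}
```

Here run_quarantined(well_behaved) reports 0, while run_quarantined(broken) reports the signal that killed the child (SIGSEGV on Linux), and the parent's memory is untouched either way. Getting results back from the child needs a pipe or shared memory, which is one of the pitfalls mentioned above.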