trapping signals in a multithreaded environment

I have a large program with many threads that needs to be as resilient as possible. I need to catch signals such as SIGBUS and SIGSEGV, then either re-initialize the offending thread or disable it and continue with reduced functionality.

My first thought is to call setjmp, then install signal handlers that log the problem and longjmp back to a recovery point in the thread. The catch is that the signal handler would need to determine which thread the signal came from in order to use the appropriate jump buffer, since jumping back into the wrong thread would be useless.

Does anyone have any idea how to determine the offending thread in the signal handler?

373

asked May 29 '15 22:05

camelccc

1 Answer

I'm going to assume you've already thought this through and have an extremely good reason to believe that your program will be more resilient by attempting to retry after a SIGSEGV - bearing in mind segfaults highlight issues with dangling pointers and other abuses that might also be corrupting unpredictable locations in your process address space without segfaulting.

Since you've thought this through extremely carefully, and you've determined (somehow) that the particular way your application segfaults cannot possibly disguise the corruption of the accounting data used for canceling and restarting threads, and that you have perfect cancellation logic for those threads (also extraordinarily rare), let's go ahead and tackle the problem.

The SIGSEGV handler on Linux is executed in the thread of the failing instruction (man 7 signal). We avoid pthread_self() in the handler, since it has not always been listed as async-signal-safe, but the internet widely seems to agree that syscall (man 2 syscall) is safe, so we can get the thread ID via syscall(SYS_gettid). So we'll need to maintain a mapping from pthread_t's (pthread_self()) to kernel thread IDs (gettid()). Since write() is also safe, we can trap SEGV, write the current thread ID down a pipe, and then pause until pthread_cancel terminates us.

We also need a monitor thread to keep an eye on when things go pear-shaped. The monitor thread monitors the read end of the pipe for information on the terminated thread, and may restart it.

Because I think pretending to handle SIGSEGV is daft, I'm going to call the structures here which do so daft_thread_t, etc. someone_please_fix_me represents your broken code. The monitor thread is main(). When a thread segfaults, the signal handler traps it and writes the thread's ID down a pipe; the monitor reads the pipe, cancels the thread with pthread_cancel, reaps it with pthread_join, and restarts it.

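#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h> // pid_t
#include <unistd.h>    // pipe, read, write, pause, sleep, syscall

pid_t gettid(void); // defined further down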

#define MAX_DAFT_THREADS (1024) // arbitrary

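// For calls that return -1 and set errno. pthread_* functions do not
// follow that convention: they return the error number directly.
#define CHECK_OSCALL(call, onfail) do { \
    if ((call) == -1) { \
        char buf[512]; \
        strerror_r(errno, buf, sizeof(buf)); \
        fprintf(stderr, "%s@%d failed: %s\n", __FILE__, __LINE__, buf); \
        onfail; \
    } \
} while (0)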

/*********************** daft thread accounting *****************/
typedef void* (*threadproc_t)(void* arg);

struct daft_thread_t {
    threadproc_t start_routine;
    void* start_routine_arg;
    pthread_t pthread;
    pid_t tid;
};

struct daft_thread_accounting_info_t {
    int monitor_pipe[2];
    pthread_mutex_t info_lock;
    size_t daft_thread_count;
    struct daft_thread_t daft_threads[MAX_DAFT_THREADS];
};

static struct daft_thread_accounting_info_t g_thread_accounting;

void daft_thread_accounting_info_init(struct daft_thread_accounting_info_t* inf)
{
    memset(inf, 0, sizeof(*inf));
    pthread_mutex_init(&inf->info_lock, NULL);
    CHECK_OSCALL(pipe(inf->monitor_pipe), abort());
}

struct daft_thread_wrapper_data_t {
    struct daft_thread_t* thread_info;
};

static void* daft_thread_wrapper(void* arg)
{
    struct daft_thread_t* wrapper = arg;
    wrapper->tid = gettid();
    return (*wrapper->start_routine)(wrapper->start_routine_arg);
}

static void start_daft_thread(threadproc_t proc, void* arg)
{
    struct daft_thread_t*  info;
    pthread_mutex_lock(&g_thread_accounting.info_lock);
    assert (g_thread_accounting.daft_thread_count < MAX_DAFT_THREADS);
    info = &g_thread_accounting.daft_threads[g_thread_accounting.daft_thread_count++];
    pthread_mutex_unlock(&g_thread_accounting.info_lock);
    info->start_routine = proc;
    info->start_routine_arg = arg;
    // pthread_create returns an error number on failure, never -1
    if (pthread_create(&info->pthread, NULL, daft_thread_wrapper, info) != 0)
        abort();
}

static struct daft_thread_t* find_thread_by_tid(pid_t thread_id)
{
    size_t k; // matches the type of daft_thread_count
    struct daft_thread_t* info = NULL;
    pthread_mutex_lock(&g_thread_accounting.info_lock);
    for (k = 0; k < g_thread_accounting.daft_thread_count; ++k) {
        if (g_thread_accounting.daft_threads[k].tid == thread_id) {
            info = &g_thread_accounting.daft_threads[k];
            break;
        }
    }
    pthread_mutex_unlock(&g_thread_accounting.info_lock);
    return info;
}

static void restart_daft_thread(struct daft_thread_t* info)
{
    void* unused;
    // pthread_* functions return an error number on failure, never -1
    if (pthread_cancel(info->pthread) != 0)
        abort();
    if (pthread_join(info->pthread, &unused) != 0)
        abort();
    info->tid = 0;
    if (pthread_create(&info->pthread, NULL, daft_thread_wrapper, info) != 0)
        abort();
}

/************* signal handling stuff **************/
struct sigdeath_notify_info {
    int signum;
    pid_t tid;
};

static void sigdeath_handler(int signum, siginfo_t* info, void* ctx)
{
    ssize_t z;
    struct sigdeath_notify_info inf = {
        .signum = signum,
        .tid = gettid()
    };
    (void) info;
    (void) ctx;
    z = write(g_thread_accounting.monitor_pipe[1], &inf, sizeof(inf));
    if (z != (ssize_t) sizeof(inf))
        abort(); // SIGABRT. Are we handling that too? Hope not.
    for (;;)
        pause(); // returning doesn't do us any good; wait here for pthread_cancel.
}

static void register_signal_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa)); // portable alternative to the empty initializer
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = sigdeath_handler;
    sa.sa_flags = SA_SIGINFO;
    CHECK_OSCALL(sigaction(SIGSEGV, &sa, NULL), abort());
    CHECK_OSCALL(sigaction(SIGBUS, &sa, NULL), abort());
}

// Kernel thread ID of the calling thread, fetched with a raw syscall.
pid_t gettid(void) { return (pid_t) syscall(SYS_gettid); }

/** This is the code that segfaults randomly. Kwality with a 'k'. */
static void* someone_please_fix_me(void* arg)
{
    char* i_think_this_address_looks_nice = (char*) 42;
    sleep(1 + rand() % 200);
    i_think_this_address_looks_nice[0] = 'q'; // ugh
    return NULL;
}

// main() will serve as the monitor thread here
int main()
{
    int k;
    struct sigdeath_notify_info death;
    daft_thread_accounting_info_init(&g_thread_accounting);
    register_signal_handlers();
    for (k = 0; k < 200; ++k) {
        start_daft_thread(someone_please_fix_me, (void*) (long) k);
    }
    while (read(g_thread_accounting.monitor_pipe[0], &death, sizeof(death)) == sizeof(death)) {
        struct daft_thread_t* info = find_thread_by_tid(death.tid);
        if (info == NULL) {
            fprintf(stderr, "*** thread_id %d not found\n", (int) death.tid);
            continue;
        }
        fprintf(stderr, "Thread %d (%d) died of %d, restarting.\n",
            (int) death.tid, (int) (long) info->start_routine_arg, death.signum);
        restart_daft_thread(info);
    }
    fprintf(stderr, "Shouldn't get here.\n");
    return 0;
}

If you haven't thought about it: attempting to recover from SIGSEGV is extraordinarily risky - I strongly advise against it. Threads share an address space, so the thread that segfaulted might also have corrupted other threads' data or global accounting data, such as malloc()'s bookkeeping. A far safer approach - assuming the failing code is irreparably broken but must be used - is to quarantine it behind a process boundary, for instance by fork()ing before invoking the broken code. You then must trap SIGCHLD and deal with the process crashing or terminating normally, alongside a number of other pitfalls, but at least you don't have to worry about random corruption. Of course, the best option is to fix the bloody code so you're not observing segfaults.
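To make the process-boundary idea a bit more concrete, here is a minimal sketch, assuming the broken code can be wrapped in a function that takes no arguments. To keep it short it uses a blocking waitpid() instead of a SIGCHLD handler, and raise() stands in for a real crash; run_quarantined, risky_fn_t, well_behaved and crashes are names invented for the example.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*risky_fn_t)(void); /* hypothetical wrapper around the broken code */

/*
 * Runs fn in a child process.
 * Returns 0 if the child exited normally with status 0,
 * the signal number if a signal killed it, and -1 otherwise.
 */
static int run_quarantined(risky_fn_t fn)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: restore default actions so a crash really terminates us. */
        signal(SIGSEGV, SIG_DFL);
        signal(SIGBUS, SIG_DFL);
        fn();
        _exit(0);
    }
    /* Parent: wait for the child, retrying if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}

/* Two toy payloads: one returns cleanly, one dies of SIGSEGV. */
static void well_behaved(void) { }
static void crashes(void) { raise(SIGSEGV); }
```

Here run_quarantined(well_behaved) returns 0 and run_quarantined(crashes) returns SIGSEGV, while the parent's memory stays untouched either way. Getting results back from the child (a pipe, shared memory, an exit code) and forking safely from a multithreaded parent are among the pitfalls this sketch leaves out.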

157

answered Oct 18 '22 17:10

lyngvi

Tags: c, multithreading, signals, setjmp
