
Recover from segfault in Python

I have a few functions in my code that are randomly causing a segmentation fault. I've identified them by enabling faulthandler. I'm a bit stuck and have no idea how to reliably eliminate this problem.

I'm thinking about a workaround. Since the functions crash randomly, I could retry them after a failure. The problem is that there's no way to recover from a segmentation fault within the same process. The best idea I have for now is to rewrite these functions a bit and run them via subprocess, so that a crashed function won't crash the whole application and can be retried.

Some of the functions are quite small and frequently executed, so spawning a subprocess for each call would significantly slow down my app. Is there any way to execute a function in a separate context, faster than a subprocess, that won't crash the whole program in case of a segfault?

Djent asked Nov 02 '20

People also ask

Can you recover from a segfault?

On both Windows and Linux, the segfault handler function is passed a "context struct", which includes the state of the registers at the failure site. Ostensibly, this is so people can repair the problem that caused the segfault (it also lets you do nifty things like userspace segment handling).

How do you find the cause of segfault?

Check shell limits. Usually it is the limit on stack size that causes this kind of problem. To check memory limits, use the ulimit command in bash or ksh, or the limit command in csh or tcsh. Try setting the stack size higher, and then re-run your program to see if the segfault goes away.
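The same check can be done from Python itself with the standard resource module (POSIX only; a minimal sketch):

```python
import resource

# Equivalent of `ulimit -s` in bash: query the stack-size limit in bytes
# (resource.RLIM_INFINITY means "unlimited").
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack limit (soft, hard):", soft, hard)

# Try raising the soft limit to the hard limit before re-running the
# crashing code; the OS may refuse depending on privileges.
try:
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
except (ValueError, OSError):
    pass
```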

What happens during a segfault?

A segmentation fault occurs when a program attempts to access a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (for example, attempting to write to a read-only location, or to overwrite part of the operating system).


2 Answers

tl;dr: You can handle the signal in C code using signal, setjmp and longjmp.


You have multiple options for handling SIGSEGV:

  • spawning a sub-process using the subprocess library
  • forking using the multiprocessing library
  • writing a custom signal handler

The subprocess and fork approaches have already been described, so I will focus on the signal-handler point of view.

Writing signal handler

From a kernel perspective, there is no difference between SIGSEGV and any other signal like SIGUSR1, SIGQUIT, SIGINT, etc. In fact, some runtimes (like the JVM) use them as a means of communication.
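To illustrate, for an asynchronous signal like SIGUSR1 a Python-level handler works just fine (a minimal sketch, POSIX only):

```python
import os
import signal

received = []

def on_usr1(signum, frame):
    received.append(signum)

# From the interpreter's point of view this is just another signal:
signal.signal(signal.SIGUSR1, on_usr1)
os.kill(os.getpid(), signal.SIGUSR1)  # send it to ourselves

print(received)  # [<Signals.SIGUSR1: 10>]
```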

Unfortunately, you can't usefully handle SIGSEGV from Python code. See the docs:

It makes little sense to catch synchronous errors like SIGFPE or SIGSEGV that are caused by an invalid operation in C code. Python will return from the signal handler to the C code, which is likely to raise the same signal again, causing Python to apparently hang. From Python 3.3 onwards, you can use the faulthandler module to report on synchronous errors.
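As the quoted docs note, faulthandler can at least report where the crash happened, even though it cannot prevent it. Enabling it is a one-liner (a minimal sketch):

```python
import faulthandler

# Dump the Python traceback if the process receives SIGSEGV, SIGFPE,
# SIGABRT, SIGBUS or SIGILL. This reports the crash; it does not stop it.
faulthandler.enable()
print(faulthandler.is_enabled())  # True
```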

This means the error management has to be done in C code.

You can write custom signal handler and use setjmp and longjmp to save and restore stack context.

For example, here is a simple CPython C extension:

#include <signal.h>
#include <setjmp.h>

#define PY_SSIZE_T_CLEAN
#include <Python.h>

/* sigsetjmp/siglongjmp (rather than setjmp/longjmp) also save and restore
 * the signal mask, so SIGSEGV is unblocked again after we jump out of the
 * handler and a later segfault can be caught too. */
static sigjmp_buf jmpctx;

static void handle_segv(int signo)
{
    (void)signo;
    siglongjmp(jmpctx, 1);
}

static PyObject *
install_sig_handler(PyObject *self, PyObject *args)
{
    signal(SIGSEGV, handle_segv);
    Py_RETURN_TRUE;
}

static PyObject *
trigger_segfault(PyObject *self, PyObject *args)
{
    if (!sigsetjmp(jmpctx, 1))  /* 1 = save the current signal mask */
    {
        // Writing through a NULL pointer triggers a segfault
        int *x = NULL;
        *x = 42;

        Py_RETURN_TRUE; // Never reached
    }

    Py_RETURN_FALSE;
}

static PyMethodDef SpamMethods[] = {
    {"install_sig_handler", install_sig_handler, METH_VARARGS, "Install SIGSEGV handler"},
    {"trigger_segfault", trigger_segfault, METH_VARARGS, "Trigger a segfault"},
    {NULL, NULL, 0, NULL},
};

static struct PyModuleDef spammodule = {
    PyModuleDef_HEAD_INIT,
    "crash",
    "Crash and recover",
    -1,
    SpamMethods,
};

PyMODINIT_FUNC
PyInit_crash(void)
{
    return PyModule_Create(&spammodule);
}
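Before the caller can import it, the extension has to be compiled. A minimal setup.py sketch (assuming the C file above is saved as crash.c; build in place with python setup.py build_ext --inplace):

```python
# setup.py -- builds the "crash" CPython extension from crash.c
from setuptools import Extension, setup

setup(
    name="crash",
    ext_modules=[Extension("crash", sources=["crash.c"])],
)
```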

And the caller app:

import crash

print("Install custom sighandler")
crash.install_sig_handler()

print("bad_func: before")
retval = crash.trigger_segfault()
print("bad_func: after (retval:", retval, ")")

This produces the following output:

Install custom sighandler
bad_func: before
bad_func: after (retval: False )

Pros and cons

Pros:

  • From the OS perspective, the app just catches SIGSEGV like a regular signal, so error handling is fast.
  • It does not require forking (not always possible if your app holds various kinds of file descriptors, sockets, ...).
  • It does not require spawning sub-processes (not always possible, and a much slower method).

Cons:

  • It might cause memory leaks (resources allocated before the jump are never released).
  • It might hide undefined / dangerous behavior.

Keep in mind that a segmentation fault is a really serious error! Always try to fix it first instead of hiding it.

A few links and references:

  • https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers
  • How to write a signal handler to catch SIGSEGV?
  • https://docs.python.org/3/extending/extending.html#extending-python-with-c-or-c
  • https://www.cplusplus.com/reference/csetjmp/setjmp/
  • https://www.cplusplus.com/reference/csetjmp/longjmp/
arthurlm answered Oct 01 '22


I had some unreliable C extensions throwing segfaults every once in a while and, since there was no way I was going to be able to fix that, what I did was create a decorator that runs the wrapped function in a separate process. That way segfaults can't kill the main process.

Something like this: https://gist.github.com/joezuntz/e7e7764e5b591ed519cfd488e20311f1

Mine was a bit simpler, and it did the job for me. Additionally, it lets you choose a timeout and a default return value in case there was a problem:

#! /usr/bin/env python3

# std imports
import logging
import multiprocessing as mp


def parametrized(dec):
    """This decorator can be used to create other decorators that accept arguments"""

    def layer(*args, **kwargs):
        def repl(f):
            return dec(f, *args, **kwargs)

        return repl

    return layer


@parametrized
def sigsev_guard(fcn, default_value=None, timeout=None):
    """Used as a decorator with arguments.
    The decorated function will be called with its input arguments in another process.

    If the execution lasts longer than *timeout* seconds, it will be considered failed.

    If the execution fails, *default_value* will be returned.
    """

    def _fcn_wrapper(*args, **kwargs):
        q = mp.Queue()
        # NOTE: a lambda target only works with the "fork" start method
        # (POSIX); under "spawn" the target must be picklable.
        p = mp.Process(target=lambda q: q.put(fcn(*args, **kwargs)), args=(q,))
        p.start()
        p.join(timeout=timeout)
        exit_code = p.exitcode

        if exit_code == 0:
            return q.get()

        logging.warning('Process did not exit correctly. Exit code: {}'.format(exit_code))
        return default_value

    return _fcn_wrapper

So you would use it like:


@sigsev_guard(default_value=-1, timeout=60)
def your_risky_function(a, b, c, d):
    ...

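For completeness, here is a self-contained sketch of the same run-in-a-child-process idea (the names run_guarded and crashy are mine, not from the gist; it assumes the POSIX "fork" start method, and the crash is simulated by sending SIGSEGV rather than by a real bug):

```python
import multiprocessing as mp
import os
import signal


def _call(q, fcn, args, kwargs):
    q.put(fcn(*args, **kwargs))


def run_guarded(fcn, *args, default=None, timeout=None, **kwargs):
    """Run fcn in a forked child; return `default` if the child dies."""
    ctx = mp.get_context("fork")  # POSIX-only; avoids pickling fcn
    q = ctx.Queue()
    p = ctx.Process(target=_call, args=(q, fcn, args, kwargs))
    p.start()
    p.join(timeout)
    if p.is_alive():  # timed out: kill the child and give up
        p.terminate()
        p.join()
        return default
    return q.get() if p.exitcode == 0 else default


def crashy():
    # Deterministically die with SIGSEGV to simulate the buggy function.
    os.kill(os.getpid(), signal.SIGSEGV)


if __name__ == "__main__":
    print(run_guarded(crashy, default=-1))        # prints -1
    print(run_guarded(len, [1, 2, 3], default=-1))
```

Because the crash is confined to the forked child, the parent only sees a non-zero exit code and substitutes the default value.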
imochoa answered Oct 01 '22