Finding the cause of a BrokenProcessPool in Python's concurrent.futures

Tags: python, debugging, concurrent.futures

In a nutshell

I get a BrokenProcessPool exception when parallelizing my code with concurrent.futures. No further error is displayed. I want to find the cause of the error and am looking for ideas on how to do that.

Full problem

I am using concurrent.futures to parallelize some code.

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as pool:
    mapObj = pool.map(myMethod, args)

I end up with (and only with) the following exception:

concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore

Unfortunately, the program is complex and the error appears only after the program has run for 30 minutes. Therefore, I cannot provide a nice minimal example.

In order to find the cause of the issue, I wrapped the method that I run in parallel with a try-except-block:

def myMethod(*args):
    try:
        ...
    except Exception as e:
        print(e)

The problem remained the same and the except block was never entered. I conclude that the exception does not come from my code.
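A silent except block only rules out Python-level exceptions; a worker that the operating system kills never gets the chance to raise one. A minimal sketch of that situation, assuming a POSIX system and a hypothetical worker named fragile_task that kills itself with a signal to stand in for a crash in native code:

```python
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def fragile_task(x):
    """Hypothetical worker: the try/except only sees Python exceptions."""
    try:
        if x == 3:
            # Stand-in for a crash in native code: the OS kills the process,
            # so no Python exception is raised and the except is skipped.
            os.kill(os.getpid(), signal.SIGKILL)
        return x * x
    except Exception as e:
        print("caught:", e)


def run(values):
    """Map fragile_task over values and report a broken pool as a string."""
    try:
        with ProcessPoolExecutor(max_workers=2) as pool:
            return list(pool.map(fragile_task, values))
    except BrokenProcessPool as e:
        return "BrokenProcessPool: {}".format(e)


if __name__ == "__main__":
    print(run(range(6)))
```

Run as a script, this should end in a BrokenProcessPool without "caught:" ever being printed, which matches the symptom described above.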

My next step was to write a custom ProcessPoolExecutor class that is a child of the original ProcessPoolExecutor and allows me to replace some methods with customized ones. I copied and pasted the original code of the method _process_worker and added some print statements.

def _process_worker(call_queue, result_queue):
    """Evaluates calls from call_queue and places the results in result_queue.
        ...
    """
    while True:
        call_item = call_queue.get(block=True)
        if call_item is None:
            # Wake up queue management thread
            result_queue.put(os.getpid())
            return
        try:
            r = call_item.fn(*call_item.args, **call_item.kwargs)
        except BaseException as e:
            print("??? Exception ???")                 # newly added
            print(e)                                   # newly added
            exc = _ExceptionWithTraceback(e, e.__traceback__)
            result_queue.put(_ResultItem(call_item.work_id, exception=exc))
        else:
            result_queue.put(_ResultItem(call_item.work_id,
                                         result=r))

Again, the except block is never entered. This was to be expected, because I already ensured that my code does not raise an exception (and if everything worked well, the exception should be passed to the main process).

Now I am out of ideas for how to find the error. The exception is raised here:

def submit(self, fn, *args, **kwargs):
    with self._shutdown_lock:
        if self._broken:
            raise BrokenProcessPool('A child process terminated '
                'abruptly, the process pool is not usable anymore')
        if self._shutdown_thread:
            raise RuntimeError('cannot schedule new futures after shutdown')

        f = _base.Future()
        w = _WorkItem(f, fn, args, kwargs)

        self._pending_work_items[self._queue_count] = w
        self._work_ids.put(self._queue_count)
        self._queue_count += 1
        # Wake up queue management thread
        self._result_queue.put(None)

        self._start_queue_management_thread()
        return f

The process pool is set to be broken here:

def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
        ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:                               #THIS BLOCK IS ENTERED WHEN THE ERROR OCCURS
            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...

It is (or seems to be) a fact that a process terminates, but I have no clue why. Are my thoughts correct so far? What are possible causes that make a process terminate without a message? (Is this even possible?) Where could I apply further diagnostics? Which questions should I ask myself in order to come closer to a solution?

I am using Python 3.5 on 64-bit Linux.

asked Jan 03 '17 23:01

Samufi

2 Answers

I think I was able to get as far as possible:

I changed the _queue_management_worker function in my modified copy of the ProcessPoolExecutor module so that it prints the exit code of the failed process:

def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
        ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:
            # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
            vals = list(processes.values())
            for s in ready:
                j = sentinels.index(s)
                print("is_alive()", vals[j].is_alive())
                print("exitcode", vals[j].exitcode)
            # -------------------------------------------

            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...

Afterwards I looked up the meaning of the exit code:

from multiprocessing.process import _exitcode_to_name
print(_exitcode_to_name[my_exit_code])

where my_exit_code is the exit code printed by the block I inserted into _queue_management_worker. In my case the code was -11, which means that the worker process was killed by a segmentation fault (SIGSEGV). Finding the reason for the segmentation fault will be a huge task, but it goes beyond the scope of this question.
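The same exit code can be obtained without patching the standard library. A minimal sketch, assuming a POSIX system: the hypothetical helpers run_and_report and describe_exitcode run one task in a standalone multiprocessing.Process, enable faulthandler in the child so that a Python traceback is written to stderr on a fatal signal, and translate the exit code with the public signal.Signals enum instead of the private _exitcode_to_name table. crashing_task is a stand-in that sends SIGSEGV to itself.

```python
import faulthandler
import multiprocessing as mp
import os
import signal


def describe_exitcode(exitcode):
    """Translate a Process.exitcode value into a readable string."""
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # multiprocessing reports death by signal N as exit code -N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number without a named enum member.
            name = "signal {}".format(-exitcode)
        return "killed by " + name
    return "exited with status {}".format(exitcode)


def crashing_task():
    """Hypothetical stand-in for a task that segfaults in native code."""
    os.kill(os.getpid(), signal.SIGSEGV)


def _child_main(target):
    # Runs in the child: dump the Python traceback to stderr on fatal signals.
    faulthandler.enable()
    target()


def run_and_report(target):
    """Run target in its own process and return the process exit code."""
    proc = mp.Process(target=_child_main, args=(target,))
    proc.start()
    proc.join()
    return proc.exitcode


if __name__ == "__main__":
    code = run_and_report(crashing_task)
    print(code, describe_exitcode(code))
```

Inside a process pool the same idea applies: calling faulthandler.enable() at the start of the worker function should make a crashing worker print the Python stack it was executing, which narrows down where the segmentation fault originates.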

answered Oct 11 '22 18:10

Samufi

If you are using macOS, there is a known issue: some versions of macOS use forking in a way that Python does not consider fork-safe in some scenarios. The workaround that worked for me is to set the no_proxy environment variable.

Edit ~/.bash_profile and include the following (it might be better to specify a list of domains or subnets here instead of *):

export no_proxy='*'

Refresh the current context

source ~/.bash_profile
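The variable can also be set from inside the program, before any worker process is created. A minimal sketch, under the assumption that the proxy lookup reads no_proxy from the environment at call time; apply_no_proxy_workaround and square are hypothetical names used only for illustration:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def apply_no_proxy_workaround(value="*"):
    """Set no_proxy for this process; child processes inherit it."""
    os.environ["no_proxy"] = value


def square(x):
    """Trivial placeholder task."""
    return x * x


def main():
    # Assumption: this must run before the pool creates its workers.
    apply_no_proxy_workaround()
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(square, range(5)))


if __name__ == "__main__":
    print(main())
```

Setting the variable in code keeps the workaround local to the program instead of changing every shell session.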

The versions on which I saw the issue and applied the workaround: Python 3.6.0 on macOS 10.14.1 and 10.13.x.

Sources: Issue 30388, Issue 27126

answered Oct 11 '22 17:10

gowthamnvv
