
Python multiprocessing within Flask request with Gunicorn + Nginx

I want to build a service that will be able to handle:

  • a low volume of requests
  • a high compute cost for each request
  • but where the high compute cost can be parallelized.

My understanding of a pre-fork server is that something like the following happens:

  1. server starts
  2. Gunicorn creates multiple OS processes, also called workers, ready to accept requests
  3. request comes in. Nginx forwards to Gunicorn. Gunicorn sends to one of the workers.

What I want to understand is what happens if, in my Flask code, when handling the request, I have this:

from multiprocessing import Pool as ProcessPool
with ProcessPool(4) as pool:
    pool.map(some_expensive_function, some_data)

In particular:

  1. Will additional OS processes be started? Will the speedup be what I expect? (I.e., similar to if I ran the ProcessPool outside of a Flask production context?) If Gunicorn created 4 web workers, will there now be 7 OS processes running? 9? Is there a risk of making too many? Does Gunicorn assume that each worker will not fork or does it not care?
  2. If a web-worker dies or is killed after starting the ProcessPool, will it be closed by the context manager properly?
  3. Is this a sane thing to do? What are the alternatives?
Neil asked Oct 24 '20 14:10



1 Answer

Great question! With Python multiprocessing, there are 3 "start methods" that can be used, and they all have implications for your questions. As the docs explain, they are:

  • 'spawn': The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver. Available on Unix and Windows. The default on Windows and macOS.
  • 'fork': The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic. Available on Unix only. The default on Unix.
  • 'forkserver': When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited. Available on Unix platforms which support passing file descriptors over Unix pipes.
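
To see which of these start methods a given machine actually offers, you can ask the multiprocessing module directly. This is a minimal sketch; pick_context is a hypothetical helper name, not part of the standard library.

```python
import multiprocessing

# Start methods supported on this platform; per the docs, the first entry
# is the platform default.
available = multiprocessing.get_all_start_methods()
print("available start methods:", available)


def pick_context(preferred="spawn"):
    """Return a context for `preferred`, or for the platform default if it is unavailable."""
    method = preferred if preferred in available else available[0]
    return multiprocessing.get_context(method)
```

The returned context object exposes the same Pool, Process, and Queue names as the multiprocessing module itself.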

As for Gunicorn's pre-fork model, you've explained it well. Each of the workers is running in its own process. Since you're trying to use multiprocessing within a worker, rather than alongside Gunicorn, this should be doable, but will still be a bit error-prone.

import multiprocessing

mp = multiprocessing.get_context('spawn')

This code gives us the mp object, which has the same API as the multiprocessing module, but with a set start method. In the case of the code above, it is set to 'spawn'. This is the safest route for using multiprocessing within a Gunicorn worker, as it is the most isolated from the process that created it, and less likely to run into problems around accidentally shared resources.

with mp.Pool(processes=4) as pool:
    pool.map(some_expensive_function, some_data)

We then use the mp object to create a process pool as you've done. This code must be inside a function/module that is only called/used within the worker processes. If it is used within the server process it could cause problems.
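
Put together, the two snippets above might be wrapped in a function that a request handler calls. This is only a sketch: run_in_pool is a hypothetical name, and the body of some_expensive_function is a placeholder for your real workload. One thing to keep in mind with 'spawn' is that the function passed to pool.map is pickled by reference, so it has to be defined at module level where the child processes can import it.

```python
import multiprocessing


def some_expensive_function(x):
    # Placeholder for the real CPU-heavy work (assumption for illustration).
    # Defined at module level so that spawned children can import it.
    return x * x


def run_in_pool(some_data, processes=4):
    """Run some_expensive_function over some_data in a fresh 'spawn' pool."""
    mp = multiprocessing.get_context("spawn")
    # The pool lives only for the duration of this call; leaving the
    # with-block terminates its worker processes.
    with mp.Pool(processes=processes) as pool:
        return pool.map(some_expensive_function, some_data)
```

A Flask view would then call run_in_pool(some_data) and serialize the result. Creating a new pool per request adds startup cost on every call, which is another thing worth measuring.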

  1. Will additional OS processes be started? Will the speedup be what I expect? (I.e., similar to if I ran the ProcessPool outside of a Flask production context?)

Quite a lot of questions packed in here. Additional OS processes will be started. The speedup could vary massively, and will depend on a number of factors such as:

  • How many other processes are running? How many worker processes is Gunicorn running?
  • Is the server under heavy load?
  • How many cores does the processor have?
  • How parallelizable is the work? Does some_expensive_function(data_1) have to wait for some_expensive_function(data_2) before it can do its work?

To figure out whether multiprocessing is faster, and by how much, you'll have to test it. The best you can do before that is form a rough estimate based on factors like those listed above.

  1. (cont.) If Gunicorn created 4 web workers, will there now be 7 OS processes running? 9? Is there a risk of making too many? Does Gunicorn assume that each worker will not fork or does it not care?

If there are 4 Gunicorn worker processes, and each of them is fulfilling a request that uses multiprocessing with 4 processes, then there will be 1 Gunicorn parent process + 4 worker processes + 4 * 4 worker subprocesses = 21 processes, not to mention the processes being used by Nginx.
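
The arithmetic generalizes to other configurations. The helper below is a hypothetical sketch that counts only the Gunicorn master, its workers, and the pool children; it ignores Nginx and any helper processes multiprocessing may start on its own.

```python
def total_processes(gunicorn_workers, pool_size, busy_workers=None):
    """Count the master, the workers, and the pool children of busy workers.

    busy_workers is how many workers are currently inside a request that
    opened a pool; by default all of them are assumed to be busy.
    """
    if busy_workers is None:
        busy_workers = gunicorn_workers
    return 1 + gunicorn_workers + busy_workers * pool_size


print(total_processes(4, 4))                  # every worker busy
print(total_processes(4, 4, busy_workers=1))  # only one worker busy
```

With 4 workers and a pool of 4, this gives 21 when every worker is busy and 9 when only one worker is handling a request.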

Gunicorn recommends you create (2 * num_cores) + 1 workers, but in your case you may want to decrease that, perhaps by dividing it by 4, to account for the fact that your worker processes themselves work best when using multiple cores. To find the most efficient configuration, you'll have to benchmark various configurations to find out what works best for you.
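
As a starting point for those benchmarks, a Gunicorn config file (which is plain Python) could derive the worker count like this. suggested_workers is a hypothetical helper, and dividing by the pool size is just the rough heuristic described above, not a Gunicorn recommendation.

```python
import os


def suggested_workers(num_cores, pool_size=1):
    """(2 * num_cores) + 1, scaled down by the per-request pool size, never below 1."""
    base = 2 * num_cores + 1
    return max(1, base // pool_size)


# In gunicorn.conf.py, `workers` is the setting Gunicorn reads.
workers = suggested_workers(os.cpu_count() or 1, pool_size=4)
```

On an 8-core machine with a pool of 4 this yields 4 workers instead of 17.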

  2. If a web-worker dies or is killed after starting the ProcessPool, will it be closed by the context manager properly?

This depends on how the worker dies. If it is killed via SIGKILL, or encounters a segmentation fault, or some other critical error, then it will abruptly die without getting to run any finalization code. The context manager can only do its job in the cases where a try-finally block would be able to execute the 'finally' block. For more about that, check out this answer: Does 'finally' always execute in Python?
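
You can observe the difference with a small experiment. This sketch is POSIX-only (it relies on SIGKILL), and run_child is a hypothetical helper: it launches a child Python process that either exits normally or kills itself inside a try/finally block.

```python
import signal
import subprocess
import sys

# Script run in the child: prints "start", optionally SIGKILLs itself,
# and has a finally block that prints "cleanup".
CHILD = """
import os, signal, sys
try:
    print("start", flush=True)
    if sys.argv[1] == "kill":
        os.kill(os.getpid(), signal.SIGKILL)
finally:
    print("cleanup", flush=True)
"""


def run_child(mode):
    """Run the child script with mode 'ok' or 'kill'; return (returncode, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout
```

In the 'ok' case the child prints both lines and exits with code 0. In the 'kill' case the return code is the negated signal number and "cleanup" never appears, which is the same situation a context manager's exit code is in when the worker is killed this way.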

  3. Is this a sane thing to do? What are the alternatives?

It's not insane per se, but it's not the kind of approach I'd generally recommend. One alternative would be to have some_expensive_function implemented with its own server. Your Gunicorn workers could use IPC or network communication to send work to the some_expensive_function server process, and it would handle dividing this work among subprocesses. One advantage of a design like that is that the some_expensive_function server process can easily be moved to run on another computer if performance demands it.

It's similar to how databases are generally run as their own server process, and can either be located on the same computer or on a separate computer (potentially behind a load balancer for read-only queries, or a sharding configuration) depending on what performance requirements must be met.

If you decide to go that route, you may find the Python package Celery useful for distributing the work from the Gunicorn workers.
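
If you'd rather see the shape of the idea before adopting Celery, here is a minimal standard-library sketch of the request/response exchange between a web worker and a separate compute process, using multiprocessing.connection. The names serve_one_request and request_work are hypothetical, some_expensive_function is a placeholder, and there is no authentication, retry, or error handling.

```python
import threading
from multiprocessing.connection import Client, Listener


def some_expensive_function(x):
    # Placeholder for the real CPU-heavy work (assumption for illustration).
    return x * x


def serve_one_request(listener, compute):
    """Compute-server side: accept one connection and reply with compute(inputs)."""
    with listener.accept() as conn:
        some_data = conn.recv()
        conn.send(compute(some_data))


def request_work(address, some_data):
    """Web-worker side: send the inputs to the compute server and wait for the results."""
    with Client(address) as conn:
        conn.send(some_data)
        return conn.recv()
```

In a real compute server, serve_one_request would run in a loop and compute would hand the inputs to a long-lived process pool, so the pool is created once rather than per request. Passing an authkey to Listener and Client would be advisable if the socket is reachable from other machines.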


If you want to do this, you should probably be running Gunicorn with preload_app=True.

Will Da Silva answered Oct 26 '22 01:10
