I have a machine learning application which uses `flask` to expose an API (for production this is not a good idea, but even if I use `django` in the future, the idea of the question shouldn't change).
The main problem is how to serve multiple concurrent requests to my app. A few months back, `celery` was added to get around this problem, and the number of `celery` workers spawned was set equal to the number of cores on the machine. For very few users this looked fine, and it ran in production for some time. When the number of concurrent users increased, it became evident that we should run a performance test. It turned out the app can handle 20 concurrent users on a 30 GB, 8-core machine, without authentication and without any front-end, which does not look like a good number.
I didn't know there were things like application servers, web servers, and model servers. While googling this problem I found that `gunicorn` is a good application server for Python applications. My questions:

- Should I use `gunicorn` (or any other application server) along with `celery`, and why?
- If I drop `celery` and use only `gunicorn` with the application, can I still achieve concurrency? I have read somewhere that `celery` is not good for machine learning applications.
- What roles do `gunicorn` and `celery` each play, and how can we get the best out of both?

Note: The main goal is to maximize concurrency. Authentication will be added when serving in production, and a front-end application might come into play in between.
There is no shame in Flask. If in fact you just need a web API wrapper, `flask` is probably a much better choice than `django` (simply because `django` is huge and you'd be using only a fraction of its capability).
However, your concurrency problems are apparently stemming from the fact that you are doing some heavy-duty processing for each request. There is simply no way around that; if you require a certain amount of computational resources per request, you can't magic those up. From here on, it's a juggling act.
If you want a guaranteed immediate response, you need to have as many workers as potential simultaneous requests. This may involve load balancing over multiple servers, if you can't scrounge up enough resources on one server. (Cue `gunicorn`, a web application server, responsible for accepting connections and then distributing them to multiple application processes.)
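As a minimal sketch of that setup (the module name `app.py`, the `/predict` route, and the model call are all hypothetical placeholders), the Flask app is launched under gunicorn with one worker per core:

```python
# app.py -- a minimal Flask wrapper around a (hypothetical) model call.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Stand-in for the heavy per-request computation.
    return jsonify({"input": payload, "prediction": "stub"})

# Run under gunicorn rather than app.run(); e.g. on an 8-core machine:
#   gunicorn --workers 8 --bind 0.0.0.0:8000 app:app
# Each worker is a separate OS process, so up to 8 requests are served
# in parallel; further requests wait in gunicorn's backlog.
```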
If you are okay with not getting an immediate response, you can let things queue up. (Cue `celery`, a task queue, which worker processes use to retrieve the next thing to be done and to deposit results.) This works best if you don't need a response within the same request-response cycle: the client submits a job and gets back only an acknowledgement that the job has been received; a second request asks about the status of the job, and possibly fetches the results once it is finished. A sketch of that pattern follows.
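Here is a minimal sketch of the submit/acknowledge/poll pattern, assuming a local Redis broker and result backend; the module name `tasks.py`, the route paths, and the `run_model` task are hypothetical:

```python
# tasks.py -- the heavy work, defined as a celery task.
from celery import Celery

celery_app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@celery_app.task
def run_model(payload):
    # Placeholder for the expensive ML inference.
    return {"input": payload, "prediction": "stub"}
```

```python
# api.py -- Flask endpoints that only enqueue and poll, never block.
from flask import Flask, jsonify, request
from tasks import celery_app, run_model

app = Flask(__name__)

@app.route("/jobs", methods=["POST"])
def submit_job():
    task = run_model.delay(request.get_json())  # returns immediately
    return jsonify({"job_id": task.id}), 202    # acknowledgement only

@app.route("/jobs/<job_id>", methods=["GET"])
def job_status(job_id):
    result = celery_app.AsyncResult(job_id)
    body = {"job_id": job_id, "state": result.state}
    if result.successful():
        body["result"] = result.result
    return jsonify(body)
```

The workers are started separately (e.g. `celery -A tasks worker --concurrency 8`), so the number of web workers and the number of model workers can be tuned independently.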
Alternately, instead of Flask you could use `websockets` or Tornado to push the response out to the client when it becomes available (as opposed to the client polling for results, or waiting on a live HTTP connection and tying up a server process).
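A bare-bones Tornado sketch of that push model (the `/results` endpoint and the message format are illustrative only; in practice the result would come from a worker rather than being computed inline):

```python
# ws_server.py -- push a result to the client over a websocket when ready.
import tornado.ioloop
import tornado.web
import tornado.websocket

class ResultSocket(tornado.websocket.WebSocketHandler):
    def on_message(self, message):
        # In a real deployment the job would be handed off to a worker;
        # once it finishes, write_message pushes the result to the client
        # with no polling on the client's side.
        self.write_message({"status": "done", "echo": message})

if __name__ == "__main__":
    app = tornado.web.Application([(r"/results", ResultSocket)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```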