Using Celery for Realtime, Synchronous External API Querying with Gevent

I'm working on a web application that will receive a request from a user and have to hit a number of external APIs to compose the answer to that request. This could be done directly from the main web thread using something like gevent to fan out the request.
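For reference, this is roughly the kind of gevent fan-out I mean; the URLs and the fetch helper are just placeholders:

    import gevent
    from gevent import monkey
    monkey.patch_all()  # patch the stdlib so blocking sockets yield to other greenlets

    import requests

    API_URLS = [
        "https://api-one.example.com/search",   # placeholder endpoints
        "https://api-two.example.com/search",
    ]

    def fetch(url, params):
        return requests.get(url, params=params, timeout=5).json()

    def fan_out(params):
        jobs = [gevent.spawn(fetch, url, params) for url in API_URLS]
        gevent.joinall(jobs, timeout=10)
        return [job.value for job in jobs if job.successful()]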

Alternatively, I was thinking I could put incoming requests into a queue and use workers to distribute the load. The idea would be to keep it real time while splitting the requests among several workers, each of which would query only one of the many external APIs. The response each worker receives would then go through a series of transformations, be saved into a DB, be transformed to a common schema and saved in a common DB, and finally be composed into one big response that would be returned through the web request. The web request is most likely going to block all this time, with a user waiting, so keeping the queueing and dequeueing as fast as possible is important.

The external API calls can easily be turned into individual tasks. I think the linking from an API task to a transformation task to a DB-saving task could be done using a chain, and the final result, combining all the partial results, could be returned to the web thread using a chord.
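Concretely, I picture it as a chord of per-API chains, something like this sketch (all task names here are made up):

    from celery import chain, chord, shared_task

    @shared_task
    def query_api(api_name, params):
        # hit one external API; placeholder return value
        return {"api": api_name, "items": []}

    @shared_task
    def transform(raw):
        # map the raw payload onto the common schema
        return raw

    @shared_task
    def save(normalized):
        # persist to the common DB, return what the next step needs
        return normalized

    @shared_task
    def combine(results):
        # compose the final response from all per-API results
        return {"results": results}

    def compose_response(params):
        pipelines = [
            chain(query_api.s(api, params), transform.s(), save.s())
            for api in ("api_one", "api_two", "api_three")
        ]
        # run all pipelines in parallel, then feed their results to combine()
        return chord(pipelines)(combine.s())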

Some questions:

  • Can this (and should this) be done using celery?
  • I'm using django. Should I try to use django-celery over plain celery?
  • Each one of those tasks might spawn off other tasks - such as logging what just happened or other types of branching off. Is this possible?
  • Could tasks return the data they get - i.e. potentially KBs of data - through Celery (with Redis underneath in this case), or should they write to the DB and just pass pointers to that data around?
  • Each task is mostly I/O bound. I was initially just going to use gevent from the web thread to fan out the requests and skip the whole queueing design, but it turns out it will be reused for a different component. Trying to keep the whole round trip through the queues real time will probably require many workers to make sure the queues stay mostly empty. Or will it? Would running the gevent worker pool help with this?
  • Do I have to write gevent specific tasks or will using the gevent pool deal with network IO automagically?
  • Is it possible to assign priority to certain tasks?
  • What about keeping them in order?
  • Should I skip celery and just use kombu?
  • It seems like celery is geared more towards "tasks" that can be deferred and are not time sensitive. Am I nuts for trying to keep this real time?
  • What other technologies should I look at?

Update: Trying to hash this out a bit more. I did some reading on Kombu and it seems to be able to do what I'm thinking of, although at a much lower level than celery. Here is a diagram of what I had in mind. It's a simplified version, i.e. skipping the DB saving steps done by worker_2.

What seems to be possible with the raw queues accessible through Kombu is having a number of workers subscribe to a broadcast message. The type and number of subscribers do not need to be known by the publisher when using such a queue. Can something similar be achieved using Celery? It seems like if you want to make a chord, you need to know at runtime which tasks are going to be involved in it, whereas in this scenario you can simply add listeners to the broadcast and make sure they announce they are in the running to add responses to the final queue.
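From the docs it looks like Celery can consume from a Kombu broadcast queue, something along these lines (queue and task names are made up, and I haven't tried this end to end):

    # settings.py (Celery 3.x style settings)
    from kombu.common import Broadcast

    CELERY_QUEUES = (Broadcast("broadcast_searches"),)
    CELERY_ROUTES = {"proj.tasks.handle_search": {"queue": "broadcast_searches"}}

    # tasks.py -- every worker consuming from "broadcast_searches"
    # receives its own copy of each handle_search message
    from celery import shared_task

    @shared_task
    def handle_search(params):
        pass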

Update 2: I see there is the ability to broadcast. Can you combine this with a chord? In general, can you combine Celery with raw Kombu? This is starting to sound like a question about smoothies.

Asked May 02 '13 by Andres


1 Answer

I will try to answer as many of the questions as possible.

Can this (and should this) be done using celery?

Yes, you can.

I'm using django. Should I try to use django-celery over plain celery?

Django has good support for Celery, and django-celery will make your life much easier during development.

Each one of those tasks might spawn off other tasks - such as logging what just happened or other types of branching off. Is this possible?

You can start subtasks from within a task, using ignore_result=True for subtasks that exist only for their side effects.
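A minimal sketch of that pattern, with a made-up logging subtask:

    from celery import shared_task

    @shared_task(ignore_result=True)
    def log_event(event, payload):
        # side effect only; no result is stored in the backend
        print(event, payload)

    @shared_task
    def query_api(api_name, params):
        result = {"api": api_name, "items": []}  # placeholder for the real API call
        # fire-and-forget subtask; the parent task does not wait for it
        log_event.delay("api_queried", {"api": api_name})
        return result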

Could tasks return the data they get - i.e. potentially KBs of data - through Celery (with Redis underneath in this case), or should they write to the DB and just pass pointers to that data around?

I would suggest putting the results in the DB and passing the id around; that will keep your broker and workers happy, with less data transfer, pickling, etc.
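For example (ApiResponse is a hypothetical Django model; only the primary key travels through the broker):

    from celery import shared_task
    from myapp.models import ApiResponse  # hypothetical model

    @shared_task
    def save_response(raw):
        obj = ApiResponse.objects.create(payload=raw)
        return obj.pk  # pass only the id to the next task

    @shared_task
    def transform(response_id):
        obj = ApiResponse.objects.get(pk=response_id)
        # work on obj.payload, write the normalized form back, etc.
        return response_id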

Each task is mostly I/O bound. I was initially just going to use gevent from the web thread to fan out the requests and skip the whole queueing design, but it turns out it will be reused for a different component. Trying to keep the whole round trip through the queues real time will probably require many workers to make sure the queues stay mostly empty. Or will it? Would running the gevent worker pool help with this?

Since the process is I/O bound, gevent will definitely help here. However, how high the concurrency should be for a gevent-pooled worker is something I'm looking for an answer to as well.

Do I have to write gevent specific tasks or will using the gevent pool deal with network IO automagically?

Gevent does the monkey patching automatically when you use the gevent pool, but the libraries you use should still play well with gevent. Otherwise, if you're parsing some data with simplejson (which is written in C), that would block other gevent greenlets.
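The pool is selected when the worker starts, and the tasks themselves stay ordinary Python; a sketch (the concurrency value is just an example):

    # Start the worker with the gevent pool, e.g.:
    #   celery -A proj worker -P gevent -c 100
    from celery import shared_task
    import requests  # pure-Python socket I/O, cooperates with gevent's patching

    @shared_task
    def query_api(url, params):
        # the blocking HTTP call here yields to other greenlets under the gevent pool
        return requests.get(url, params=params, timeout=5).json()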

Is it possible to assign priority to certain tasks?

You cannot assign specific priorities to certain tasks, but you can route them to different queues and then have those queues listened to by a varying number of workers. The more workers a particular queue has, the higher the effective priority of the tasks on that queue.
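A sketch of that "priority via routing" setup (task and queue names are made up):

    # settings.py (Celery 3.x style)
    CELERY_ROUTES = {
        "proj.tasks.query_api": {"queue": "realtime"},
        "proj.tasks.log_event": {"queue": "background"},
    }

    # Then give the realtime queue more workers than the background one:
    #   celery -A proj worker -Q realtime -c 50
    #   celery -A proj worker -Q background -c 5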

What about keeping them in order?

A chain is one way to maintain order, and a chord is a good way to summarize. Celery takes care of it, so you don't have to worry about it. Even when using the gevent pool, it is still possible in the end to reason about the order of task execution.

Should I skip celery and just use kombu?

You can, if your use case will not grow into something more complex over time, if you are willing to manage your processes yourself instead of through celeryd + supervisord, and if you don't care about the task monitoring that comes with tools such as celerymon, flower, etc.
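If you do go the plain Kombu route, the SimpleQueue interface is about the smallest thing that works (the broker URL is a placeholder):

    from kombu import Connection

    with Connection("redis://localhost:6379/0") as conn:
        queue = conn.SimpleQueue("api_requests")
        queue.put({"query": "something"})            # publisher side
        message = queue.get(block=True, timeout=5)   # consumer side
        print(message.payload)
        message.ack()
        queue.close()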

It seems like celery is geared more towards "tasks" that can be deferred and are not time sensitive.

Celery supports scheduled tasks as well, if that is what you meant by that statement.
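For instance, a periodic task can be declared with a beat schedule (Celery 3.x style settings; the task name is made up):

    from datetime import timedelta

    CELERYBEAT_SCHEDULE = {
        "refresh-cache-every-minute": {
            "task": "proj.tasks.refresh_cache",  # hypothetical task
            "schedule": timedelta(minutes=1),
        },
    }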

Am I nuts for trying to keep this real time?

I don't think so. As long as your consumers are fast enough, it will be as good as real time.

What other technologies should I look at?

Pertaining to Celery, you should choose the result store wisely. My suggestion would be Cassandra, which is good for realtime data (both write- and query-wise). You can also use Redis or MongoDB; they come with their own set of problems as result stores, but a little tweaking in configuration can go a long way.
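Whichever store you pick, it comes down to the result-backend setting, for example (Celery 3.x style name; the URL is a placeholder):

    # settings.py
    CELERY_RESULT_BACKEND = "redis://localhost:6379/1"
    # The Cassandra and MongoDB backends are selected through the same setting,
    # plus their own backend-specific options.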

If you mean something completely different from Celery, you can look into asyncio (Python 3.5) and ZeroMQ to achieve the same thing. I can't comment more on that, though.

Answered Oct 04 '22 by Arshad Ansari