 

Distributed Celery scheduler

Tags:

python

celery

I'm looking for a distributed cron-like framework for Python and found Celery. However, the docs say: "You have to ensure only a single scheduler is running for a schedule at a time, otherwise you would end up with duplicate tasks." By default Celery uses celery.beat.PersistentScheduler, which stores the schedule in a local file.

So, my question: is there an implementation other than the default that can put the schedule "into the cluster" and coordinate task execution so that each task is run only once? My goal is to be able to run celerybeat with identical schedules on all hosts in the cluster.
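For reference, a schedule for the default scheduler is declared in the app configuration. A minimal sketch, where the app name, broker URL, and task name are placeholders of my own, not from the question:

```python
from celery import Celery
from celery.schedules import crontab

# Placeholder app and broker; substitute your own.
app = Celery("myapp", broker="amqp://localhost")

app.conf.beat_schedule = {
    "nightly-cleanup": {
        "task": "myapp.tasks.cleanup",          # hypothetical task
        "schedule": crontab(hour=0, minute=0),  # every night at midnight
    },
}
```

Running `celery -A myapp beat` starts the PersistentScheduler, which keeps its state in a local shelve file (celerybeat-schedule by default, configurable with --schedule). Each beat process keeps its own local state and publishes independently, which is why a second beat process on another host sends every task again.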

Thanks

asked Aug 10 '11 by Jonas Bergström

People also ask

Is celery distributed?

Celery is a distributed task queue written in Python, which works using distributed messages. Each execution unit in celery is called a task. A task can be executed concurrently on one or more servers using processes called workers.
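As a minimal illustration (the module, broker, and task names here are hypothetical):

```python
from celery import Celery

# Hypothetical broker URL; RabbitMQ, Redis, etc. all work.
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def add(x, y):
    return x + y
```

Start a worker on each machine with `celery -A tasks worker`; a call such as `add.delay(2, 3)` publishes a message to the broker, and whichever worker is free picks it up.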

How does celery distribute work?

Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system. It's a task queue with focus on real-time processing, while also supporting task scheduling.

How do you schedule celery tasks?

Some Celery Terminology: A task is just a Python function. You can think of scheduling a task as a time-delayed call to the function. For example, you might ask Celery to call your function task1 with arguments (1, 3, 3) after five minutes. Or you could have your function batchjob called every night at midnight.
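Expressed in code (assuming `task1` is a task registered on an app, as in the sketch above), the time-delayed example is a single `apply_async` call; the midnight example is a `beat_schedule` entry like the one shown earlier:

```python
# Call task1(1, 3, 3) five minutes (300 seconds) from now.
task1.apply_async(args=(1, 3, 3), countdown=300)
```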


1 Answer

tl;dr: No, celerybeat is not suitable for your use case. You have to run exactly one celerybeat process, otherwise your tasks will be duplicated.

I know this is a very old question. I'll try to give a short summary because I had the same problem/question (in 2018).

Some background: we're running a Django application (with Celery) in a Kubernetes cluster. Both the cluster (EC2 instances) and the pods (~containers) are autoscaled: simply put, I never know when or how many instances of the application are running.

It's your responsibility to run only one celerybeat process, otherwise your tasks will be duplicated. [1] There was a feature request for this in the Celery repository: [2]

Requiring the user to ensure that only one instance of celerybeat exists across their cluster creates a substantial implementation burden (either creating a single point-of-failure or encouraging users to roll their own distributed mutex).

celerybeat should either provide a mechanism to prevent inadvertent concurrency, or the documentation should suggest a best-practice approach.

After some time, this feature request was rejected by the author of Celery for lack of resources. [3] I highly recommend reading the entire thread on GitHub. People there recommend these projects/solutions:

  • https://github.com/ybrs/single-beat
  • https://github.com/sibson/redbeat
  • Use a locking mechanism (http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html#ensuring-a-task-is-only-executed-one-at-a-time); see the sketch after this list
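The locking approach boils down to wrapping the task body in a distributed mutex, so concurrent duplicates become no-ops. A minimal sketch using a Redis lock (the client, key name, and 300-second expiry are my assumptions; the linked cookbook uses the Django cache instead):

```python
import redis
from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")
client = redis.Redis()

@app.task
def singleton_job():
    # The timeout auto-expires the lock so a crashed worker cannot
    # hold it forever; choose it longer than the task's runtime.
    lock = client.lock("lock:singleton_job", timeout=300)
    if not lock.acquire(blocking=False):
        return  # another instance is already running; skip this duplicate
    try:
        ...  # the actual work goes here
    finally:
        lock.release()
```

For completeness: redbeat takes the other route and replaces the scheduler itself; per its README you start beat with `celery beat -S redbeat.RedBeatScheduler` and it stores the schedule (and a lock) in Redis.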

I did not try any of the above (I don't want another dependency in my app, and I don't like locking tasks: you need to deal with fail-over, etc.).

I ended up using a CronJob in Kubernetes (https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/).

[1] celerybeat - multiple instances & monitoring

[2] https://github.com/celery/celery/issues/251

[3] https://github.com/celery/celery/issues/251#issuecomment-228214951

answered Oct 19 '22 by illagrenan