How to monitor queue health in celery

Question

I have the following set-up:

Generic worker pool with 100 workers
High priority worker pool with 50 workers
I used such large numbers because my tasks spend most of their time waiting on I/O with very long timeouts (HTTP requests that can take up to 20s to respond)
Using RabbitMQ as the broker
I have set up celeryd as a daemon using the init.d scripts from Celery's GitHub repository, with the following parameters: CELERYD_OPTS="--time-limit=600 -c:low_p 100 -c:high_p 50 -Q:low_p low_priority_queue_name -Q:high_p high_priority_queue_name"

My problem is that the queue sometimes seems to "back up", that is, it stops consuming tasks. There seem to be two scenarios for this:

There is a slow build-up of "unacknowledged" messages in the broker, even though celery inspect active will show that not all workers are used up - that is, I will only see a few active tasks
The queue will just stop consuming new tasks, without the buildup.
When in its "dead" state, using strace on the worker processes returns nothing... completely zero activity from the worker

I would appreciate any information or pointers on:

How I can debug it. I can use strace to see what the worker processes are doing, but so far that has only been useful in telling me that the worker is hanging
How I can monitor this, and possibly do auto-recovery. There are many tools for managing celery (flower and events), and they are both excellent in real time, but they don't have any automated monitoring/alarming functionality. Am I just better off writing my own monitoring tools with supervisord?

Also, I am starting my tasks from django-celery

Vasiliy Faronov · Accepted Answer

A very basic queue watchdog can be implemented with just a single script that’s run every minute by cron. First, it fires off a task that, when executed (in a worker), touches a predefined file, for example:

with open('/var/run/celery-heartbeat', 'w'):
    pass

Then the script checks the modification timestamp on that file and, if it’s more than a minute (or 2 minutes, or whatever) old, sends an alarm and/or restarts the workers and/or the broker.
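As a rough illustration of both halves of this watchdog, here is a minimal sketch using only the standard library. The function names, the 120-second threshold, and the `now` parameter are hypothetical choices made for this example, not part of Celery. It assumes you register `touch_heartbeat` as a task in your own Celery app and enqueue it from the cron script before running the check.

```python
import os
import time

HEARTBEAT_FILE = "/var/run/celery-heartbeat"  # same path as in the snippet above
MAX_AGE_SECONDS = 120  # assumed threshold; tune it to your cron interval


def touch_heartbeat(path=HEARTBEAT_FILE):
    """Body of the heartbeat task: rewriting the file updates its mtime.

    Assumption: you register this function as a task in your own Celery app
    and the cron script enqueues it on every run.
    """
    with open(path, "w"):
        pass


def heartbeat_age(path=HEARTBEAT_FILE, now=None):
    """Return seconds since `path` was last modified, or None if it is missing."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path)
    except OSError:
        return None


def queue_is_healthy(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """True only if the heartbeat file exists and was touched recently enough."""
    age = heartbeat_age(path, now)
    return age is not None and age <= max_age
```

The cron script would enqueue the task, call `queue_is_healthy()`, and on `False` send the alarm or restart the workers in whatever way suits your setup. Treating a missing file as unhealthy means a queue that has never processed a heartbeat is also reported.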

It gets a bit trickier if you have multiple machines, but the same idea applies.

Tags:

python

rabbitmq

celery

django-celery

Goro
