
How to monitor queue health in celery

I have the following set-up:

  • Generic worker pool with 100 workers
  • High priority worker pool with 50 workers
  • I used such large numbers because my tasks spend most of their time waiting on I/O with very long timeouts (HTTP requests that can take up to 20s to respond)
  • Using RabbitMQ as the broker
  • I have set up celeryd as a daemon using the init.d scripts from Celery's GitHub repository, with the following parameters: CELERYD_OPTS="--time-limit=600 -c:low_p 100 -c:high_p 50 -Q:low_p low_priority_queue_name -Q:high_p high_priority_queue_name"

My problem is that sometimes the queue seems to "back up": it stops consuming tasks. There seem to be two scenarios for this:

  • There is a slow build-up of "unacknowledged" messages in the broker, even though celery inspect active shows that not all workers are busy; I only see a few active tasks
  • The queue will just stop consuming new tasks, without the buildup.
  • When in its "dead" state, using strace on the worker processes returns nothing... completely zero activity from the worker

I would appreciate any information or pointers on:

  • How I can debug it. I can use strace to see what the worker processes are doing, but so far that has only told me that the worker is hanging
  • How I can monitor this, and possibly do auto-recovery. There are many tools for managing celery (flower and events); both are excellent for real-time inspection, but neither has any automated monitoring/alarming functionality. Am I just better off writing my own monitoring tools with supervisord?

Also, I am starting my tasks from django-celery

asked by Goro, Jul 08 '13 16:07


1 Answer

A very basic queue watchdog can be implemented with just a single script that’s run every minute by cron. First, it fires off a task that, when executed (in a worker), touches a predefined file, for example:

with open('/var/run/celery-heartbeat', 'w'):
    pass

Then the script checks the modification timestamp on that file and, if it is more than a minute (or 2 minutes, or whatever threshold you choose) old, sends an alarm and/or restarts the workers and/or the broker.
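The timestamp check could look roughly like the sketch below. It only covers the checking half, using the standard library; enqueueing the heartbeat task and the actual alarm or restart are left out because they depend on your setup. The names heartbeat_age, is_stale, main and the 120-second threshold are illustrative assumptions, not part of Celery.

```python
import os
import sys
import time

HEARTBEAT_FILE = "/var/run/celery-heartbeat"  # same path the task touches
MAX_AGE_SECONDS = 120  # assumed threshold; pick one that suits your cron interval


def heartbeat_age(path, now=None):
    """Return seconds since `path` was last modified, or None if it is missing."""
    if now is None:
        now = time.time()
    try:
        return now - os.path.getmtime(path)
    except OSError:
        # File does not exist (or is unreadable): no heartbeat was ever written.
        return None


def is_stale(path, max_age, now=None):
    """True if the heartbeat file is missing or older than `max_age` seconds."""
    age = heartbeat_age(path, now)
    return age is None or age > max_age


def main():
    """Return a shell-style exit code: 0 if the queue looks healthy, 1 if not."""
    if is_stale(HEARTBEAT_FILE, MAX_AGE_SECONDS):
        # Placeholder reaction: replace with your own alarm or restart logic.
        sys.stderr.write("celery heartbeat is stale\n")
        return 1
    return 0
```

A cron wrapper would call sys.exit(main()) right after sending the heartbeat task for the next round, so a non-zero exit status can trigger whatever alerting cron already provides. Treating a missing file as stale is a design choice: it also catches the case where no worker has ever run the task.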

It gets a bit trickier if you have multiple machines, but the same idea applies.

answered by Vasiliy Faronov, Sep 28 '22 04:09