Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Celery: Remote workers frequently losing connection

I have a Celery broker running on a cloud server (Django app), and two workers on local servers in my office connected behind a NAT. The local workers frequently lose connection, and have to be restarted to re-establish connection with the broker. Usually celeryd restart hangs the first time I try it, so I have to ctr+C and retry once or twice to get it back up and connected. The workers' logs two most common errors:

[2014-08-03 00:08:45,398: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 278, in start
    blueprint.start(self)
  File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 796, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/loops.py", line 72, in asynloop
    next(loop)
  File "/usr/local/lib/python2.7/dist-packages/kombu/async/hub.py", line 320, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 159, in on_readable
    reader(loop)
  File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 142, in _read
    raise ConnectionError('Socket was disconnected')
ConnectionError: Socket was disconnected

[2014-03-07 20:15:41,963: CRITICAL/MainProcess] Couldn't ack 11, reason:RecoverableConnectionError(None, 'connection already closed', None, '')
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 93, in ack_log_error
    self.ack()
  File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 88, in ack
    self.channel.basic_ack(self.delivery_tag)
  File "/usr/local/lib/python2.7/dist-packages/amqp/channel.py", line 1583, in basic_ack
    self._send_method((60, 80), args)
  File "/usr/local/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 50, in _send_method
    raise RecoverableConnectionError('connection already closed')

How do I go about debugging this? Is the fact that the workers are behind a NAT an issue? Is there a good tool to monitor whether the workers have lost connection? At least with that, I could get them back online by manually restarting the worker.

like image 402
Neil Avatar asked Aug 04 '14 14:08

Neil


1 Answers

Unfortunately yes, there is a problem with late acks in Celery+Kombu - task handler tries to use closed connection. I worked around it like this:

CELERY_CONFIG = {
    'CELERYD_MAX_TASKS_PER_CHILD': 1,
    'CELERYD_PREFETCH_MULTIPLIER': 1,
    'CELERY_ACKS_LATE': True,
}

CELERYD_MAX_TASKS_PER_CHILD - guarantees that worker will be restarted after finishing the task.

As for the tasks that already lost connection, there is nothing you can do right now. Maybe it'll be fixed in version 4. I just make sure that the tasks are as idempotent as possible.

like image 99
Pavel Kirichenko Avatar answered Oct 03 '22 03:10

Pavel Kirichenko