
celery missed heartbeat (on_node_lost)

I just upgraded to Celery 3.1 and now I see this in my logs for every queue/worker in my cluster:

on_node_lost - INFO - missed heartbeat from celery@queue_name

According to the docs BROKER_HEARTBEAT is off by default and I haven't configured it.

Should I explicitly set BROKER_HEARTBEAT=0 or is there something else that I should be checking?

Douglas Ferguson asked Jan 15 '14 08:01


3 Answers

Saw the same thing, and noticed a couple of things in the log files.

1) There were messages about time drift at the start of the log and occasional missed heartbeats.

2) At the end of the log file, the drift messages went away and only the missed heartbeat messages were present.

3) There were no changes to the system when the drift messages went away... They just stopped showing up.

I figured that the drift itself was likely the problem.

After syncing the time on all the servers involved, these messages went away. On Ubuntu, either run ntpdate from a cron job or run the ntpd daemon.
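To confirm that clocks are out of sync before changing anything, you can compare a timestamp taken on one server with the time it is received on another. This is only a minimal sketch: the function names and the 16-second threshold are my own assumptions for illustration, not part of Celery.

```python
import time

# Hypothetical threshold in seconds; choose a value that suits your cluster.
MAX_DRIFT_SECONDS = 16.0


def clock_drift(sent_at, received_at):
    """Absolute difference between the sender's timestamp and the local receive time."""
    return abs(received_at - sent_at)


def is_drifting(sent_at, received_at=None, max_drift=MAX_DRIFT_SECONDS):
    """Return True when the two clocks disagree by more than max_drift seconds."""
    if received_at is None:
        received_at = time.time()  # fall back to the local clock
    return clock_drift(sent_at, received_at) > max_drift
```

Note that the measured difference also includes network latency, so a small nonzero value is expected even when the clocks agree.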

user3691996 answered Nov 07 '22 19:11


Celery 3.1 added the new mingle and gossip procedures. I too was getting a ton of missed heartbeats, and passing --without-gossip to my workers cleared it up. The relevant sections of the release notes are quoted below.

https://docs.celeryproject.org/en/3.1/whatsnew-3.1.html#mingle-worker-synchronization

Mingle: Worker synchronization

The worker will now attempt to synchronize with other workers in the same cluster.

Synchronized data currently includes revoked tasks and logical clock.

This only happens at startup and causes a one second startup delay to collect broadcast responses from other workers.

You can disable this bootstep using the --without-mingle argument.

https://docs.celeryproject.org/en/3.1/whatsnew-3.1.html#gossip-worker-worker-communication

Gossip: Worker <-> Worker communication

Workers are now passively subscribing to worker related events like heartbeats.

This means that a worker knows what other workers are doing and can detect if they go offline. Currently this is only used for clock synchronization, but there are many possibilities for future additions and you can write extensions that take advantage of this already.

Some ideas include consensus protocols, reroute task to best worker (based on resource usage or data locality) or restarting workers when they crash.

We believe that although this is a small addition, it opens amazing possibilities.

You can disable this bootstep using the --without-gossip argument.
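If you start your workers from a script, the flags quoted above can be added to the command line like this. It is a rough sketch: `worker_argv` is a helper I made up, and `proj` is a placeholder for your own app module. Only `--without-gossip` and `--without-mingle` come from the Celery docs.

```python
def worker_argv(app, without_gossip=True, without_mingle=False):
    """Build the argument list for a `celery worker` command.

    `app` is the name of your Celery app module (hypothetical: "proj").
    """
    argv = ["celery", "-A", app, "worker"]
    if without_gossip:
        argv.append("--without-gossip")  # stop subscribing to other workers' events
    if without_mingle:
        argv.append("--without-mingle")  # skip the startup synchronization step
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so it can be checked first.
    print(" ".join(worker_argv("proj")))
```

The resulting list can be handed to `subprocess.call` once the command looks right.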

user3204501 answered Nov 07 '22 19:11


I'm having a similar issue. I have found the reason in my case.

I have two servers running workers.

When I ping one server from the other, I see that whenever the ping time is larger than 2 seconds, the log shows "missed heartbeat from celery@". The default heartbeat interval is 2 seconds.

The cause is my poor network. http://docs.celeryproject.org/en/latest/internals/reference/celery.worker.heartbeat.html
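To see how often the network delays heartbeats past the 2-second interval, you can record arrival times and look for gaps. This is a simplified model of what I observed, not how Celery itself decides that a node is lost, and `late_heartbeats` is a made-up helper name.

```python
def late_heartbeats(arrival_times, interval=2.0):
    """Return (previous, current) pairs whose gap exceeds the heartbeat interval.

    arrival_times: ascending timestamps (in seconds) at which heartbeats arrived.
    interval: expected seconds between heartbeats (2.0 is the default cited above).
    """
    late = []
    for previous, current in zip(arrival_times, arrival_times[1:]):
        if current - previous > interval:
            late.append((previous, current))
    return late
```

For example, `late_heartbeats([0.0, 2.0, 4.0, 9.0, 11.0])` reports the single gap between 4.0 and 9.0.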

mutex86 answered Nov 07 '22 20:11