
How to check Resque worker status to determine whether it's dead or stale

The default resque web interface says that I have 5 of 7 workers working. I don't understand how this could be happening.

I'm on heroku, so when my dyno restarts, it should spin down existing dynos and workers, then spin up new ones. So, I'm assuming some of these workers are stale, but resque thinks there are so many more workers working than there should be... (there should only be 1)

How can I check whether any of these are stale or dead? I expect to see only 1 worker working.

Eventually, I expect I'll do whatever this SO post says: "How do I clear stuck/stale Resque workers?", but first I'd like to know how to determine whether a worker should be removed... I don't want to blindly unregister workers...

Apologies if this is an obvious question. I'm new to resque.

Thanks!

user5243421 asked May 07 '15 21:05



1 Answer

The only way to determine whether a worker is actually working is to check on the worker's host machine. After a restart on Heroku, that machine no longer exists, so if the worker didn't unregister itself, Resque will still believe it is working. The decentralized nature of Resque workers means that you can't easily check their actual status. When each worker starts, it registers itself with Redis. When that worker picks up a job and starts working, it registers its status with Redis again. When you iterate like so:

Resque.workers.each { |w| puts "#{w}: #{w.working?}" }

you are pulling a list of workers from Redis and checking the last registered state of those workers from Redis. It doesn't actually query the worker itself.
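To see what that last registered state looks like, here is a minimal sketch that summarizes each registry entry. It assumes the worker objects respond to `to_s`, `working?`, and `job` the way `Resque::Worker` does, and that the job hash carries a `run_at` timestamp string; the helper name `registry_report` is hypothetical and not part of Resque.

```ruby
require "time"

# Hypothetical helper: summarize what Redis last recorded for each worker.
# `workers` is any enumerable whose items respond to to_s, working? and job
# (for example Resque.workers). Nothing here contacts the worker process.
def registry_report(workers, now: Time.now)
  workers.map do |w|
    job = w.working? ? (w.job || {}) : {}
    run_at = job["run_at"]
    # Assumption: run_at is a timestamp string that Time.parse understands.
    age = run_at ? (now - Time.parse(run_at)).round : nil
    { id: w.to_s, working: w.working?, job_age_seconds: age }
  end
end

# Usage inside an app that has Resque loaded (not run here):
#   registry_report(Resque.workers).each { |row| puts row.inspect }
```

A job that has supposedly been running for far longer than your jobs normally take is a hint, not proof, that the entry is stale.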

The hostnames in the resque-web display will match the names you see in the Heroku log output, so that's one (not very good) way to see what's actually running. I was hoping one could automate this by using the dyno IDs obtained from the Platform API, but they don't match the hostnames.
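The manual hostname comparison can be sketched roughly as follows. Resque worker IDs have the form `hostname:pid:queues`, so the hostname is everything before the first colon. The helper name `workers_on_unknown_hosts` is hypothetical, and the list of live hostnames is something you would have to collect by hand from the logs.

```ruby
# Hypothetical helper: split Resque worker IDs ("hostname:pid:queues") and
# return the ones whose hostname is not in a list of hosts known to be alive.
def workers_on_unknown_hosts(worker_ids, live_hostnames)
  worker_ids.reject do |id|
    hostname = id.to_s.split(":", 2).first
    live_hostnames.include?(hostname)
  end
end

# Usage (not run here): copy the hostnames you see in `heroku logs`, then
#   workers_on_unknown_hosts(Resque.workers.map(&:to_s), ["<hostname from logs>"])
```

Anything this returns is only a candidate for removal; it is worth double-checking against the logs before unregistering.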

Make sure that you are gracefully handling Resque::TermException as specified in this document. You could also look into some of the heartbeat solutions others have come up with to work around this problem. I've had issues where even using TERM_CHILD and proper signal handling leaves stale workers floating around. My solution has been to wait until no jobs are being processed, unregister all workers, then restart with heroku ps:restart worker.
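The "wait, unregister, restart" routine could look something like this minimal sketch. It relies on `Resque.working`, `Resque.workers`, and `Resque::Worker#unregister_worker`; the helper name `unregister_all_when_idle` and its polling parameters are hypothetical choices for illustration.

```ruby
# Hypothetical helper: wait until the registry reports no busy workers, then
# unregister every entry. `registry` must respond to `working` and `workers`
# (as the Resque module does); each worker must respond to unregister_worker.
def unregister_all_when_idle(registry, poll_seconds: 5, max_polls: 60)
  max_polls.times do
    if registry.working.empty?
      stale = registry.workers
      stale.each(&:unregister_worker)
      return stale.size
    end
    sleep poll_seconds
  end
  nil # still busy after max_polls checks; nothing was unregistered
end

# Usage from a console with Resque loaded (not run here):
#   unregister_all_when_idle(Resque)
# followed by `heroku ps:restart worker` from your shell.
```

Note that a worker that died mid-job stays registered as working, so the loop can time out and return `nil`; in that case you still have to decide by hand which entries to remove.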

Lukas Eklund answered Sep 20 '22 03:09
