When we restart or deploy, we get a number of Resque jobs in the failed queue with either Resque::TermException (SIGTERM) or Resque::DirtyExit.
We're using the new TERM_CHILD=1 and RESQUE_TERM_TIMEOUT=10 settings in our Procfile, so our worker line looks like:
worker: TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 bundle exec rake environment resque:work QUEUE=critical,high,low
We're also using resque-retry, which I thought might auto-retry on these two exceptions, but it seems not to.
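For reference, here's how a job is typically configured with resque-retry so that these shutdown exceptions are explicitly retried. This is a sketch, not verified against any particular resque-retry version; the job name, queue, and retry values are illustrative:

```ruby
class ArchiveJob
  extend Resque::Plugins::Retry

  @queue = :critical

  # Explicitly list the exceptions raised on worker shutdown so they
  # are retried (by default resque-retry retries all exceptions, so
  # check whether a narrower @retry_exceptions elsewhere is excluding
  # these two).
  @retry_exceptions = [Resque::TermException, Resque::DirtyExit]
  @retry_limit = 3   # illustrative values
  @retry_delay = 60

  def self.perform(*args)
    # job body
  end
end
```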
So I guess two questions:
Is resque-retry supposed to auto-retry on these exceptions, and if so, why isn't it?
I could rescue Resque::TermException in each job and use this to reschedule the job, but is there a clean way to do this for all jobs? Even a monkey patch. Thanks!
Edit: Getting all jobs to complete in less than 10 seconds seems unreasonable at scale. It seems like there needs to be a way to automatically re-queue these jobs when the Resque::DirtyExit exception is raised.
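The rescue-and-requeue idea can be illustrated in plain Ruby. Everything below (TermException, FakeQueue, drain) is a hypothetical stand-in, not real Resque code; it only demonstrates the pattern of putting an interrupted job back on the queue instead of losing it:

```ruby
# Stand-in for the exception a terminated worker would raise.
class TermException < StandardError; end

# Minimal in-memory queue standing in for a Resque queue.
class FakeQueue
  attr_reader :jobs

  def initialize(jobs)
    @jobs = jobs
  end

  def pop
    @jobs.shift
  end

  def push(job)
    @jobs.push(job)
  end
end

# Run jobs until the queue is empty; if a "SIGTERM" interrupts a job,
# requeue it so the next worker boot retries it instead of losing it.
def drain(queue, results)
  while (job = queue.pop)
    begin
      results << job.call
    rescue TermException
      queue.push(job) # put the interrupted job back on the queue
      break           # the worker is shutting down
    end
  end
end

results = []
interrupted = -> { raise TermException, "SIGTERM" }
queue = FakeQueue.new([-> { :done }, interrupted])
drain(queue, results)
results    # completed jobs: [:done]
queue.jobs # the interrupted job is back on the queue for the next worker
```

A real version would hook this rescue into the worker (or a Resque failure backend) rather than each job class, which is what the gem mentioned in the answer below effectively arranges.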
I ran into this issue as well. It turns out that Heroku sends the SIGTERM signal not just to the parent process but to all forked processes. This is not the behavior Resque expects, which causes the RESQUE_PRE_SHUTDOWN_TIMEOUT to be skipped, forcing jobs to be terminated immediately without any time to attempt to finish.
Heroku gives workers 30s to shut down gracefully after a SIGTERM is issued. In most cases this is plenty of time to finish a job, with some buffer left over to requeue it to Resque if it couldn't finish. However, for all of this time to be used, you need to set the RESQUE_PRE_SHUTDOWN_TIMEOUT and RESQUE_TERM_TIMEOUT env vars, as well as patch Resque to correctly respond to SIGTERM being sent to forked processes.
Here's a gem which patches resque and explains this issue in more detail:
https://github.com/iloveitaly/resque-heroku-signals
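With that gem in place, a Procfile worker line might look like the following. The specific timeout values are illustrative, not prescriptive; the key point is that the two timeouts together should stay under Heroku's 30s shutdown window:

```
worker: TERM_CHILD=1 RESQUE_PRE_SHUTDOWN_TIMEOUT=20 RESQUE_TERM_TIMEOUT=8 bundle exec rake environment resque:work QUEUE=critical,high,low
```

Here the worker gets up to 20s of pre-shutdown grace to finish the current job, then up to 8s after the child is signaled, leaving a small buffer before Heroku hard-kills the dyno.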