Retrying failed jobs in RQ

We are using RQ with our WSGI application. We run several worker processes on different back-end servers, each connecting to (possibly) several different task servers. To better configure this setup, we use a custom management layer in our system which takes care of running workers, setting up the task queues, etc.

When a job fails, we would like to implement a retry that re-runs the job several times after an increasing delay, eventually either completing it or letting it fail and logging an error entry in our logging system. However, I am not sure how this should be implemented. I have already created a custom worker script which allows us to log errors to our database, and my first attempt at a retry was something along the lines of this:

# This handler would ideally wait some time, then requeue the job.
# 'retry' is the meta key we use to track the current backoff delay.
def worker_retry_handler(job, exc_type, exc_value, tb):
    print('Running retry handler.')
    current_retry = job.meta.get('retry', 2)

    if current_retry >= 129600:
        log_error_message('Job catastrophic failure.', ...)
    else:
        current_retry *= 2

        log_retry_notification(current_retry)
        job.meta['retry'] = current_retry
        job.save()
        time.sleep(current_retry)

        job.perform()

    # Returning False should stop any further exception handlers.
    return False

As I mentioned, we also have a function in the worker file which correctly resolves the server to which it should connect, and can post jobs. The problem is not necessarily how to publish a job, but what to do with the job instance that you get in the exception handler.

Any help would be greatly appreciated. Any suggestions or pointers on better ways to do this would also be great. Thanks!

Asked Jan 17 '13 by Juan Carlos Coto

1 Answer

I see two possible issues:

  1. You should make sure the handler has a return value: returning False prevents the default exception handling from happening to the job (see the last section of this page: http://python-rq.org/docs/exceptions/).

  2. I think by the time your handler gets called the job is no longer queued. I'm not 100% positive (especially given the docs that I pointed to above), but if it's on the failed queue, you can call requeue_job(job.id) to retry it. If it's not (which it sounds like it won't be), you could probably grab the proper queue and enqueue to it directly.
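A minimal sketch of point 2: track the attempt count in job.meta and re-enqueue the job with an exponential backoff. The backoff logic is kept as a pure function so it runs without Redis; the RQ calls are shown only in comments, because the exact requeue API (requeue_job, Queue.enqueue_job) varies between RQ versions, so treat those names as assumptions to verify against your installed version.

```python
# Backoff bookkeeping for a retry handler: double the delay stored in
# job.meta['retry'] on each failure, and give up past a ceiling.
MAX_DELAY = 129600  # same give-up ceiling as the question's handler


def next_retry_delay(job_meta):
    """Double the stored delay and return it, or None once past MAX_DELAY."""
    current = job_meta.get('retry', 2)
    if current >= MAX_DELAY:
        return None  # give up; caller should log a catastrophic failure
    job_meta['retry'] = current * 2
    return current * 2


# In the actual handler you would do roughly (names unverified, see lead-in):
#
#   def retry_handler(job, exc_type, exc_value, tb):
#       delay = next_retry_delay(job.meta)
#       if delay is None:
#           log_error_message('Job catastrophic failure.', ...)
#       else:
#           job.save()
#           time.sleep(delay)
#           Queue(job.origin, connection=redis_conn).enqueue_job(job)
#       return False  # stop the default exception handler
#
# job.origin holds the name of the queue the job came from, which is how
# you would "grab the proper queue" per point 2 above.

# Demo of the backoff schedule with a plain dict standing in for job.meta:
meta = {}
schedule = []
while (d := next_retry_delay(meta)) is not None:
    schedule.append(d)
print(schedule[:5])  # [4, 8, 16, 32, 64]
```

Re-enqueueing a fresh copy of the job (rather than calling job.perform() inside the handler, as the question's code does) lets any available worker pick it up and keeps the failing worker from blocking in time.sleep for hours at the larger delays.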

Answered Oct 21 '22 by Borys