
AWS Beanstalk: Exponential backoff for SQS?

We are using the worker tier on Beanstalk to send out webhooks. We need to use exponential backoff in case of any error when contacting the third party. However, it is unclear to me how this would work.

If the job fails and I call ChangeMessageVisibility with an increasing backoff time, I have two choices:

  1. Return a success 200. Then SQS will remove it from the queue - not good.
  2. Return an error code. Then SQS will override the message visibility to the default value?

From Environment Tiers - AWS Beanstalk:

A web application in a worker environment tier should only listen on the local host. When the web application in the worker environment tier returns a 200 OK response to acknowledge that it has received and successfully processed the request, the daemon sends a DeleteMessage call to the SQS queue so that the message will be deleted from the queue. (SQS automatically deletes messages that have been in a queue for longer than the configured RetentionPeriod.) If the application returns any response other than 200 OK, then Elastic Beanstalk waits to put the message back in the queue after the configured VisibilityTimeout period. If there is no response, then Elastic Beanstalk waits to put the message back in the queue after the InactivityTimeout period so that the message is available for another attempt at processing.

Elliot Chance asked Jul 06 '15 01:07


2 Answers

ChangeMessageVisibility has a limit of 12 hours and only applies to in-flight jobs (jobs that are currently being processed, where you want to tell SQS "I need more time to complete this").

The only solution is to create a new job in the queue with the same details plus a retry counter (in the message body or as a message attribute), and to set DelaySeconds using an exponential backoff based on retries + 1. The worker can then return 200 OK so that the original message is deleted.
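
A minimal sketch of that re-enqueue step, assuming a boto3 SQS client is used to send the message. The function names, the 30-second base delay, and the `retries` attribute name are all hypothetical choices for illustration, not part of the original answer. The block only builds the arguments for `send_message`, so it runs without AWS credentials.

```python
import json

MAX_DELAY_SECONDS = 900  # SQS caps DelaySeconds at 15 minutes


def backoff_delay(retries, base_seconds=30):
    """Delay before the next attempt: base * 2**retries, capped at the SQS limit."""
    return min(base_seconds * (2 ** retries), MAX_DELAY_SECONDS)


def build_retry_message(queue_url, payload, retries):
    """Build keyword arguments for SendMessage that re-enqueue a failed job.

    `retries` is the number of attempts already recorded on the failed job;
    the new message carries retries + 1 and a delay based on that value.
    """
    next_retries = retries + 1
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(payload),
        "DelaySeconds": backoff_delay(next_retries),
        "MessageAttributes": {
            # Hypothetical attribute name used to carry the retry counter.
            "retries": {"DataType": "Number", "StringValue": str(next_retries)},
        },
    }
```

With a real client this would be sent as `sqs.send_message(**build_retry_message(queue_url, payload, retries))`, where `sqs = boto3.client("sqs")`. Once the delay reaches 900 seconds it stops growing, which is the limitation discussed next.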

Unfortunately, DelaySeconds has a limit of 15 minutes (900 seconds), so to schedule a job further out than that you have a few options:

  1. Keep rescheduling the job every 15 minutes, but don't carry out the task until the retry count gets high enough. For a one-day delay this would run 95 jobs that do nothing until the 96th, which could generate a colossal number of dummy jobs.
  2. Put the job somewhere else (like a database or cache) and use a cron or some other scheduled process to put it back in the queue once a minimum timestamp is reached. The timestamp would be now + 1 day, for example.
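
Option 2 could be sketched roughly as follows, assuming SQLite as the "somewhere else" and a periodic process (cron or similar) that calls `release_due_jobs`. The table name, function names, and the `enqueue` callback are hypothetical; in practice `enqueue` would wrap a SendMessage call.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the store that holds jobs waiting for their run time."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS parked_jobs ("
        "id INTEGER PRIMARY KEY, body TEXT NOT NULL, run_at REAL NOT NULL)"
    )
    return db


def park_job(db, body, delay_seconds, now=None):
    """Store a job with the earliest timestamp at which it may be re-enqueued."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT INTO parked_jobs (body, run_at) VALUES (?, ?)",
        (body, now + delay_seconds),
    )
    db.commit()


def release_due_jobs(db, enqueue, now=None):
    """Re-enqueue every job whose run_at has passed; return how many were released."""
    now = time.time() if now is None else now
    rows = db.execute(
        "SELECT id, body FROM parked_jobs WHERE run_at <= ? ORDER BY run_at",
        (now,),
    ).fetchall()
    for job_id, body in rows:
        enqueue(body)  # send first, then delete: a crash may duplicate, not lose
        db.execute("DELETE FROM parked_jobs WHERE id = ?", (job_id,))
        db.commit()
    return len(rows)
```

Because each job is sent before its row is deleted, a crash between the two steps can deliver a job twice but should not drop it, so the webhook handler would need to tolerate duplicates.
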
Elliot Chance answered Oct 15 '22 09:10


There are pros and cons to using ChangeMessageVisibility to extend the visibility timeout of a failing job:

Pros:

  • you won't lose the job in the process of removing it and requeuing it.

Cons:

  • the 12h limit on how long a job can be in flight.
  • you can only have 20k in-flight jobs at a time.
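
To illustrate the 12h ceiling, here is a rough sketch of a visibility-based backoff. It assumes a consumer that holds the message's receipt handle and knows its receive count, which may not be the case behind the Beanstalk worker daemon. The function names and the 60-second base are hypothetical; the returned dictionary matches the arguments of boto3's `change_message_visibility`.

```python
MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # SQS caps the visibility timeout at 12 hours


def visibility_backoff(receive_count, base_seconds=60):
    """Timeout for a message that has been received `receive_count` times."""
    exponent = max(receive_count - 1, 0)  # first failure waits base_seconds
    return min(base_seconds * (2 ** exponent), MAX_VISIBILITY_SECONDS)


def build_visibility_change(queue_url, receipt_handle, receive_count):
    """Build keyword arguments for a ChangeMessageVisibility call."""
    return {
        "QueueUrl": queue_url,
        "ReceiptHandle": receipt_handle,
        "VisibilityTimeout": visibility_backoff(receive_count),
    }
```

With these numbers the doubling hits the cap at the eleventh receive, after which every retry waits the full 12 hours.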

So one idea to mitigate the cons would be to set up a redrive policy that moves the job to a dead-letter queue (DLQ) if it fails too many times.
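
As a sketch of what such a redrive policy might look like, the helper below builds the `RedrivePolicy` queue attribute as a JSON string. The helper name, the default of 5 receives, and the ARN in the usage note are hypothetical.

```python
import json


def redrive_attributes(dead_letter_queue_arn, max_receive_count=5):
    """Build the queue attributes that send repeatedly failing messages to a DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dead_letter_queue_arn,
        # Number of receives allowed before SQS moves the message to the DLQ.
        "maxReceiveCount": max_receive_count,
    }
    return {"RedrivePolicy": json.dumps(policy)}
```

With a boto3 client this could be applied as `sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=redrive_attributes("arn:aws:sqs:us-east-1:123456789012:webhooks-dlq"))`, where the ARN is a placeholder.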

metakungfu answered Oct 15 '22 08:10