Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Retry logic in AWS step function

I am testing the retry logic of the step function. Theoretically following step function should have been retried to execute the lambda 3 times if it fails.

{
  "StartAt": "Bazinga",
  "States": {
    "Bazinga": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:ap-southeast-2:518815385770:function:errorTest:$LATEST",
        "Payload": {
          "Input.$": "$"
        }
      },
      "Retry" : [
        {
          "ErrorEquals": [ "States.All", "States.Timeout" ],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 1.0
        }
      ],
       "Next": "Fail"
    },
    "Fail": {
      "Type": "Fail"
    }
  }
}

The lambda it calls is set to timeout in 3 seconds. The lambda freezes for 4 seconds. This means the lambda times out and throws States.Timeout error. The code is given below:

function sleep(ms){
    return new Promise(resolve=>{
        setTimeout(resolve,ms)
    })
}

exports.handler = async (event) => {
    console.log('------------> executing ....')
    await sleep(4000)
};

The problem is step function does not retry the task. This can be confirmed from the following CloudWatch logs.


05:59:36
START RequestId: dd1a2ee9-f389-44be-aaa6-07f2ca7983b0 Version: $LATEST

05:59:36
2019-07-24T05:59:36.340Z dd1a2ee9-f389-44be-aaa6-07f2ca7983b0 INFO ------------> executing ....

05:59:39
END RequestId: dd1a2ee9-f389-44be-aaa6-07f2ca7983b0

05:59:39
REPORT RequestId: dd1a2ee9-f389-44be-aaa6-07f2ca7983b0 Duration: 3003.29 ms Billed Duration: 3000 ms Memory Size: 128 MB Max Memory Used: 26 MB

05:59:39
2019-07-24T05:59:39.317Z dd1a2ee9-f389-44be-aaa6-07f2ca7983b0 Task timed out after 3.00 seconds 

Not sure what went wrong. Any help is appreciated, thanks in advance.

like image 495
Asur Avatar asked Jul 24 '19 01:07

Asur


People also ask

How do you handle errors in Step Functions?

Handling a failure using Retry The function is retried twice with an exponential backoff between retries. This variant uses the predefined error code States. TaskFailed , which matches any error that a Lambda function outputs.

What is BackoffRate?

BackoffRate – the value of this key signifies the multiplier by which the retry interval (IntervalSeconds) increases after each retry attempt. For example, the first retry attempt will wait 3 seconds, and the second retry attempt will wait 4.5 seconds.

Does AWS Lambda Auto Retry?

When a function returns an error after execution, Lambda attempts to run it two more times by default. With Maximum Retry Attempts, you can customize the maximum number of retries from 0 to 2. This gives you the option to continue processing new events with fewer or no retries.

How do you avoid Lambda Retry?

To stop it from retrying, you should make sure that any error is handled and you tell Lambda that your invocation finished successfully by returning a non-error (or in Node, calling callback(null, <any>) . To do this you can enclose the entire body of your handler function with a try-catch.


2 Answers

To answer my own question, there were 2 problems with the retry logic I placed.

  1. States.All should have been States.ALL (notice the case of L)
  2. When the lambda timed out, the error being thrown was Lambda.Unknown instead of States.Timeout.

I updated my step function with following code and now it works:

{
  "StartAt": "Bazinga",
  "States": {
    "Bazinga": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:ap-southeast-2:518815385770:function:errorTest:$LATEST",
        "Payload": {
          "Input.$": "$"
        }
      },
      "Retry" : [
        {
          "ErrorEquals": [ "States.Timeout", "Lambda.Unknown" ],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 1.0
        }
      ],
       "Next": "Fail"
    },
    "Fail": {
      "Type": "Fail"
    }
  }
}
like image 150
Asur Avatar answered Oct 17 '22 16:10

Asur


Because you didn't define TimeoutSeconds in the ASL. Example:

"Type": "Task",
"Resource": "${FunctionArn}",
"TimeoutSeconds": 3,

Otherwise it will throw out Lambda.Unknown failure

like image 27
fizwan Avatar answered Oct 17 '22 15:10

fizwan