
AWS Step Functions history event limitation

I use Step Functions for a big loop. So far this worked without problems, but the day my loop exceeded 8000 iterations I ran into the error "Maximum execution history size", the limit being 25,000 events.

Is there a way to avoid recording the history events?

Otherwise, where can I easily migrate my step function (3 Lambda functions)? AWS Batch would require a lot of code rewriting.

Thanks a lot

blinard asked Jul 03 '17 10:07



2 Answers

One approach to avoid the 25k history event limit is to add a Choice state to your loop that checks a counter or boolean and decides whether to exit the loop.

Outside of the loop you can put a Lambda function that starts another execution (with a different ID). After this, your current execution completes normally and the new execution continues the work.

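Please note that "LoopProcessor" in the example below must return a "breakOutOfLoop" field in its output (read by the Choice state as "$.breakOutOfLoop") to break out of the loop. Its value has to be determined somewhere in your loop and passed through. Because "FirstState" also leads directly into "ChoiceState", its output needs to include that field as well.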

Depending on your use case, you may need to restructure the data you pass around. For example, if you are processing a lot of data, you may want to store it in S3 objects and pass the ARN as input/output through the state machine execution. If you are trying to do a simple loop, one easy way is to add a start offset (think of it as a global counter) that is passed into the execution as input; each LoopProcessor task then increments a counter (with the start offset as its initial value). This is similar to pagination solutions.
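To make the start-offset idea concrete, here is a minimal sketch of what a LoopProcessor Lambda could look like in Python. The field names (startOffset, counter, totalItems, done), the process_item placeholder and the per-execution budget of 1000 iterations are all assumptions for illustration, not part of the original setup. Each iteration records several history events, so the budget should stay well below the limit.

```python
# Hypothetical LoopProcessor Lambda; every name here is illustrative.
MAX_ITERATIONS_PER_EXECUTION = 1000  # assumed budget, kept well below the 25k-event limit


def process_item(index):
    """Placeholder for the real work done on one item."""
    return index


def handler(event, context=None):
    # Global position at which this execution started (0 for the very first one).
    start_offset = event.get("startOffset", 0)
    # Global position of the next item; starts at the offset.
    counter = event.get("counter", start_offset)
    total = event["totalItems"]

    process_item(counter)
    counter += 1

    done = counter >= total
    # Leave the loop once this execution has used up its iteration budget.
    budget_spent = (counter - start_offset) >= MAX_ITERATIONS_PER_EXECUTION

    return {
        "startOffset": start_offset,
        "counter": counter,
        "totalItems": total,
        "done": done,
        "breakOutOfLoop": done or budget_spent,
    }
```

The function returns the whole state each time, so the Choice state can read breakOutOfLoop and the next iteration can pick up the counter.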

Here is a basic example of the ASL structure to avoid the 25k history event limit:

{
  "Comment": "An example looping while avoiding the 25k event history limit.",
  "StartAt": "FirstState",
  "States": {

    "FirstState": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
      "Next": "ChoiceState"
    },

    "ChoiceState": {
      "Type" : "Choice",
      "Choices": [
        {
          "Variable": "$.breakOutOfLoop",
          "BooleanEquals": true,
          "Next": "StartNewExecution"
        }
      ],
      "Default": "LoopProcessor"
    },

    "LoopProcessor": {
      "Type" : "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:ProcessWork",
      "Next": "ChoiceState"
    },

    "StartNewExecution": {
      "Type" : "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:StartNewLooperExecution",
      "Next": "FinalState"
    },

    "FinalState": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
      "End": true
    }
  }
}
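The Lambda behind the "StartNewExecution" state could look roughly like the sketch below. It assumes the loop state carries hypothetical counter, totalItems and done fields, and that the state machine ARN is supplied through a STATE_MACHINE_ARN environment variable (also an assumption). The only AWS call used is boto3's start_execution on the Step Functions client. The client is passed in as a parameter so the logic can be tried out with a stub.

```python
import json
import uuid


def build_next_input(event):
    """Carry the global counter forward as the next execution's start offset."""
    return {
        "startOffset": event["counter"],
        "counter": event["counter"],
        "totalItems": event["totalItems"],
    }


def start_next_execution(sfn_client, state_machine_arn, event):
    """Start a follow-up execution unless all the work is already finished."""
    if event.get("done"):
        return {**event, "continued": False}

    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # Execution names must be unique, hence the random suffix.
        name=f"loop-{event['counter']}-{uuid.uuid4().hex[:8]}",
        input=json.dumps(build_next_input(event)),
    )
    return {**event, "continued": True, "nextExecutionArn": response["executionArn"]}


def handler(event, context=None):
    import os
    import boto3  # provided by the AWS Lambda Python runtime

    return start_next_execution(
        boto3.client("stepfunctions"), os.environ["STATE_MACHINE_ARN"], event
    )
```

Skipping the call when done is true means the last execution in the chain simply ends instead of starting an empty follow-up.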

[Image: Loop Processor Example]

Hope this helps!

SunnyD answered Oct 10 '22 10:10



To guarantee that every step runs, and runs in order, Step Functions stores the execution history after each state completes. This stored history is the reason for the limit on the execution history size.

Having said that, one way to work around this limit is to follow @SunnyD's answer. However, it has the following limitations:

  1. The invoker of the step function (if there is one) will not get the output for the complete data set. Instead, it gets the output of the first execution in the chain of executions.
  2. The limit on the execution history size may well change in future versions, so logic written around this number would require you to modify the code/configuration every time the limit is increased or decreased.

An alternative solution is to arrange the work as parent and child step functions. In this arrangement, the parent step function contains a task that loops through the entire data set and starts a new execution of the child step function for each record or batch of records (a batch small enough that a child SF will not exceed the execution history limit). The second step in the parent step function waits for a period of time, then checks the CloudWatch metrics for the completion of all child executions and exits with the output.
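A rough sketch of the parent's two tasks in Python follows. The helper names and the {"records": [...]} input shape are made up for illustration. Note that it polls each child with describe_execution instead of reading CloudWatch metrics, which is a substitute for the approach described above, chosen because it is easier to show in a few lines. Execution input has a size limit, so large batches may need to be passed by reference.

```python
import json


def chunk(records, batch_size):
    """Split records into consecutive batches of at most batch_size items."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def start_children(sfn_client, child_state_machine_arn, records, batch_size):
    """Start one child execution per batch and return the execution ARNs."""
    arns = []
    for batch in chunk(records, batch_size):
        response = sfn_client.start_execution(
            stateMachineArn=child_state_machine_arn,
            input=json.dumps({"records": batch}),  # hypothetical input shape
        )
        arns.append(response["executionArn"])
    return arns


def all_children_finished(sfn_client, execution_arns):
    """True once no child execution reports the RUNNING status."""
    return all(
        sfn_client.describe_execution(executionArn=arn)["status"] != "RUNNING"
        for arn in execution_arns
    )
```

In a real parent SF, sfn_client would be boto3.client("stepfunctions"), and the second function would sit behind a Wait plus Choice loop so the parent only exits when it returns true.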

A few things to keep in mind about this solution:

  1. The StartExecution API is throttled with a token bucket: a bucket size of 500 with 25 refills every second.
  2. Make sure the wait time in the parent SF is long enough for the child SFs to finish executing; otherwise, implement a loop that checks for the completion of the child SFs.
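To stay under the throttle from point 1, the parent can pace its own StartExecution calls. Below is a minimal client-side token bucket in Python, using the 500/25 figures quoted above as defaults. The class and method names are made up for this sketch, and the actual quota may differ for your account or region.

```python
import time


class TokenBucket:
    """Client-side token bucket mirroring the quoted StartExecution throttle."""

    def __init__(self, capacity=500, refill_per_second=25, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # the bucket starts full
        self.last = clock()

    def _refill(self):
        # Add the tokens earned since the last check, capped at the capacity.
        now = self.clock()
        earned = (now - self.last) * self.refill_per_second
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now

    def try_acquire(self):
        """Take one token if available; return False when the caller should wait."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def seconds_until_token(self):
        """Time to wait before the next token is expected to be available."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.refill_per_second)
```

Before each call that starts a child execution, the parent would loop on try_acquire() and sleep for seconds_until_token() whenever it returns False.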
Vishwanath gowda k answered Oct 10 '22 11:10
