
AWS step-function mapState iterate over large payloads

I have a state machine consisting of a first pre-process task that generates an array as output, which is used by a subsequent Map state to loop over. The output array of the first task has become too large, and the state machine throws the error States.DataLimitExceeded: The state/task 'arn:aws:lambda:XYZ' returned a result with a size exceeding the maximum number of characters service limit.

Here is an example of the state machine YAML:

stateMachines:
  myStateMachine:
    name: "myStateMachine"
    definition:
      StartAt: preProcess
      States:
        preProcess:
          Type: Task
          Resource:
            Fn::GetAtt: [preProcessLambda, Arn]
          Next: mapState
          ResultPath: "$.preProcessOutput"
        mapState:
          Type: Map
          ItemsPath: "$.preProcessOutput.data"
          MaxConcurrency: 100
          Iterator:
            StartAt: doMap
            States:
              doMap:
                Type: Task
                Resource:
                  Fn::GetAtt: [doMapLambda, Arn]
                End: true
          Next: ### next steps, not relevant

A possible solution I came up with would be for the preProcess state to save its output to an S3 bucket and for the mapState state to read directly from it. Is this possible? At the moment the output of preProcess is

ResultPath: "$.preProcessOutput"

and mapState takes the array at

ItemsPath: "$.preProcessOutput.data"

as its input.

How would I need to adapt the YAML so that the map state reads directly from S3?

benito_h asked Feb 20 '20

People also ask

What is the maximum state transition rate when using Express workflows with AWS Step Functions?

The new AWS Step Functions Express Workflows type uses fast, in-memory processing for high-event-rate workloads of up to 100,000 state transitions per second, for a total workflow duration of up to 5 minutes.

How many Step Functions can run at once?

As of September 2020, Step Functions is engineered for limits of 300 new executions per second in N. Virginia, Oregon, and Ireland, and 150 per second in all other regions.

How long can a step function run?

Step Functions has two workflow types. Standard workflows have exactly-once workflow execution and can run for up to one year. This means that each step in a Standard workflow will execute exactly-once. Express workflows, however, have at-least-once workflow execution and can run for up to five minutes.



3 Answers

I am currently solving a similar problem at work. Because a step function stores its entire state, your JSON can grow large enough to cause problems pretty quickly as the Map state iterates over all the values.

The only real way to solve this is to use hierarchies of step functions. That is, step functions that start other step functions. So you have:

parent -> [batch1, batch2, batch...N]

And then each batch has a number of single jobs:

batch1 -> [j1,j2,j3...jBATCHSIZE]

I had a pretty simple step function, and I found that ~4k was about the maximum batch size I could use before I started hitting state limits.
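As a rough illustration, here is a minimal sketch in Python (boto3) of the fan-out side: a Lambda that produces (or loads) the full item list, splits it into batches, and starts one child state-machine execution per batch, returning only the execution ARNs so the big array never enters the parent's state. CHILD_STATE_MACHINE_ARN, BATCH_SIZE and the build_items helper are assumptions made for this sketch, not part of the original setup.

import json
import os

import boto3

sfn = boto3.client("stepfunctions")

# Assumed configuration, not from the original question.
CHILD_STATE_MACHINE_ARN = os.environ["CHILD_STATE_MACHINE_ARN"]
BATCH_SIZE = 4000  # roughly the batch size mentioned above


def build_items(event):
    # Stand-in for the real pre-processing that produces the large array.
    return event.get("data", [])


def handler(event, context):
    items = build_items(event)
    executions = []
    for i in range(0, len(items), BATCH_SIZE):
        batch = items[i:i + BATCH_SIZE]
        resp = sfn.start_execution(
            stateMachineArn=CHILD_STATE_MACHINE_ARN,
            input=json.dumps({"data": batch}),
        )
        executions.append(resp["executionArn"])
    # Return only the execution ARNs, not the data itself,
    # so the parent execution's state stays small.
    return {"executions": executions}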

Not a pretty solution, but hey, it works.

Derrops answered Oct 07 '22


I don't think it is possible to read directly from S3 at this time. There are a few things you could try to do to get around this limitation. One is making your own iterator and not using Map State. Another is the following:

Have a lambda read your S3 file and chunk it by index or some id/key. The idea behind this step is to pass the Map state's iterator a WAY smaller payload. Say your data has the structure below.

[ { idx: 1, ...more keys }, {idx: 2, ...more keys }, { idx: 3, ...more keys }, ... 4,997 more objects of data ]

Say you want your iterator to process 1,000 rows at a time. Return the following tuples representing index ranges from your lambda instead: [ [ 0, 999 ], [ 1000, 1999 ], [ 2000, 2999 ], [ 3000, 3999 ], [ 4000, 4999 ] ]

Your Map State will receive this new data structure, and each iteration will be one of the tuples: iteration #1 gets [ 0, 999 ], iteration #2 gets [ 1000, 1999 ], and so on.
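A minimal sketch of that first chunking lambda, in Python with boto3, assuming the pre-process output has already been written to S3 as a JSON array. The bucket and key names and CHUNK_SIZE are placeholders for the sketch.

import json

import boto3

s3 = boto3.client("s3")

# Placeholder location of the pre-process output and chunk size.
BUCKET = "my-preprocess-bucket"
KEY = "pre-process-output.json"
CHUNK_SIZE = 1000


def handler(event, context):
    # Reading the whole file is fine here: it only lives in Lambda memory,
    # never in the state machine's state.
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    total = len(json.loads(body))

    # Build [start, end] index pairs, e.g. [[0, 999], [1000, 1999], ...]
    ranges = [[start, min(start + CHUNK_SIZE, total) - 1]
              for start in range(0, total, CHUNK_SIZE)]
    return {"data": ranges}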

Inside your iterator, call a lambda which uses the tuple indexes to query into your S3 file. AWS has a query language over S3 buckets called Amazon S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html

Here’s another great resource on how to use S3 select and get the data into a readable state with node: https://thetrevorharmon.com/blog/how-to-use-s3-select-to-query-json-in-node-js

So, for iteration #1, we are querying the first 1,000 objects in our data structure. I can now call whatever function I normally would have inside my iterator.
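Here is a rough sketch of the lambda inside the iterator, in Python with boto3, under the assumption that the pre-process output was stored in S3 as JSON Lines (one object per line) and that each object carries the idx field from the example above. The bucket and key are placeholders, and depending on how idx is stored you may need a CAST in the WHERE clause.

import json

import boto3

s3 = boto3.client("s3")

# Placeholder location; assumes one JSON object per line, each with an "idx" field.
BUCKET = "my-preprocess-bucket"
KEY = "pre-process-output.jsonl"


def handler(event, context):
    start, end = event  # one tuple from the Map state, e.g. [0, 999]

    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression=(
            "SELECT * FROM S3Object s "
            f"WHERE s.idx >= {int(start)} AND s.idx <= {int(end)}"
        ),
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )

    # The response payload is an event stream; collect the record chunks first,
    # because a single record can be split across chunk boundaries.
    payload = b""
    for s3_event in resp["Payload"]:
        if "Records" in s3_event:
            payload += s3_event["Records"]["Payload"]

    rows = [json.loads(line) for line in payload.decode("utf-8").splitlines() if line]

    # ... process the ~1,000 rows as the iterator normally would ...
    return {"processed": len(rows)}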

What's key about this approach is that the InputPath never receives a large data structure.

MAK answered Oct 07 '22


As of September 2020, the payload limit for Step Functions has been increased 8-fold, to 256 KB:

https://aws.amazon.com/about-aws/whats-new/2020/09/aws-step-functions-increases-payload-size-to-256kb/

Maybe it now fits within your requirements.

Gabriel Furstenheim answered Oct 07 '22