
AWS step-function mapState iterate over large payloads

I have a state machine consisting of a first pre-process task that generates an array as output, which is used by a subsequent Map state to loop over. The output array of the first task has become too large, and the state machine throws the error States.DataLimitExceeded: The state/task 'arn:aws:lambda:XYZ' returned a result with a size exceeding the maximum number of characters service limit.

Here is an example of the state machine YAML:

stateMachines:
  myStateMachine:
    name: "myStateMachine"
    definition:
      StartAt: preProcess
      States:
        preProcess:
          Type: Task
          Resource:
            Fn::GetAtt: [preProcessLambda, Arn]
          Next: mapState
          ResultPath: "$.preProcessOutput"
        mapState:
          Type: Map
          ItemsPath: "$.preProcessOutput.data"
          MaxConcurrency: 100
          Iterator:
            StartAt: doMap
            States:
              doMap:
                Type: Task
                Resource:
                  Fn::GetAtt: [doMapLambda, Arn]
                End: true
          Next: ### next steps, not relevant

A possible solution I came up with would be for the preProcess state to save its output to an S3 bucket and for the mapState state to read directly from it. Is this possible? At the moment the output of preProcess is

ResultPath: "$.preProcessOutput"

and mapState takes the array at

ItemsPath: "$.preProcessOutput.data"

as its input.

How would I need to adapt the YAML so that the map state reads directly from S3?

benito_h asked Feb 20 '20

People also ask

What is the maximum state transition rate when using Express workflows with AWS Step Functions?

The new AWS Step Functions Express Workflows type uses fast, in-memory processing for high-event-rate workloads of up to 100,000 state transitions per second, for a total workflow duration of up to 5 minutes.

How many Step Functions can run at once?

As of September 2020, Step Functions is engineered for limits of 300 new executions per second in N. Virginia, Oregon, and Ireland, and 150 per second in all other regions.

How long can a step function run?

Step Functions has two workflow types. Standard workflows have exactly-once workflow execution and can run for up to one year. This means that each step in a Standard workflow will execute exactly-once. Express workflows, however, have at-least-once workflow execution and can run for up to five minutes.



3 Answers

I am currently solving a similar problem at work. Because a step function stores its entire state, your JSON can grow large enough to cause problems pretty quickly as the Map state iterates over all the values.

The only real way to solve this is to use hierarchies of step functions. That is, step functions that start other step functions. So you have:

parent -> [batch1, batch2, batch...N]

And then each batch has a number of single jobs:

batch1 -> [j1,j2,j3...jBATCHSIZE]

I had a pretty simple step function, and I found that ~4k was about the maximum batch size I could use before I started hitting state limits.
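As a rough illustration, here is a minimal sketch in Python (boto3) of the fan-out side: a Lambda that produces (or loads) the full item list, splits it into batches, and starts one child state-machine execution per batch, returning only the execution ARNs so the big array never enters the parent's state. CHILD_STATE_MACHINE_ARN, BATCH_SIZE and the build_items helper are assumptions made for this sketch, not part of the original setup.

import json
import os

import boto3

sfn = boto3.client("stepfunctions")

# Assumed configuration, not from the original question.
CHILD_STATE_MACHINE_ARN = os.environ["CHILD_STATE_MACHINE_ARN"]
BATCH_SIZE = 4000  # roughly the batch size mentioned above


def build_items(event):
    # Stand-in for the real pre-processing that produces the large array.
    return event.get("data", [])


def handler(event, context):
    items = build_items(event)
    executions = []
    for i in range(0, len(items), BATCH_SIZE):
        batch = items[i:i + BATCH_SIZE]
        resp = sfn.start_execution(
            stateMachineArn=CHILD_STATE_MACHINE_ARN,
            input=json.dumps({"data": batch}),
        )
        executions.append(resp["executionArn"])
    # Return only the execution ARNs, not the data itself,
    # so the parent execution's state stays small.
    return {"executions": executions}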

Not a pretty solution, but hey, it works.

Derrops answered Oct 07 '22


I don't think it is possible to read directly from S3 at this time. There are a few things you could try to do to get around this limitation. One is making your own iterator and not using Map State. Another is the following:

Have a lambda read your S3 file and chunk it by index or some id/key. The idea behind this step is to pass the Map state's iterator a WAY smaller payload. Say your data has the structure below.

[ { idx: 1, ...more keys }, {idx: 2, ...more keys }, { idx: 3, ...more keys }, ... 4,997 more objects of data ]

Say you want your iterator to process 1,000 rows at a time. Return the following tuples representing index ranges from your lambda instead: [ [ 0, 999 ], [ 1000, 1999 ], [ 2000, 2999 ], [ 3000, 3999 ], [ 4000, 4999 ] ]

Your Map State will receive this new data structure, and each iteration will be one of the tuples: iteration #1 gets [ 0, 999 ], iteration #2 gets [ 1000, 1999 ], and so on.
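A minimal sketch of that first chunking lambda, in Python with boto3, assuming the pre-process output has already been written to S3 as a JSON array. The bucket and key names and CHUNK_SIZE are placeholders for the sketch.

import json

import boto3

s3 = boto3.client("s3")

# Placeholder location of the pre-process output and chunk size.
BUCKET = "my-preprocess-bucket"
KEY = "pre-process-output.json"
CHUNK_SIZE = 1000


def handler(event, context):
    # Reading the whole file is fine here: it only lives in Lambda memory,
    # never in the state machine's state.
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    total = len(json.loads(body))

    # Build [start, end] index pairs, e.g. [[0, 999], [1000, 1999], ...]
    ranges = [[start, min(start + CHUNK_SIZE, total) - 1]
              for start in range(0, total, CHUNK_SIZE)]
    return {"data": ranges}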

Inside your iterator, call a lambda which uses the tuple indexes to query into your S3 file. AWS has a query language over S3 buckets called Amazon S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html

Here’s another great resource on how to use S3 select and get the data into a readable state with node: https://thetrevorharmon.com/blog/how-to-use-s3-select-to-query-json-in-node-js

So, for iteration #1, we are querying the first 1,000 objects in our data structure. I can now call whatever function I normally would have inside my iterator.
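Here is a rough sketch of the lambda inside the iterator, in Python with boto3, under the assumption that the pre-process output was stored in S3 as JSON Lines (one object per line) and that each object carries the idx field from the example above. The bucket and key are placeholders, and depending on how idx is stored you may need a CAST in the WHERE clause.

import json

import boto3

s3 = boto3.client("s3")

# Placeholder location; assumes one JSON object per line, each with an "idx" field.
BUCKET = "my-preprocess-bucket"
KEY = "pre-process-output.jsonl"


def handler(event, context):
    start, end = event  # one tuple from the Map state, e.g. [0, 999]

    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression=(
            "SELECT * FROM S3Object s "
            f"WHERE s.idx >= {int(start)} AND s.idx <= {int(end)}"
        ),
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )

    # The response payload is an event stream; collect the record chunks first,
    # because a single record can be split across chunk boundaries.
    payload = b""
    for s3_event in resp["Payload"]:
        if "Records" in s3_event:
            payload += s3_event["Records"]["Payload"]

    rows = [json.loads(line) for line in payload.decode("utf-8").splitlines() if line]

    # ... process the ~1,000 rows as the iterator normally would ...
    return {"processed": len(rows)}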

What's key about this approach is that the InputPath never receives a large data structure.

MAK answered Oct 07 '22


As of September 2020, the payload limit for Step Functions has been increased 8-fold, to 256 KB:

https://aws.amazon.com/about-aws/whats-new/2020/09/aws-step-functions-increases-payload-size-to-256kb/

Maybe it now fits within your requirements.

Gabriel Furstenheim answered Oct 07 '22