 

AWS Step Functions with batch processing limitations

Scenario: A bunch of records (say 10k, maybe more) of small size (an average of 50 bytes each) must be processed. The processing must be done in parallel, or in any other way that improves performance (remember, we have a lot of records to go through). Also, the processing itself is a very simple task (that's one of the reasons for using AWS Lambda). Despite its simplicity, some processing may end before or after others; that's another reason why those records are independent of each other and the order of processing does not matter.

So far, Step Functions looks like the way to go.

With Step Functions, we can have the following graph:

[Image: Step Functions graph, RecordsRetrieval followed by a Parallel state with ProcessRecords-Task-1, ProcessRecords-Task-2 and ProcessRecords-Task-3]

I can define RecordsRetrieval as one task. After that, those records will be processed in parallel by the tasks ProcessRecords-Task-1, ProcessRecords-Task-2 and ProcessRecords-Task-3. By the looks of it, all fine and dandy, right? Wrong!

First Problem: Dynamic Scaling
If I want dynamic scaling of those tasks (let's say... 10, 100, 5k or 10k), taking into consideration the amount of records to be processed, I would have to dynamically build the state machine JSON to achieve that (not a very elegant solution, but it might work; see the sketch below). I am very confident that the number of tasks has a limit, so I cannot rely on that. It would be way better if the scaling heavy lifting were handled by the infrastructure and not by me.
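
A minimal sketch of what that dynamic JSON building could look like in Node.js, just for illustration (the function and parameter names are made up, the ARNs would be passed in). The branch count ends up baked into the definition, which is exactly the heavy lifting I don't want to own:

// Sketch only: build a Step Functions definition (Amazon States Language)
// with one Parallel branch per worker. Function/parameter names are made up.
function buildDefinition(recordsRetrievalArn, processRecordsArn, branchCount) {
    var branches = [];
    for (var i = 1; i <= branchCount; i++) {
        var taskName = 'ProcessRecords-Task-' + i;
        var states = {};
        states[taskName] = {
            Type: 'Task',
            Resource: processRecordsArn,
            End: true
        };
        branches.push({ StartAt: taskName, States: states });
    }
    return {
        StartAt: 'RecordsRetrieval',
        States: {
            RecordsRetrieval: {
                Type: 'Task',
                Resource: recordsRetrievalArn,
                Next: 'ProcessRecords'
            },
            ProcessRecords: {
                Type: 'Parallel',
                Branches: branches,
                End: true
            }
        }
    };
}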

Either way, for a well-defined set of parallel tasks like GetAddress, GetPhoneNumber, GetWhatever... it's great! Works like a charm!

Second Problem: Payload Dispatch
After the RecordsRetrieval task, I need each one of those records to be processed individually. With Step Functions I did not see any way of accomplishing that. Once the RecordsRetrieval task passes along its payload (in this case those records), all the parallel tasks will be handling the same payload (see the excerpt below).
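
To make it concrete: every branch of a Parallel state receives the same input. The closest thing to per-record dispatch I found is filtering that shared payload per branch with InputPath, which again hard-codes the record indices into the definition (the ARN below is a placeholder):

{
    "Type": "Parallel",
    "Branches": [
        {
            "StartAt": "ProcessRecords-Task-1",
            "States": {
                "ProcessRecords-Task-1": {
                    "Type": "Task",
                    "Resource": "<ProcessRecords Lambda ARN>",
                    "InputPath": "$.records[0]",
                    "End": true
                }
            }
        },
        {
            "StartAt": "ProcessRecords-Task-2",
            "States": {
                "ProcessRecords-Task-2": {
                    "Type": "Task",
                    "Resource": "<ProcessRecords Lambda ARN>",
                    "InputPath": "$.records[1]",
                    "End": true
                }
            }
        }
    ],
    "End": true
}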

Again, just like I said in the first problem, for a well-defined set of parallel tasks it will be a perfect fit.

Conclusion
I think that, probably, AWS Step Functions is not the solution for my scenario. This is a summary of my knowledge about it, so feel free to comment if I missed something.

I am sticking with the microservice approach for many reasons (scalability, serverless, simplicity and so forth).

I know that it is possible to retrieve those records and send them one by one to another Lambda (roughly as in the sketch below), but again, not a very elegant solution.
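
For completeness, that one-by-one fan-out would look roughly like this with the AWS SDK for JavaScript (the worker function name is a placeholder), i.e. thousands of Invoke calls issued from the retrieval Lambda:

// Sketch only: asynchronously invoke a worker Lambda once per record.
// 'ProcessRecord' is a placeholder function name.
var AWS = require('aws-sdk');
var lambda = new AWS.Lambda();

function fanOut(records) {
    return Promise.all(records.map(function (record) {
        return lambda.invoke({
            FunctionName: 'ProcessRecord',
            InvocationType: 'Event', // fire-and-forget; do not wait for the result
            Payload: JSON.stringify(record)
        }).promise();
    }));
}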

I also know that this is a batch job and AWS has the Batch service. What I am trying to do is keep the microservice approach without depending on AWS Batch/EC2.

What are your thoughts about it? Feel free to comment. Any suggestions will be appreciated.

Asked Feb 10 '18 by Juan


People also ask

What is the maximum state transition rate when using Express workflows with AWS Step Functions?

The new AWS Step Functions Express Workflows type uses fast, in-memory processing for high-event-rate workloads of up to 100,000 state transitions per second, for a total workflow duration of up to 5 minutes.

How many Step Functions can run at once?

As of September 2020, per AWS's service quotas, Step Functions is engineered for limits of 300 new executions per second in N. Virginia, Oregon, and Ireland, and 150 per second in all other regions.

Can we use AWS Lambda for batch processing?

AWS Lambda can process batches of messages from sources like Amazon Kinesis Data Streams or Amazon DynamoDB Streams. In normal operation, the processing function moves from one batch to the next to consume messages from the stream.
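
For example, a Node.js handler attached to a Kinesis event source receives a whole batch per invocation in event.Records (processRecord below is a placeholder for the actual per-record logic):

// Rough sketch of a Lambda handler for a Kinesis event source:
// the service delivers records in batches, one event per batch.
exports.handler = async function (event) {
    for (const record of event.Records) {
        // Kinesis record data arrives base64-encoded
        const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
        processRecord(payload); // placeholder for the per-record processing
    }
};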

How long can an AWS step function run?

For long-running queries requiring multi-step processing, use Step Functions to orchestrate the tasks with Asynchronous Express Workflows, which can run for up to five minutes.




1 Answer

Given your inputs, the following solution can work in line with your criteria. You can use either AWS Lambda or AWS Batch for it.

// Split the records into fixed-size batches and fan out one
// Lambda invocation / Batch job per batch.
var BATCH_RECORD_SIZE = 100;
var totalRecords = getTotalCountOfRecords();
var noOfBatchInvocation = Math.ceil(totalRecords / BATCH_RECORD_SIZE);
var start = 0;
for (var i = 0; i < noOfBatchInvocation; i++) {
    // invoke lambda
    invokeLambda(start, BATCH_RECORD_SIZE);
    // OR submit a batch job
    submitJobWith(start, BATCH_RECORD_SIZE);
    // move the offset to the next batch
    start += BATCH_RECORD_SIZE;
}
  • Define a dispatcher Lambda whose only task is to get the number of records and split the work, as in the snippet above. This Lambda can be triggered by an S3 event, a scheduled event, or in whatever way suits you. Here you define how many records are processed per Lambda invocation / batch job. This Lambda will invoke a worker Lambda or submit a batch job (total records) / (records per job/invocation) times.
  • If you prefer Lambda, define the worker Lambda so that it takes two parameters, start and limit, as input. These parameters decide where to start reading the records to be processed and where to stop (see the sketch after this list). The worker Lambda also knows where to read the records from.
  • If you prefer Batch, define the job definition with the same logic as above.
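
A sketch of such a worker Lambda, assuming the records sit one per line in an S3 object (bucket, key, and processRecord are placeholders):

// Sketch only: worker Lambda that processes one slice of the record set.
// The dispatcher passes { start, limit } in the invocation payload.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

exports.handler = async function (event) {
    var start = event.start;
    var limit = event.limit;

    // Placeholder source: one record per line in an S3 object.
    var object = await s3.getObject({
        Bucket: 'my-records-bucket', // placeholder bucket
        Key: 'records.txt'           // placeholder key
    }).promise();

    var records = object.Body.toString('utf8').split('\n');
    var slice = records.slice(start, start + limit);

    slice.forEach(function (record) {
        processRecord(record); // placeholder for the simple per-record task
    });

    return { processed: slice.length, start: start };
};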

You can use AWS Lambda since your record processing is not compute/memory intensive. If it were, I would suggest using AWS Batch for this processing instead.
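
If you go the Batch route, the submitJobWith(start, BATCH_RECORD_SIZE) call from the snippet above could be implemented roughly like this, passing the offsets to the container through environment variables (job name, queue, and definition are placeholders):

// Sketch only: submit one AWS Batch job per slice of records.
var AWS = require('aws-sdk');
var batch = new AWS.Batch();

function submitJobWith(start, limit) {
    return batch.submitJob({
        jobName: 'process-records-' + start,  // placeholder naming scheme
        jobQueue: 'records-processing-queue', // placeholder queue
        jobDefinition: 'process-records-job', // placeholder job definition
        containerOverrides: {
            environment: [
                { name: 'START', value: String(start) },
                { name: 'LIMIT', value: String(limit) }
            ]
        }
    }).promise();
}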

Answered Sep 29 '22 by Rishikesh Darandale