 

AWS Step Functions with batch processing limitations

Scenario: A bunch of records (say 10k, maybe more) of small size (an average of 50 bytes each) must be processed. The processing must be done in parallel, or in any other way that improves performance (remember, we have a lot of records to go through). Also, the processing itself is a very simple task (that's one of the reasons for using AWS Lambda). Despite its simplicity, some processing may end before or after others; that's another reason why those records are independent of each other and the order of processing does not matter.

So far, Step Functions looks like the way to go.

With Step Functions, we can have the following graph:

[Image: Step Functions graph, RecordsRetrieval followed by a Parallel state with ProcessRecords-Task-1, ProcessRecords-Task-2 and ProcessRecords-Task-3]

I can define RecordsRetrieval as one task. After that, those records will be processed in parallel by the tasks ProcessRecords-Task-1, ProcessRecords-Task-2 and ProcessRecords-Task-3. By the looks of it, all fine and dandy, right? Wrong!

First Problem: Dynamic Scaling
If I want dynamic scaling of those tasks (let's say... 10, 100, 5k or 10k), taking into consideration the amount of records to be processed, I would have to dynamically build the state machine JSON to achieve that (not a very elegant solution, but it might work; see the sketch below). I am very confident that the number of tasks has a limit, so I cannot rely on that. It would be way better if the scaling heavy lifting were handled by the infrastructure and not by me.
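
A minimal sketch of what that dynamic JSON building could look like in Node.js, just for illustration (the function and parameter names are made up, the ARNs would be passed in). The branch count ends up baked into the definition, which is exactly the heavy lifting I don't want to own:

// Sketch only: build a Step Functions definition (Amazon States Language)
// with one Parallel branch per worker. Function/parameter names are made up.
function buildDefinition(recordsRetrievalArn, processRecordsArn, branchCount) {
    var branches = [];
    for (var i = 1; i <= branchCount; i++) {
        var taskName = 'ProcessRecords-Task-' + i;
        var states = {};
        states[taskName] = {
            Type: 'Task',
            Resource: processRecordsArn,
            End: true
        };
        branches.push({ StartAt: taskName, States: states });
    }
    return {
        StartAt: 'RecordsRetrieval',
        States: {
            RecordsRetrieval: {
                Type: 'Task',
                Resource: recordsRetrievalArn,
                Next: 'ProcessRecords'
            },
            ProcessRecords: {
                Type: 'Parallel',
                Branches: branches,
                End: true
            }
        }
    };
}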

Either way, for a well-defined set of parallel tasks like GetAddress, GetPhoneNumber, GetWhatever... it's great! Works like a charm!

Second Problem: Payload Dispatch
After the RecordsRetrieval task, I need each one of those records to be processed individually. With Step Functions I did not see any way of accomplishing that. Once the RecordsRetrieval task passes along its payload (in this case those records), all the parallel tasks will be handling the same payload (see the excerpt below).
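
To make it concrete: every branch of a Parallel state receives the same input. The closest thing to per-record dispatch I found is filtering that shared payload per branch with InputPath, which again hard-codes the record indices into the definition (the ARN below is a placeholder):

{
    "Type": "Parallel",
    "Branches": [
        {
            "StartAt": "ProcessRecords-Task-1",
            "States": {
                "ProcessRecords-Task-1": {
                    "Type": "Task",
                    "Resource": "<ProcessRecords Lambda ARN>",
                    "InputPath": "$.records[0]",
                    "End": true
                }
            }
        },
        {
            "StartAt": "ProcessRecords-Task-2",
            "States": {
                "ProcessRecords-Task-2": {
                    "Type": "Task",
                    "Resource": "<ProcessRecords Lambda ARN>",
                    "InputPath": "$.records[1]",
                    "End": true
                }
            }
        }
    ],
    "End": true
}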

Again, just like I said in the first problem, for a well-defined set of parallel tasks it will be a perfect fit.

Conclusion
I think that, probably, AWS Step Functions is not the solution for my scenario. This is a summary of my knowledge about it, so feel free to comment if I missed something.

I am sticking with the microservice approach for many reasons (scalability, serverless, simplicity and so forth).

I know that it is possible to retrieve those records and send them one by one to another Lambda (roughly as in the sketch below), but again, not a very elegant solution.
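
For completeness, that one-by-one fan-out would look roughly like this with the AWS SDK for JavaScript (the worker function name is a placeholder), i.e. thousands of Invoke calls issued from the retrieval Lambda:

// Sketch only: asynchronously invoke a worker Lambda once per record.
// 'ProcessRecord' is a placeholder function name.
var AWS = require('aws-sdk');
var lambda = new AWS.Lambda();

function fanOut(records) {
    return Promise.all(records.map(function (record) {
        return lambda.invoke({
            FunctionName: 'ProcessRecord',
            InvocationType: 'Event', // fire-and-forget; do not wait for the result
            Payload: JSON.stringify(record)
        }).promise();
    }));
}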

I also know that this is a batch job and AWS has the Batch service. What I am trying to do is keep the microservice approach without depending on AWS Batch/EC2.

What are your thoughts about it? Feel free to comment. Any suggestions will be appreciated.

Asked Feb 10 '18 by Juan


People also ask

What is the maximum state transition rate when using Express workflows with AWS Step Functions?

The new AWS Step Functions Express Workflows type uses fast, in-memory processing for high-event-rate workloads of up to 100,000 state transitions per second, for a total workflow duration of up to 5 minutes.

How many Step Functions can run at once?

As of September 2020, per AWS's service quotas, Step Functions is engineered for limits of 300 new executions per second in N. Virginia, Oregon, and Ireland, and 150 per second in all other regions.

Can we use AWS Lambda for batch processing?

AWS Lambda can process batches of messages from sources like Amazon Kinesis Data Streams or Amazon DynamoDB Streams. In normal operation, the processing function moves from one batch to the next to consume messages from the stream.
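
For example, a Node.js handler attached to a Kinesis event source receives a whole batch per invocation in event.Records (processRecord below is a placeholder for the actual per-record logic):

// Rough sketch of a Lambda handler for a Kinesis event source:
// the service delivers records in batches, one event per batch.
exports.handler = async function (event) {
    for (const record of event.Records) {
        // Kinesis record data arrives base64-encoded
        const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
        processRecord(payload); // placeholder for the per-record processing
    }
};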

How long can an AWS step function run?

For long-running queries requiring multi-step processing, use Step Functions to orchestrate the tasks with Asynchronous Express Workflows, which can run for up to five minutes.




1 Answer

Given your inputs, the following solution can work in line with your criteria. You can use either AWS Lambda or AWS Batch for it.

// Split the records into fixed-size batches and fan out one
// Lambda invocation / Batch job per batch.
var BATCH_RECORD_SIZE = 100;
var totalRecords = getTotalCountOfRecords();
var noOfBatchInvocation = Math.ceil(totalRecords / BATCH_RECORD_SIZE);
var start = 0;
for (var i = 0; i < noOfBatchInvocation; i++) {
    // invoke lambda
    invokeLambda(start, BATCH_RECORD_SIZE);
    // OR submit a batch job
    submitJobWith(start, BATCH_RECORD_SIZE);
    // move the offset to the next batch
    start += BATCH_RECORD_SIZE;
}
  • Define a dispatcher Lambda whose only task is to get the number of records and split the work, as in the snippet above. This Lambda can be triggered by an S3 event, a scheduled event, or in whatever way suits you. Here you define how many records are processed per Lambda invocation / batch job. This Lambda will invoke a worker Lambda or submit a batch job (total records) / (records per job/invocation) times.
  • If you prefer Lambda, define the worker Lambda so that it takes two parameters, start and limit, as input. These parameters decide where to start reading the records to be processed and where to stop (see the sketch after this list). The worker Lambda also knows where to read the records from.
  • If you prefer Batch, define the job definition with the same logic as above.
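
A sketch of such a worker Lambda, assuming the records sit one per line in an S3 object (bucket, key, and processRecord are placeholders):

// Sketch only: worker Lambda that processes one slice of the record set.
// The dispatcher passes { start, limit } in the invocation payload.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

exports.handler = async function (event) {
    var start = event.start;
    var limit = event.limit;

    // Placeholder source: one record per line in an S3 object.
    var object = await s3.getObject({
        Bucket: 'my-records-bucket', // placeholder bucket
        Key: 'records.txt'           // placeholder key
    }).promise();

    var records = object.Body.toString('utf8').split('\n');
    var slice = records.slice(start, start + limit);

    slice.forEach(function (record) {
        processRecord(record); // placeholder for the simple per-record task
    });

    return { processed: slice.length, start: start };
};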

You can use AWS Lambda since your record processing is not compute/memory intensive. If it were, I would suggest using AWS Batch for this processing instead.
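
If you go the Batch route, the submitJobWith(start, BATCH_RECORD_SIZE) call from the snippet above could be implemented roughly like this, passing the offsets to the container through environment variables (job name, queue, and definition are placeholders):

// Sketch only: submit one AWS Batch job per slice of records.
var AWS = require('aws-sdk');
var batch = new AWS.Batch();

function submitJobWith(start, limit) {
    return batch.submitJob({
        jobName: 'process-records-' + start,  // placeholder naming scheme
        jobQueue: 'records-processing-queue', // placeholder queue
        jobDefinition: 'process-records-job', // placeholder job definition
        containerOverrides: {
            environment: [
                { name: 'START', value: String(start) },
                { name: 'LIMIT', value: String(limit) }
            ]
        }
    }).promise();
}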

Answered Sep 29 '22 by Rishikesh Darandale