Scenario: A bunch of records (around 10k, maybe more) of small size (50 bytes each on average) must be processed. The processing must be done in parallel, or in any other way that improves performance (remember, we have a lot of records to go through). Also, the processing itself is a very simple task (that's one of the reasons for using AWS Lambda). Despite its simplicity, some records may finish processing before or after others; the records are independent of each other, so the order of processing does not matter.
So far, Step Functions looks like the way to go.
With Step Functions, we can have the following graph:
I can define RecordsRetrieval as one task. After that, the records will be processed in parallel by the tasks ProcessRecords-Task-1, ProcessRecords-Task-2 and ProcessRecords-Task-3. By the looks of it, all fine and dandy, right? Wrong!
First Problem: Dynamic Scaling
If I want to scale those tasks dynamically (let's say 10, 100, 5k or 10k) according to the number of records to be processed, I would have to build the JSON dynamically to achieve that (not a very elegant solution, but it might work). I am fairly confident that the number of tasks has a limit, so I cannot rely on that. It would be much better if the scaling heavy lifting were handled by the infrastructure and not by me.
Either way, for a well-defined set of parallel tasks like GetAddress, GetPhoneNumber, GetWhatever... it is great! Works like a charm!
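To make "building the JSON dynamically" concrete, here is a minimal sketch that generates a state machine definition with a variable number of parallel branches. The ARNs and the function name buildDefinition are placeholders made up for illustration; only the overall shape (a Task state followed by a Parallel state with a Branches array) comes from the Amazon States Language.

```javascript
// Hypothetical placeholder ARNs; substitute the ARNs of your own Lambda functions.
const RETRIEVAL_ARN =
  "arn:aws:lambda:us-east-1:123456789012:function:RecordsRetrieval";
const PROCESS_RECORDS_ARN =
  "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecords";

// Build a definition whose Parallel state has `taskCount` branches,
// named ProcessRecords-Task-1 .. ProcessRecords-Task-N as in the question.
function buildDefinition(taskCount) {
  const branches = [];
  for (let i = 1; i <= taskCount; i++) {
    const name = `ProcessRecords-Task-${i}`;
    branches.push({
      StartAt: name,
      States: {
        [name]: { Type: "Task", Resource: PROCESS_RECORDS_ARN, End: true },
      },
    });
  }
  return {
    StartAt: "RecordsRetrieval",
    States: {
      RecordsRetrieval: {
        Type: "Task",
        Resource: RETRIEVAL_ARN,
        Next: "ProcessRecords",
      },
      ProcessRecords: { Type: "Parallel", Branches: branches, End: true },
    },
  };
}

// Print the generated definition for three branches.
console.log(JSON.stringify(buildDefinition(3), null, 2));
```

This only shows the generation step. The definition still has to be created or updated every time the branch count changes, which is the inelegant part.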
Second Problem: Payload Dispatch
After the RecordsRetrieval task, I need each one of those records to be processed individually. With Step Functions I did not see any way of accomplishing that. Once the RecordsRetrieval task passes along its payload (in this case those records), all the parallel tasks will be handling the same payload.
Again, just like I said in the first problem, for a well-defined set of parallel tasks it would be a perfect fit.
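One possible workaround, sketched under assumptions: every branch of a Parallel state does receive the same input, but a Task state can narrow that input with InputPath. If RecordsRetrieval returned an object shaped like { "records": [...] } (an assumption, not something stated above), branch i could pick records[i]. The ARN and the function names are placeholders, and this still requires generating one branch per record.

```javascript
// Hypothetical placeholder ARN; substitute your own function's ARN.
const PROCESS_RECORDS_ARN =
  "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecords";

// Build one branch whose task sees only the record at `index`.
// Assumes the Parallel state's input looks like { "records": [ ... ] }.
function buildBranch(index) {
  const name = `ProcessRecords-Task-${index + 1}`;
  return {
    StartAt: name,
    States: {
      [name]: {
        Type: "Task",
        Resource: PROCESS_RECORDS_ARN,
        // InputPath selects the part of the shared input passed to this task.
        InputPath: `$.records[${index}]`,
        End: true,
      },
    },
  };
}

// Build a Parallel state with one branch per record.
function buildDispatchState(recordCount) {
  const branches = Array.from({ length: recordCount }, (_, i) => buildBranch(i));
  return { Type: "Parallel", Branches: branches, End: true };
}

console.log(JSON.stringify(buildDispatchState(2), null, 2));
```

Since the branch count is baked into the definition, this does not solve the first problem; it only shows that per-branch payloads are expressible.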
Conclusion
I think that AWS Step Functions is probably not the solution for my scenario. This is a summary of what I know about it, so feel free to comment if I missed something.
I am going with the microservice approach for many reasons (scalability, serverless, simplicity and so forth).
I know that it is possible to retrieve those records and send them one by one to another Lambda, but again, that is not a very elegant solution.
I also know that this is a batch job and that AWS has the Batch service. What I am trying to do is keep the microservice approach without depending on AWS Batch/EC2.
What are your thoughts about it? Feel free to comment. Any suggestions will be appreciated.
The new AWS Step Functions Express Workflows type uses fast, in-memory processing for high-event-rate workloads of up to 100,000 state transitions per second, for a total workflow duration of up to 5 minutes.
Step Functions Limits, September 2020
Time to check AWS's service quotas. Step Functions is engineered for limits of 300 new executions per second in N. Virginia, Oregon, and Ireland and 150 per second in all other regions.
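A rough back-of-the-envelope check of what those start rates mean for the 10k-record scenario. This sketch treats the quota as a steady rate and ignores any burst allowance, so the numbers are only lower bounds on the time needed to start the executions; the function name is made up for illustration.

```javascript
// Estimate how many executions are needed and the minimum time to start them,
// assuming a steady start rate (burst capacity is deliberately ignored).
function minStartSeconds(recordCount, recordsPerExecution, startsPerSecond) {
  const executions = Math.ceil(recordCount / recordsPerExecution);
  return { executions, seconds: executions / startsPerSecond };
}

// One execution per record at 300 starts/s: 10000 executions, about 33 seconds.
console.log(minStartSeconds(10000, 1, 300));

// 100 records per execution at 150 starts/s: 100 executions, under one second.
console.log(minStartSeconds(10000, 100, 150));
```

Grouping records into batches keeps the number of executions well under the start-rate quota.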
AWS Lambda can process batches of messages from sources like Amazon Kinesis Data Streams or Amazon DynamoDB Streams. In normal operation, the processing function moves from one batch to the next to consume messages from the stream.
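As an illustration of batch consumption, here is a minimal sketch of a handler for a Kinesis Data Streams event source. Kinesis delivers each record's payload base64-encoded under record.kinesis.data; the processRecord function and the sample data are hypothetical stand-ins for the real per-record work.

```javascript
// Hypothetical per-record logic; replace with the real processing.
function processRecord(text) {
  return text.toUpperCase();
}

// Decode and process every record in one Kinesis batch.
function processBatch(event) {
  return event.Records.map((record) => {
    // Kinesis payloads arrive base64-encoded.
    const text = Buffer.from(record.kinesis.data, "base64").toString("utf8");
    return processRecord(text);
  });
}

// Lambda entry point; when deploying, export it (e.g. exports.handler = handler).
async function handler(event) {
  const results = processBatch(event);
  return { processed: results.length };
}

// Local usage with a fake two-record batch.
const fakeEvent = {
  Records: ["alice", "bob"].map((s) => ({
    kinesis: { data: Buffer.from(s, "utf8").toString("base64") },
  })),
};
console.log(processBatch(fakeEvent));
```

With this model the batch size is configured on the event source mapping, so the fan-out is handled by the infrastructure rather than by hand-written dispatch code.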
For long-running queries requiring multi-step processing, utilize Step Functions to orchestrate the tasks by using Asynchronous Express Workflows. They can also run for up to five minutes.
Given your inputs, I think the following solution fits your criteria. You can use either AWS Lambda or AWS Batch for it.
// getTotalCountOfRecords, invokeLambda and submitJobWith are placeholders for your own code.
const BATCH_RECORD_SIZE = 100;
// Fetch the count once instead of on every use.
const totalRecords = getTotalCountOfRecords();
// Round up so that a final partial batch is still dispatched.
const noOfBatchInvocation = Math.ceil(totalRecords / BATCH_RECORD_SIZE);
let start = 0;
for (let i = 0; i < noOfBatchInvocation; i++) {
  // Invoke a Lambda / submit a job for the records starting at offset `start`.
  invokeLambda(start, BATCH_RECORD_SIZE);
  // OR
  submitJobWith(start, BATCH_RECORD_SIZE);
  // Move on to the next batch.
  start += BATCH_RECORD_SIZE;
}
You can use AWS Lambda since your record processing is not compute/memory intensive. If it were, I would suggest using AWS Batch for this processing instead.
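A slightly more complete sketch of the same batching idea, with the range calculation separated from the dispatch so it can be checked locally. The names batchRanges and dispatchAll are made up here, and dispatch is a hypothetical async function you would supply (for example a wrapper around a Lambda invocation or a Batch job submission).

```javascript
// Split `totalRecords` into consecutive ranges of at most `batchSize` records.
// The last range is clipped so it never points past the end of the data.
function batchRanges(totalRecords, batchSize) {
  const ranges = [];
  for (let start = 0; start < totalRecords; start += batchSize) {
    ranges.push({ start, count: Math.min(batchSize, totalRecords - start) });
  }
  return ranges;
}

// Dispatch every range concurrently and wait for all of them to finish.
// `dispatch(start, count)` is supplied by the caller and must return a promise.
async function dispatchAll(totalRecords, batchSize, dispatch) {
  const ranges = batchRanges(totalRecords, batchSize);
  return Promise.all(ranges.map((r) => dispatch(r.start, r.count)));
}

// Local usage with a fake dispatcher that just echoes its range.
dispatchAll(250, 100, async (start, count) => `${start}+${count}`).then(
  (results) => console.log(results) // [ '0+100', '100+100', '200+50' ]
);
```

Clipping the final range matters when the record count is not a multiple of the batch size: with 250 records and a batch size of 100, the third batch covers 50 records rather than 100.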