I'm trying to figure out an architecture for processing rather big files (maybe a few hundred MB) on serverless AWS. This is what I've got so far:
API Gateway -> S3 -> Lambda function -> SNS -> Lambda function
In this scenario, the text file is uploaded to S3 through API Gateway. Then a Lambda function is invoked by the event notification generated on S3. This Lambda function opens the text file and reads it line by line, generating tasks to be done as messages in an SNS topic. Each message will invoke a separate Lambda function to process the task.
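For reference, here is roughly what I imagine the first Lambda doing (a rough sketch in Python with boto3; the topic ARN is a placeholder, and the bucket/key come from the S3 event):

import urllib.parse
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

TASKS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:tasks"  # placeholder

def handler(event, context):
    # The S3 event notification tells us which object was uploaded.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])  # keys arrive URL-encoded

    # Stream the object and publish one SNS message per line.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    for line in body.iter_lines():
        sns.publish(TopicArn=TASKS_TOPIC_ARN, Message=line.decode("utf-8"))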
My only concern is the first Lambda function call. What if it times out? How can I make sure that it's not a point of failure?
You can ask S3 to return only a particular byte range of a given object, using the Range header: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html
for example:
Range: bytes=0-9
would return only the first 10 bytes of the S3 object.
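With boto3, that maps onto the Range parameter of get_object (a minimal sketch; the bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

# Fetch only the first 10 bytes of the object.
resp = s3.get_object(Bucket="my-bucket", Key="big-file.txt", Range="bytes=0-9")
chunk = resp["Body"].read()  # at most 10 bytes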
To read a file line by line, you would have to decide on a specific chunk size (1 MB, for example), read one chunk of the file at a time, and split the chunk into lines by looking for newline characters. Since a chunk will usually end in the middle of a line, the trailing partial line has to be carried over to the next read. Once the whole chunk has been processed, you can re-invoke the Lambda and pass the chunk pointer as a parameter. The new invocation of the Lambda will read the file starting from the chunk pointer given as a parameter.
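Putting it together, a sketch of that pattern could look like the following. The function name, topic ARN, and event shape are assumptions for illustration, and the file is assumed to use a single-byte text encoding so chunk boundaries never split a character:

import json
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
lambda_client = boto3.client("lambda")

CHUNK_SIZE = 1024 * 1024                            # 1 MB per invocation
TASKS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:tasks"  # placeholder
SELF_FUNCTION_NAME = "file-splitter"                # placeholder: this function's own name

def handler(event, context):
    bucket, key = event["bucket"], event["key"]
    offset = event.get("offset", 0)
    leftover = event.get("leftover", "")  # partial line carried over from the previous chunk

    # Read only this invocation's slice of the object.
    resp = s3.get_object(Bucket=bucket, Key=key,
                         Range=f"bytes={offset}-{offset + CHUNK_SIZE - 1}")
    chunk = leftover + resp["Body"].read().decode("utf-8")

    lines = chunk.split("\n")
    leftover = lines.pop()  # the last element may be an incomplete line

    for line in lines:
        if line:
            sns.publish(TopicArn=TASKS_TOPIC_ARN, Message=line)

    # ContentRange looks like "bytes 0-1048575/234567890"; the total size is after "/".
    total_size = int(resp["ContentRange"].split("/")[-1])
    next_offset = offset + CHUNK_SIZE
    if next_offset < total_size:
        # Re-invoke this same function asynchronously with the new chunk pointer,
        # so each invocation stays well under the Lambda timeout.
        lambda_client.invoke(
            FunctionName=SELF_FUNCTION_NAME,
            InvocationType="Event",
            Payload=json.dumps({"bucket": bucket, "key": key,
                                "offset": next_offset, "leftover": leftover}),
        )
    elif leftover:
        sns.publish(TopicArn=TASKS_TOPIC_ARN, Message=leftover)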