I am trying to figure out the logic for processing multiple files in S3 at once as files are added randomly. For discussion's sake, here's an example:
- Files are added to the S3 bucket at random, either in bursts or at irregular intervals
- A Lambda function is triggered once 9 files are in the S3 bucket; the Lambda post-processes or combines these files together.
- Once processed, the files will be moved to another bucket or deleted.
Here's what I've tried:
- I have S3 triggers for all S3 PUTs
- In my Lambda function I ignore the triggering filename itself and list the S3 bucket under the key to count how many files exist (sketched after this list)
- The problem is that when traffic is bursty, or arrives steadily but at a rapid pace, it is difficult to identify unique groups of 9 files
- I have UUID prefixes on file names for performance reasons, so sequential filenames don't exist.
- I've considered writing metadata to a NoSQL DB but haven't gone down that route yet.
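For context, a minimal sketch of that counting approach (Python/boto3; the bucket name and the `process_group` placeholder are hypothetical), which also shows where the race under bursty traffic comes from:

```python
import boto3

s3 = boto3.client("s3")


def process_group(keys):
    ...  # placeholder for the combine/post-process step


def handler(event, context):
    # Ignore which file triggered us; just count what is in the bucket.
    # "my-bucket" is a stand-in for the real bucket name.
    listing = s3.list_objects_v2(Bucket="my-bucket")
    keys = [obj["Key"] for obj in listing.get("Contents", [])]
    # The race: two near-simultaneous PUTs can both see >= 9 keys here
    # and each process the same group of files.
    if len(keys) >= 9:
        process_group(keys[:9])
```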
One possible solution is to use a scheduled Lambda (run as frequently or infrequently as your traffic warrants) that pulls events from an SQS queue populated by S3 PUT events. This assumes you want to batch-process n files at a time and that order does not matter (given the UUID naming).
Creating this workflow would look something like this:
1. Create an SQS queue for holding the S3 PUT events.
2. Add a trigger to the S3 bucket on PUTs that creates an event in the queue from step 1.
3. Create the Lambda with environment variables for the bucket and queue (a handler sketch follows this list):
   1. The Lambda should check the queue for any in-flight messages; the message contents are ignored and only the bucket is used.
   2. If there are any, stop the run (to prevent a file from being processed multiple times).
   3. If there are no in-flight messages, list objects from S3 with a limit of n (your batch size).
   4. Run your processing logic if enough objects are returned (could be fewer than n).
   5. Delete the files.
4. Create a CloudWatch rule to run the Lambda every n seconds/minutes/hours.
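Putting the steps together, a minimal handler sketch might look like the following (Python/boto3). The env-var names, the `BATCH_SIZE` default, and the `combine_files` placeholder are illustrative, and receiving messages as a soft "in-flight" lock is a sketch of the coordination idea, not a hardened lock:

```python
import os
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Hypothetical env vars from step 3; names are illustrative.
BUCKET = os.environ["BUCKET"]
QUEUE_URL = os.environ["QUEUE_URL"]
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "9"))


def combine_files(keys):
    ...  # placeholder for your combine/post-process logic


def handler(event, context):
    # 3.1: check for in-flight messages (received but not yet deleted).
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessagesNotVisible"],
    )
    # 3.2: if another run holds messages, stop so no file is processed twice.
    if int(attrs["Attributes"]["ApproximateNumberOfMessagesNotVisible"]) > 0:
        return

    # Receive messages to mark this run as in-flight; the message bodies
    # are ignored -- only the bucket listing below is used.
    received = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=min(BATCH_SIZE, 10),  # SQS caps at 10 per call
    ).get("Messages", [])

    # 3.3: list up to n objects; the keys drive processing.
    listing = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=BATCH_SIZE)
    keys = [obj["Key"] for obj in listing.get("Contents", [])]

    if keys:
        # 3.4: run the processing logic (may get fewer than n keys).
        combine_files(keys)

        # 3.5: delete processed files so the next run doesn't see them.
        s3.delete_objects(
            Bucket=BUCKET,
            Delete={"Objects": [{"Key": k} for k in keys]},
        )

    # Drain the received messages now that their signal has been consumed.
    if received:
        sqs.delete_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
                for m in received
            ],
        )
```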
Some other things to keep in mind based on your situation's specifics:
- If files are being sent rapidly and n is small, the single-track processing enforced by step 3.2 will result in long overall processing times. This also depends on how long each batch takes to process, whether data can safely be processed multiple times, etc.
- ListObjectsV2 can return fewer keys than the MaxKeys parameter; if this is an issue, you could set a larger MaxKeys and just process the first n.
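A sketch of that over-fetch (Python/boto3 again; the bucket name and the 2x factor are arbitrary choices, not recommendations):

```python
import boto3

s3 = boto3.client("s3")
n = 9  # batch size

# Ask for more than n, since ListObjectsV2 may return fewer than MaxKeys
# even when more matching objects exist; then keep only the first n.
listing = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=2 * n)
keys = [obj["Key"] for obj in listing.get("Contents", [])][:n]
```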