We have been using AWS S3 notifications to trigger lambda functions when files land on S3 and this model has worked reasonably well until we noticed that some files are processed multiple times, generating duplicates in our datastore. We noticed that it happened for about 0.05% of our files.
I know we can guard against this by performing an upsert, but what concerns us is the cost of running unnecessary Lambda invocations.
I've searched Google and SO, but only found similar-ish issues. We are not having a timeout problem, as the files have been processed fully. Our files are rather small, with the biggest file being less than 400k. We are not receiving the same event twice, as the events have different request ids, even though they are running on the same file.
After spending quite some time looking through the S3, SNS and Lambda documentation, I found a note specific to S3 notifications that reads:
If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
Effectively this means that S3 notifications are the wrong solution for us, but considering the research time I've invested in this issue, I thought I'd contribute this here for anyone else who may have overlooked the page linked above.
If the sequencer value is the same for the duplicate events: as a workaround, you can have the notifications feed a secondary database, or maintain an index of your S3 objects from the event notifications. Then store and compare the sequencer key values to check for duplicates as each event notification is processed. I did some additional research on how to compare unique values from the event notification in a Lambda function and found article [1], which may help. Also have a look at the external articles [2] and [3] for sample code, and be sure to test this in your development environment before implementing it in production.
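To make the sequencer comparison concrete, here is a minimal sketch in Python. It is not taken from the linked articles. It assumes the standard S3 event record layout (`Records[].s3.bucket.name`, `s3.object.key`, `s3.object.sequencer`) and, as I understand the S3 documentation, that sequencers for the same object key can be compared lexicographically after right-padding the shorter one with zeros. The names `is_new_event` and `_last_sequencer` are made up for illustration, and the dict is only a stand-in for a durable store.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for a durable store (for example a database
# table keyed by bucket and object key). A real Lambda needs external storage,
# because module state is not shared across execution environments.
_last_sequencer: Dict[Tuple[str, str], str] = {}


def is_new_event(record: dict, store: Dict[Tuple[str, str], str] = _last_sequencer) -> bool:
    """Return True if this record is newer than anything seen for its object."""
    s3 = record["s3"]
    obj_key = (s3["bucket"]["name"], s3["object"]["key"])
    sequencer = s3["object"]["sequencer"]

    previous = store.get(obj_key)
    if previous is not None:
        # Assumption: right-pad to equal length, then compare as hex strings.
        width = max(len(previous), len(sequencer))
        if sequencer.ljust(width, "0").upper() <= previous.ljust(width, "0").upper():
            return False  # duplicate or older event, skip it

    store[obj_key] = sequencer
    return True


def handler(event, context=None):
    """Process only records that pass the sequencer check; return their keys."""
    processed = []
    for record in event.get("Records", []):
        if is_new_event(record):
            # ... real processing of the file would go here ...
            processed.append(record["s3"]["object"]["key"])
    return processed
```

With a real database the read and the write above would have to be a single atomic operation (a conditional write, for instance), otherwise two concurrent invocations could both pass the check. Note also that this only avoids the duplicate processing, not the duplicate invocation itself, so the Lambda is still billed for the short run that detects the duplicate.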
References:
[1] https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/
[2] https://cloudonaut.io/your-lambda-function-might-execute-twice-deal-with-it/
[3] https://adrianhesketh.com/2020/11/27/idempotency-and-once-only-processing-in-lambda-part-1