 

How do you handle Amazon Kinesis Record duplicates?

According to the Amazon Kinesis Streams documentation, a record can be delivered multiple times.

The only way to be sure every record is processed exactly once is to temporarily store records in a database that supports integrity checks (e.g. DynamoDB, ElastiCache, or MySQL/PostgreSQL), or to checkpoint the record ID for each Kinesis shard.

Do you know a better / more efficient way of handling duplicates?
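To make the "database with integrity checks" idea concrete, here is a minimal sketch of a once-only processor built on a unique-key insert. It uses SQLite so the example is self-contained; in practice this would be the MySQL/PostgreSQL or DynamoDB table mentioned above, and the table/column names here are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the durable store; the PRIMARY KEY
# constraint provides the integrity check that rejects duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (record_id TEXT PRIMARY KEY)")

def process_once(record_id, payload, handler):
    # The INSERT succeeds only the first time a record_id is seen;
    # rowcount == 0 means this is a duplicate delivery, so we skip it.
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed (record_id) VALUES (?)", (record_id,)
    )
    if cur.rowcount == 1:
        handler(payload)
        return True
    return False

seen = []
process_once("shard-0:49590338271", "hello", seen.append)
process_once("shard-0:49590338271", "hello", seen.append)  # duplicate, ignored
print(seen)  # -> ['hello']
```

With a real database the insert and the handler's side effects would ideally share one transaction, otherwise a crash between them can still lose or double-process a record.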

Antonio asked Mar 27 '17 23:03


People also ask

How do I delete data from Kinesis stream?

You cannot delete previously inserted data from a stream, but you can read data using the KCL. The KCL creates a checkpoint after each batch of records is read, so when you move on to the next batch of new data, the KCL resumes from the last checkpoint stored in its DynamoDB table, and previously read data is not included in the next batch.
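The resume-after-checkpoint behaviour described above can be simulated in a few lines. This is a self-contained sketch, not the KCL API: the `checkpoints` dict stands in for the lease table the KCL keeps in DynamoDB, and the sequence numbers are illustrative.

```python
# Fake shard: ten records with increasing sequence numbers.
stream = [{"seq": i, "data": f"rec-{i}"} for i in range(10)]
checkpoints = {}  # stands in for the KCL's DynamoDB lease table

def read_batch(shard_id, batch_size):
    # Resume strictly *after* the last checkpointed sequence number,
    # so already-read records are never delivered again.
    last = checkpoints.get(shard_id, -1)
    batch = [r for r in stream if r["seq"] > last][:batch_size]
    if batch:
        checkpoints[shard_id] = batch[-1]["seq"]  # checkpoint after the batch
    return batch

first = read_batch("shard-0", 4)   # records 0..3
second = read_batch("shard-0", 4)  # resumes at 4, no overlap
print([r["seq"] for r in first], [r["seq"] for r in second])
```

Note that a crash after processing a batch but before checkpointing still causes that batch to be re-read, which is exactly the at-least-once behaviour the question is about.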

Is Kinesis exactly once?

Messaging semantics: Kinesis always uses “at least once” message delivery, whereas Kafka supports both “at least once” and “exactly once” message delivery. Message size: A single message in Kinesis can be up to 1MB. In Kafka, the max size is configurable.

How many put records per second does Amazon Kinesis Data Streams support?

It serves as a base throughput unit of a Kinesis data stream. A shard supports 1 MB/second and 1,000 records per second for writes and 2 MB/second for reads.

Does Kinesis data stream maintain order?

Amazon claims their Kinesis streaming product guarantees record ordering. It provides ordering of records, as well as the ability to read and/or replay records in the same order (...) Kinesis is composed of Streams that are themselves composed of one or more Shards. Records are stored in these Shards.


2 Answers

We had exactly that problem when building a telemetry system for a mobile app. In our case we were also unsure whether producers were sending each message exactly once, so for each received record we calculated its MD5 hash on the fly and checked whether it was present in some form of persistent storage. But indeed, which storage to use is the trickiest bit.

Firstly, we tried a plain relational database, but it quickly became a major bottleneck of the whole system, as this is not just a read-heavy but also a write-heavy case: the volume of data going through Kinesis was quite significant.

We ended up with a DynamoDB table storing the MD5 hash of each unique message. The issue was that deleting messages wasn't easy: even though our table had partition and sort keys, DynamoDB does not allow deleting all records with a given partition key in one operation, so we had to query all of them just to obtain the sort key values (which wastes time and capacity). Unfortunately, we had to simply drop the whole table once in a while. Another, similarly suboptimal, solution is to regularly rotate the DynamoDB tables that store the message identifiers.
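The hash-and-check step can be sketched as follows. A dict stands in for the DynamoDB table so the example is self-contained; in the real system this check would be a single conditional `PutItem` (with a condition that the hash attribute does not already exist), so that checking and writing happen atomically.

```python
import hashlib

table = {}  # stands in for the DynamoDB table keyed on the MD5 digest

def first_time_seen(record_body: bytes) -> bool:
    # Hash the record body on the fly, as the answer describes.
    digest = hashlib.md5(record_body).hexdigest()
    if digest in table:      # conditional write would fail -> duplicate
        return False
    table[digest] = True     # conditional write succeeded -> first delivery
    return True

print(first_time_seen(b"event-1"))  # -> True
print(first_time_seen(b"event-1"))  # -> False (duplicate delivery detected)
```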

However, DynamoDB recently introduced a very handy feature, Time To Live (TTL), which means we can now control the size of the table by enabling auto-expiry on a per-record basis. In that sense DynamoDB becomes quite similar to ElastiCache; however, ElastiCache (at least a Memcached cluster) is much less durable: there is no redundancy, and all data residing on terminated nodes is lost during a scale-in operation or a failure.
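With TTL, each dedup record carries its own expiry timestamp and DynamoDB deletes it for you, so no table dropping or rotation is needed. A sketch of building such an item is below; the attribute name `expires_at` and the one-day retention window are assumptions, not values from the answer.

```python
import time

DEDUP_WINDOW_SECONDS = 24 * 3600  # keep hashes for one day (assumption)

def dedup_item(md5_digest, now=None):
    # TTL expects the expiry as a Unix epoch (seconds) in a plain
    # numeric attribute; DynamoDB removes the item after that time.
    now = time.time() if now is None else now
    return {
        "md5": md5_digest,                              # partition key
        "expires_at": int(now) + DEDUP_WINDOW_SECONDS,  # TTL attribute
    }

item = dedup_item("9a0364b9e99bb480dd25e1f0284c8555", now=1_700_000_000)
print(item)  # expires_at = 1700000000 + 86400
```

TTL deletion is background and not instantaneous, so expired items may briefly remain readable; that is usually fine for a dedup cache.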

Dmitry Deryabin answered Sep 16 '22 12:09


What you mention is a general problem of all queue systems that use the "at least once" approach. And it is not just the queue systems: both producers and consumers may process the same message multiple times (due to ReadTimeout errors, etc.). Kinesis and Kafka both use that paradigm. Unfortunately, there is no easy answer.

You may also try an "exactly-once" message queue with a stricter transactional approach. For example, AWS SQS does that with FIFO queues: https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/ . Be aware that SQS throughput is far lower than that of Kinesis.
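SQS FIFO deduplication works on the producer side: within a 5-minute window, messages carrying the same `MessageDeduplicationId` are accepted only once. The real call is `send_message(..., MessageGroupId=..., MessageDeduplicationId=...)`; the sketch below simulates the queue's dedup window with a dict so it is self-contained.

```python
DEDUP_WINDOW = 300  # seconds; the 5-minute SQS FIFO dedup interval

_seen = {}  # dedup_id -> time first accepted (simulates the queue's cache)

def accept(dedup_id, now):
    # A resend with the same dedup_id inside the window is silently
    # dropped by the queue; after the window it is treated as new.
    first = _seen.get(dedup_id)
    if first is not None and now - first < DEDUP_WINDOW:
        return False
    _seen[dedup_id] = now
    return True

print(accept("order-42", now=0.0))    # -> True
print(accept("order-42", now=100.0))  # -> False (within 5 minutes)
print(accept("order-42", now=400.0))  # -> True  (window elapsed)
```

Note this protects against duplicate *sends*; a consumer that crashes mid-processing can still see a message twice, so consumer-side idempotency is still needed.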

To solve your problem, you should be aware of your application domain and try to solve it internally, as you suggested (database checks). Especially when you communicate with an external service (say, an email server), you should be able to recover the operation's state in order to prevent double processing (double sending, in the email server example, may result in multiple copies of the same message in the recipient's mailbox).

See also the following concepts:

  1. At-least-once Delivery: http://www.cloudcomputingpatterns.org/at_least_once_delivery/
  2. Exactly-once Delivery: http://www.cloudcomputingpatterns.org/exactly_once_delivery/
  3. Idempotent Processor: http://www.cloudcomputingpatterns.org/idempotent_processor/
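The idempotent-processor pattern applied to the email example above might look like this minimal sketch. `send_email`, the state store, and the record fields are hypothetical stand-ins; in production the state would live in a durable store checked before every external call.

```python
sent_state = {}  # message_id -> "sent"; would be a durable store in practice
outbox = []

def send_email(to, body):
    # Stand-in for the external email server.
    outbox.append((to, body))

def handle(record):
    # Recover the operation's state before acting: a redelivered record
    # finds the "sent" marker and is skipped instead of emailing twice.
    if sent_state.get(record["id"]) == "sent":
        return
    send_email(record["to"], record["body"])
    sent_state[record["id"]] = "sent"

msg = {"id": "m1", "to": "a@example.com", "body": "hi"}
handle(msg)
handle(msg)  # redelivered by the stream; no second email
print(len(outbox))  # -> 1
```

A crash between the send and the state write can still double-send, which is why true exactly-once delivery to an external service is hard without the service itself supporting idempotency keys.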
az3 answered Sep 17 '22 12:09