
How to do de-duplication on records from AWS Kinesis Firehose to Redshift?

I read the official AWS Kinesis Firehose documentation, but it doesn't mention how to handle duplicated events. Does anyone have experience with this? I found people suggesting ElastiCache for filtering; does that mean I need to use AWS Lambda to encapsulate such filtering logic? Is there any simple way, like Firehose, to ingest data into Redshift that also has "exactly once" semantics? Thanks a lot!

asked Jan 16 '16 by Casel Chen


People also ask

Can Kinesis Firehose write to Redshift?

Kinesis Data Firehose delivers your data to your S3 bucket first and then issues an Amazon Redshift COPY command to load the data into your Amazon Redshift cluster. Specify an S3 bucket that you own where the streaming data should be delivered.
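To make that concrete, here is a rough, hypothetical sketch of the kind of COPY that Firehose issues once a batch lands in S3; the table name, bucket, prefix, IAM role, and connection details are all placeholders:

```python
# Roughly the COPY Firehose runs against Redshift after a batch lands in S3.
# Table, bucket, prefix, role ARN, and connection details are placeholders.
import psycopg2  # any Redshift-compatible PostgreSQL driver works

copy_sql = """
    COPY events_staging
    FROM 's3://my-firehose-bucket/2016/01/16/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/firehose-redshift'
    JSON 'auto';
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Firehose issues the equivalent of this for you
```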

Can Firehose create a delivery stream to Redshift?

Firehose automatically delivers the data to the Amazon S3 bucket or Amazon Redshift table that you specify in the delivery stream.
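A minimal sketch of creating such a delivery stream with boto3 might look like the following; every name, ARN, and credential below is a placeholder, not something from the original question:

```python
# Hypothetical sketch: a Firehose delivery stream with a Redshift destination.
# All names, ARNs, endpoints, and credentials are placeholders.
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="events-to-redshift",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery",
        "ClusterJDBCURL": "jdbc:redshift://my-cluster.abc123.us-east-1"
                          ".redshift.amazonaws.com:5439/dev",
        "CopyCommand": {
            "DataTableName": "events_staging",
            "CopyOptions": "JSON 'auto'",
        },
        "Username": "admin",
        "Password": "secret",
        # Firehose stages the data in this bucket before the COPY.
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery",
            "BucketARN": "arn:aws:s3:::my-firehose-bucket",
        },
    },
)
```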

What is the difference between AWS Kinesis streams and Firehose?

Kinesis Data Streams is a low-latency streaming service with the capacity to ingest data at scale. Kinesis Firehose, on the other hand, is a data transfer service: its primary purpose is loading streaming data into Amazon S3, Splunk, Elasticsearch, and Redshift.

Is data replicated for high availability using Kinesis data streams?

In addition, Kinesis Data Streams synchronously replicates data across three Availability Zones, providing high availability and data durability.


1 Answer

You can have duplication on both sides of a Kinesis stream: producers might put the same event into the stream twice, and consumers might read the same event twice.

Producer-side duplication happens when you put an event to the Kinesis stream but, for some reason, are not sure whether the write succeeded, so you put it again. Consumer-side duplication happens when you fetch a batch of events, start processing them, and crash before you manage to checkpoint your position; the next worker then picks up the same batch from the stream, based on the last checkpointed sequence ID.
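A hypothetical producer sketch shows how the first case arises: if put_record errors out after Kinesis has already persisted the record, the retry writes the same event twice (the stream name and event shape are made up):

```python
# Hypothetical producer retry loop. If put_record times out after Kinesis
# already persisted the record, the retry duplicates it in the stream.
import json
import boto3
from botocore.exceptions import ClientError, ReadTimeoutError

kinesis = boto3.client("kinesis")

def put_with_retry(event, attempts=3):
    payload = json.dumps(event)
    for _ in range(attempts):
        try:
            return kinesis.put_record(
                StreamName="my-stream",          # placeholder stream name
                Data=payload,
                PartitionKey=event["event_id"],  # assumes events carry an ID
            )
        except (ClientError, ReadTimeoutError):
            # The request may have succeeded server-side before the error
            # reached us; retrying can therefore duplicate the record.
            continue
    raise RuntimeError("giving up after %d attempts" % attempts)
```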

Before you start solving this problem, evaluate how often such duplication occurs and what its business impact is. Not every system handles financial transactions that can't tolerate duplicates. Nevertheless, if you decide you do need de-duplication, a common way to achieve it is to attach an event-ID to every event and track whether you have already processed that event-ID.

ElastiCache with Redis is a good place to track your event-IDs. Every time you pick up an event for processing, check whether its ID is already in Redis: if it is, skip the event; if it isn't, add it (with a TTL based on the possible time window for such duplication).
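A minimal sketch of that check, assuming each event carries a unique event_id field, using redis-py's atomic SET with NX and a TTL:

```python
# Hypothetical event-ID tracking in ElastiCache/Redis. The endpoint, key
# prefix, and TTL are assumptions; tune the TTL to your duplication window.
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)
DEDUP_TTL_SECONDS = 6 * 60 * 60  # how long duplicates can plausibly arrive

def seen_before(event_id: str) -> bool:
    # SET ... nx=True writes only when the key does not exist yet, so the
    # check and the insert happen in one atomic step.
    first_time = r.set("dedup:" + event_id, 1, nx=True, ex=DEDUP_TTL_SECONDS)
    return not first_time
```

Using a single atomic SET NX rather than a GET followed by a SET avoids a race between two workers picking up the same duplicate at the same time.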

If you choose Kinesis Firehose (instead of Kinesis Streams), you no longer control the consumer application and can't implement this process there. Therefore, you either run the de-duplication logic on the producer side, switch to Kinesis Streams and run your own code in Lambda or the KCL, or settle for the de-duplication functions in Redshift (see below).
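For example, if you switch to Kinesis Streams, a Lambda consumer could apply the Redis check above before processing; seen_before() is the hypothetical helper from the previous sketch, and process() stands in for your real logic:

```python
# Hypothetical Lambda consumer on a Kinesis stream that skips duplicates.
# Records arrive base64-encoded under event["Records"][i]["kinesis"]["data"].
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if seen_before(payload["event_id"]):  # Redis check from above
            continue  # duplicate delivery, skip it
        process(payload)

def process(payload):
    ...  # your business logic
```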

If you are not too sensitive to duplication, you can rely on functions within Redshift itself, such as COUNT(DISTINCT ...) or LAST_VALUE in a window function.
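As one illustration of in-Redshift de-duplication (using ROW_NUMBER, a close cousin of the LAST_VALUE approach; the table and column names are invented):

```python
# Keep one row per event_id when moving from the Firehose staging table to
# the final table. Table and column names (events_staging, events, event_id,
# payload, ingested_at) are hypothetical.
dedup_sql = """
    INSERT INTO events
    SELECT event_id, payload, ingested_at
    FROM (
        SELECT event_id, payload, ingested_at,
               ROW_NUMBER() OVER (PARTITION BY event_id
                                  ORDER BY ingested_at) AS rn
        FROM events_staging
    ) AS ranked
    WHERE rn = 1;
"""
# Run dedup_sql over the same Redshift connection as in the COPY sketch above.
```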

answered Oct 16 '22 by Guy