 

How to detect duplicate messages in a Kafka topic?

Hi, I have an architecture similar to the one in the image below.

I have two Kafka producers that send messages to the same Kafka topic, and they frequently produce duplicate messages.

Is there an easy way to handle this, similar to the built-in duplicate detection of a Service Bus topic?

Thank you for your help.

[architecture diagram]

asked Jan 03 '18 by ankush reddy


2 Answers

Assuming that you actually have multiple different producers writing the same messages, I can see these two options:

1) Write all duplicates to a single Kafka topic, then use something like Kafka Streams (or any other stream processor like Flink, Spark Streaming, etc.) to deduplicate the messages and write deduplicated results to a new topic.

Here's a great Kafka Streams example using state stores: https://github.com/confluentinc/kafka-streams-examples/blob/4.0.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java
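The linked example keeps recently seen event IDs in a Kafka Streams state store with a retention window. The core idea can be sketched in plain Python with a time-bounded cache (the class and names here are illustrative, not a Kafka API):

```python
import time

class Deduplicator:
    """Drops messages whose unique id was already seen within a retention
    window. Plays the role of the windowed state store in the linked
    Kafka Streams example."""

    def __init__(self, retention_seconds=3600):
        self.retention = retention_seconds
        self.seen = {}  # message id -> timestamp when first seen

    def is_duplicate(self, message_id, now=None):
        now = time.time() if now is None else now
        # Evict ids older than the retention window so the cache stays bounded.
        self.seen = {mid: ts for mid, ts in self.seen.items()
                     if now - ts <= self.retention}
        if message_id in self.seen:
            return True
        self.seen[message_id] = now
        return False

dedup = Deduplicator(retention_seconds=3600)
messages = [("id-1", "a"), ("id-2", "b"), ("id-1", "a"), ("id-3", "c")]
unique = [payload for mid, payload in messages if not dedup.is_duplicate(mid)]
# unique == ["a", "b", "c"]
```

In the real Kafka Streams version the "seen" map is a fault-tolerant state store backed by a changelog topic, so deduplication survives restarts; the retention window bounds its size, which means a duplicate arriving after the window has passed will slip through.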

2) Make sure that duplicated messages have the same message key, then enable log compaction on the topic. Compaction keeps only the latest record per key, so Kafka will eventually remove the duplicates — but only eventually: consumers can still see duplicates before the log cleaner runs, and records in the active segment are never compacted. This approach is less reliable, but if you tune the compaction settings properly it might give you what you want.
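For reference, a minimal sketch of the topic-level settings involved (the values are illustrative — tune them for your workload):

```properties
# Compact instead of deleting by retention time
cleanup.policy=compact
# Lower ratio = the log cleaner compacts more aggressively
min.cleanable.dirty.ratio=0.1
# Roll segments often so records leave the (never-compacted) active segment sooner
segment.ms=600000
```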

answered Nov 09 '22 by sap1ens


Update: Apache Kafka now supports exactly-once semantics: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
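One caveat: exactly-once semantics prevents duplicates caused by producer retries and failures within a single producer session (via the idempotent producer and transactions); it does not merge identical messages sent independently by two different producers, which is the situation in the question. A sketch of the producer settings involved (the `transactional.id` value is illustrative):

```properties
# Idempotent producer: broker discards duplicates caused by retries
enable.idempotence=true
acks=all
# Required only for transactional (cross-partition, cross-session) guarantees
transactional.id=my-producer-1
```

Consumers that should only see committed transactional writes additionally need `isolation.level=read_committed`.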

answered Nov 10 '22 by JR ibkr