
How does Deduplication work in Apache Pulsar?

I'm trying to use the deduplication feature of Apache Pulsar.

brokerDeduplicationEnabled=true is set in the standalone.conf file, but when I send the same message from the producer multiple times, I get all the messages at the consumer end. Is this expected behaviour?

Doesn't deduplication mean content-based deduplication, as in AWS SQS?

Here is my producer code for reference.

import json

import pulsar

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/my-topic',
    send_timeout_millis=0,  # 0 disables the send timeout
    producer_name="producer-1")

data = {'key1': 0, 'key2': 1}
encoded_data = json.dumps(data).encode('utf-8')

# Send the same payload ten times
for i in range(10):
    producer.send(encoded_data)

client.close()

Shubham Jain asked Jun 10 '26 22:06


1 Answer

In Pulsar, deduplication doesn't work on the content of the message. It works on the individual message. The intention isn't to deduplicate content but to ensure that an individual message cannot be published more than once.

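When you send a message, the producer assigns it a sequence ID, which together with the producer name uniquely identifies that message. Deduplication ensures that in failure scenarios, such as a send retried after a lost acknowledgement, the same message doesn't get stored in (or written to) Pulsar more than once. The broker does this by comparing the sequence ID to the highest sequence ID it has already stored for that producer. If the message has already been stored, Pulsar ignores it. This way, Pulsar stores each message only once. In your code, each call to producer.send() creates a new message with its own sequence ID, so all ten are stored even though their content is identical. This is part of Pulsar's mechanism for guaranteeing that a message is published exactly once.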
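
To make the difference from content-based deduplication concrete, here is a minimal sketch of the idea in plain Python. It is a toy model, not Pulsar's actual broker code: the DedupBroker class and its publish method are hypothetical names made up for illustration, and the only assumption carried over from Pulsar is that duplicates are detected by producer name and sequence ID rather than by payload.

```python
class DedupBroker:
    """Toy model of broker-side deduplication (hypothetical, not Pulsar's code)."""

    def __init__(self):
        self.log = []        # payloads that were actually stored
        self.last_seq = {}   # producer name -> highest stored sequence ID

    def publish(self, producer_name, sequence_id, payload):
        # Drop the message if this producer has already stored this
        # sequence ID (or a later one). The payload is never inspected.
        if sequence_id <= self.last_seq.get(producer_name, -1):
            return False
        self.log.append(payload)
        self.last_seq[producer_name] = sequence_id
        return True


broker = DedupBroker()
payload = b'{"key1": 0, "key2": 1}'

# Ten separate sends with identical content: each one carries a new
# sequence ID, so all ten are stored.
for seq in range(10):
    broker.publish("producer-1", seq, payload)

# A retry of an already stored message reuses its sequence ID and is dropped.
retry_accepted = broker.publish("producer-1", 9, payload)

print(len(broker.log), retry_accepted)
```

In this model, identical payloads are all kept, and only the re-sent sequence ID is rejected. If you need content-based deduplication as in SQS, it would have to be done in the application, for example by tracking a hash of each payload before sending or after receiving.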

For more details, see PIP 6: Guaranteed Message Deduplication.

Chris Bartholomew answered Jun 17 '26 18:06


