I am using Apache Kafka to produce and consume a file 5GB in size. I want to know whether messages can be automatically removed from the topic after they are consumed. Is there a way to keep track of consumed messages? I don't want to delete them manually.
Purging of messages in Kafka happens automatically, either by specifying a retention time for a topic or by defining a disk quota for it. So in your case of one 5GB file, the file will be deleted after the retention period you define has passed, regardless of whether it has been consumed or not.
The easiest way to purge or delete messages in a Kafka topic is by setting retention.ms to a low value. The retention.ms configuration controls how long messages are kept in a topic. Once a message's age exceeds the retention time, it becomes eligible for removal (note that deletion actually happens at the log-segment level, so it is not instantaneous).
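For example, retention can be lowered on an existing topic with the kafka-configs tool. This is a sketch, assuming a broker reachable at localhost:9092; the topic name is a placeholder:

```shell
# Keep messages for only 60 seconds (the default is 7 days, i.e. 604800000 ms)
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.ms=60000
```

Because cleanup runs on closed log segments, records may linger briefly past the configured retention before the broker's cleaner removes them.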
It is not possible to remove a single message from a Kafka topic, even if you know its partition and offset. Keep in mind that Kafka is not a key/value store; a topic is an append-only(!) log that represents a stream of data.
In Kafka, keeping track of what has been consumed is the consumer's responsibility, and this is also one of the main reasons why Kafka has such great horizontal scalability.
Using the high-level consumer API will do this for you automatically by committing consumed offsets in ZooKeeper (or, with the more recent configuration option, in a special internal Kafka topic that keeps track of consumed messages).
The simple consumer API leaves it to you to decide how and where to keep track of consumed offsets yourself.
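The offsets a group has committed can be inspected from the command line with the kafka-consumer-groups tool. A sketch, assuming a broker at localhost:9092 and a placeholder group name:

```shell
# Show committed offset, log-end offset, and lag per partition for a group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group
```

The LAG column is the difference between the log-end offset and the committed offset, i.e. how far behind the group is on each partition.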
Kafka does not have a mechanism to directly delete a message when it is consumed.
The closest thing I have found is the following trick, but it is untested, and by design it will not work on the most recent messages:
A potential trick is to use a combination of (a) a compacted topic, (b) a custom partitioner, and (c) a pair of interceptors.

The process would be:
- Use a producer interceptor to add a GUID to the end of the key before it is written.
- Use a custom partitioner that ignores the GUID for partitioning purposes, so all records with the same base key still land in the same partition.
- Use a compacted topic so you can then delete any individual message you need via producer.send(key+GUID, null).
- Use a consumer interceptor to remove the GUID on read.
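The delete step above relies on log compaction: in a compacted topic, producing a record with a null value (a "tombstone") eventually removes the key entirely. As a rough sketch, the topic for this trick would be created like this (broker address and topic name are placeholders):

```shell
# Create a compacted topic; compaction keeps only the latest record per key,
# and a null-value tombstone eventually removes the key altogether
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic my-compacted-topic \
  --config cleanup.policy=compact \
  --config delete.retention.ms=60000
```

Here delete.retention.ms controls how long tombstones themselves are retained; the tombstone send (producer.send(key+GUID, null)) would be done from the producer API. Compaction only runs on closed segments, which is why this scheme cannot remove the most recent messages.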
Have one or more consumers, and want each message to be consumed only once in total across them?
Put them in the same consumer group.
Want to avoid too many messages filling up the disk?
Set up retention in terms of disk space and/or time.
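Both kinds of limit can be combined on one topic; whichever is hit first triggers deletion. A sketch with illustrative values and a placeholder topic name:

```shell
# Cap the topic at ~1 GiB per partition or 7 days, whichever comes first
# (note: retention.bytes applies per partition, not per topic)
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.bytes=1073741824,retention.ms=604800000
```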