I'm reading about log compaction in the latest release of Kafka and am curious how this impacts consumers. Do consumers work the same as they ever did, or is there a new process for getting all the latest values?
With "standard" Kafka topics, I use a consumer group to maintain a pointer to the most recent values. But if Kafka is keeping values based on keys instead of time, I'm wondering how consumer groups will work.
What is a log compacted topic? The Kafka documentation says: "Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key."
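The retention rule can be illustrated with a small sketch in plain Python (no Kafka required, record contents are illustrative): for each key, at least the most recent value survives compaction.

```python
def compact(log):
    """Return the compacted log: the last record per key, in offset order.

    `log` is a list of (offset, key, value) tuples, ordered by offset.
    """
    latest = {}  # key -> (offset, key, value)
    for record in log:
        _, key, _ = record
        latest[key] = record  # a later record with the same key replaces the earlier one
    # Preserve the original offset order of the surviving records.
    return sorted(latest.values(), key=lambda r: r[0])

log = [
    (0, "k1", "a"),
    (1, "k2", "b"),
    (2, "k1", "c"),  # same key as offset 0, so offset 0 is eligible for removal
    (3, "k3", "d"),
]
print(compact(log))  # [(1, 'k2', 'b'), (2, 'k1', 'c'), (3, 'k3', 'd')]
```

Note that the surviving records keep their original offsets; compaction never renumbers the log.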
There are multiple ways in which you can manage the growth of your data logs: Method 1: the traditional approach of discarding old data after a retention period. Method 2: storing old logs in compressed form. Method 3: Kafka log compaction, a hybrid approach that retains at least the latest record for each key.
With log compaction, older records are removed when a newer record with the same key exists, so the topic partition retains at least the most recent value for every key.
max.compaction.lag.ms - the maximum delay between the time a message is written and the time the message becomes eligible for compaction. This configuration parameter overrides min.cleanable.dirty.ratio and forces a log segment to become compactable even if the "dirty ratio" is lower than the threshold.
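For reference, compaction is enabled per topic via `cleanup.policy=compact`. A sketch of creating such a topic with the CLI that ships with Kafka (the topic name, broker address, and tuning values here are placeholders, not recommendations):

```shell
bin/kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic user-profiles \
  --partitions 3 \
  --replication-factor 1 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5 \
  --config max.compaction.lag.ms=86400000
```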
It does not affect how consumers work. If you are only interested in the latest value per key and read the whole topic, you might still see "duplicates" for a key (because not all duplicates were eliminated, or because new messages were written after the last compaction run); in that case, simply keep only the latest value you see for each key.
About consumer groups: When a topic gets compacted, there are "holes" in the range of valid offsets. While you are consuming a topic regularly, you will skip over those automatically.
From https://kafka.apache.org/documentation.html#design_compactionbasics
Note also that all offsets remain valid positions in the log, even if the message with that offset has been compacted away; in this case this position is indistinguishable from the next highest offset that does appear in the log. For example, in the picture above the offsets 36, 37, and 38 are all equivalent positions and a read beginning at any of these offsets would return a message set beginning with 38.
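The fetch semantics described in that quote can be sketched as a search for the next offset that still exists in the log (a simplification, assuming we only track which offsets survive):

```python
import bisect

def fetch_from(log_offsets, start):
    """Return the offsets still present in the log that are >= start.

    `log_offsets` is the sorted list of offsets surviving compaction.
    """
    i = bisect.bisect_left(log_offsets, start)
    return log_offsets[i:]

remaining = [35, 38, 39]          # offsets 36 and 37 were compacted away
print(fetch_from(remaining, 36))  # [38, 39]
print(fetch_from(remaining, 37))  # [38, 39]
print(fetch_from(remaining, 38))  # [38, 39]
```

This is why a consumer group's committed offset stays valid even if the record at that exact offset has since been compacted away.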