kafka streams session window retention duration

Tags: apache-kafka, apache-kafka-streams

We are using Kafka Streams' SessionWindows to aggregate the arrival of related events. Along with the aggregation, we specify the retention time for the window using the until() API. Stream info:
The session window (inactivity gap) is 1 minute and the retention time passed to until() is 2 minutes. We use a custom TimestampExtractor to map each event's time.

Example:
Event: e1; eventTime: 10:00:00 am; arrivalTime: 2:00 pm (same day)
Event: e2; eventTime: 10:00:30 am; arrivalTime: 2:10 pm (same day)
The second event arrives 10 minutes after e1, which exceeds retention time + inactivity time. Yet the older event e1 is still part of the aggregation, despite the retention time being 2 minutes.

Questions:
1) How does Kafka Streams clean up the state store when using the until() API? The retention value passed as an argument is documented as a "lower bound for how long a window will be maintained", so when exactly is the window purged?

2) Is there a background thread that cleans up the state store periodically? If so, is there a way to identify the actual time at which a window is purged?

3) Is there any stream configuration that would purge the data for a window once the retention time has passed?

asked Jun 07 '17 19:06

vinay

1 Answer

Before I answer your concrete questions: note that retention time is not based on system time, but on "stream time". "Stream time" is an internally tracked notion of time progress, based on whatever the TimestampExtractor returns. Without going into too much detail: for your example with 2 records, "stream time" advances by only 30 seconds when the second record arrives, so the retention time has not passed yet.
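
To make the notion of "stream time" concrete, here is a minimal Java sketch. It assumes that stream time is simply the largest event timestamp observed so far, which is a simplification. The class StreamTimeSketch and its methods are hypothetical names used for illustration only; this is not how Kafka Streams is implemented internally.

```java
import java.time.Duration;
import java.time.LocalTime;

// Simplified model of "stream time" (an assumption for illustration,
// NOT Kafka Streams internals).
class StreamTimeSketch {
    // Largest event timestamp observed so far, in milliseconds of the day.
    private long streamTimeMs = Long.MIN_VALUE;

    // Stream time is driven by the extracted event timestamp and never
    // moves backwards; the wall-clock arrival time plays no role.
    long observe(long eventTimeMs) {
        streamTimeMs = Math.max(streamTimeMs, eventTimeMs);
        return streamTimeMs;
    }

    // Converts a time such as "10:00:30" into milliseconds since midnight.
    static long millisOfDay(String time) {
        return LocalTime.parse(time).toSecondOfDay() * 1000L;
    }

    public static void main(String[] args) {
        StreamTimeSketch tracker = new StreamTimeSketch();
        long afterE1 = tracker.observe(millisOfDay("10:00:00")); // e1, arrives 2:00 pm
        long afterE2 = tracker.observe(millisOfDay("10:00:30")); // e2, arrives 2:10 pm

        long advancedMs = afterE2 - afterE1;
        long retentionMs = Duration.ofMinutes(2).toMillis();

        System.out.println("stream time advanced by " + advancedMs + " ms");
        System.out.println("retention elapsed: " + (advancedMs >= retentionMs));
    }
}
```

Under this model, the 10-minute gap between the arrival times is invisible: only the 30-second difference between the event times counts, which is less than the 2-minute retention.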

Also note that "stream time" does not advance if no new data arrives (for at least one partition). This holds for Kafka 0.11.0 and older, but might change in future releases.

Update: The computation of stream time was changed in Kafka 2.1, and stream time may now advance even if one partition does not deliver data. For details, see KIP-353: Improve Kafka Streams Timestamp Synchronization.

To your questions:

(1) Kafka Streams writes all store updates into a changelog topic and a local RocksDB store. Both are divided into so-called segments of a certain size. If new data arrives (i.e., "stream time" progresses), new segments are created. When this happens, older segments are deleted if and only if all records in an old segment are older than the retention time (i.e., record timestamp smaller than "stream time" minus retention time).

(2) Thus, there is no background thread; cleanup is part of regular processing.

(3) There is no configuration to force purging of older records/windows.

Because whole segments are dropped only once all of their records have expired, the older records within a segment (those with smaller timestamps) are maintained longer than the retention time. The motivation behind this design is performance: expiring on a per-record basis would be too expensive.
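
The segment-based cleanup described above can be illustrated with a toy model in Java. This is a sketch under stated assumptions, not the actual store implementation: the class SegmentExpirySketch, the fixed segment interval, and the rule that assigns a record to a segment by dividing its (non-negative) timestamp by that interval are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of segment-based expiry. Segment sizing and assignment are
// assumptions for illustration, not the real Kafka Streams logic.
class SegmentExpirySketch {
    private final long retentionMs;
    private final long segmentIntervalMs;
    // segment id -> timestamps of the records stored in that segment
    private final TreeMap<Long, List<Long>> segments = new TreeMap<>();
    private long streamTimeMs = Long.MIN_VALUE;

    SegmentExpirySketch(long retentionMs, long segmentIntervalMs) {
        this.retentionMs = retentionMs;
        this.segmentIntervalMs = segmentIntervalMs;
    }

    // Stores a record, advances stream time, then drops expired segments.
    // Cleanup happens here, as part of regular processing; there is no
    // background thread in this model.
    void put(long timestampMs) {
        segments.computeIfAbsent(timestampMs / segmentIntervalMs,
                id -> new ArrayList<>()).add(timestampMs);
        streamTimeMs = Math.max(streamTimeMs, timestampMs);
        long expiryBoundMs = streamTimeMs - retentionMs;
        // A segment is dropped only if ALL of its records are older than the bound.
        segments.values().removeIf(records ->
                records.stream().allMatch(ts -> ts < expiryBoundMs));
    }

    boolean contains(long timestampMs) {
        List<Long> records = segments.get(timestampMs / segmentIntervalMs);
        return records != null && records.contains(timestampMs);
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        // 2-minute retention, 1-minute segments (both values are assumptions).
        SegmentExpirySketch store = new SegmentExpirySketch(120_000L, 60_000L);
        store.put(0L);        // segment 0
        store.put(50_000L);   // segment 0
        store.put(130_000L);  // segment 2; expiry bound is now 10_000
        // Record 0 is past retention, but record 50_000 keeps segment 0 alive.
        System.out.println("record 0 retained: " + store.contains(0L));
        store.put(180_000L);  // segment 3; expiry bound is now 60_000
        // Every record in segment 0 is now expired, so the whole segment is dropped.
        System.out.println("record 0 retained: " + store.contains(0L));
    }
}
```

In this model the record with timestamp 0 outlives the 2-minute retention because a newer record shares its segment, which mirrors why the retention value is only a lower bound.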

answered Oct 12 '22 23:10

Matthias J. Sax
