My Kafka topic contains statuses keyed by deviceId. I would like to use KStreamBuilder.stream().groupByKey().aggregate(...) to only keep the latest value of a status in a TimeWindow. I guess that, as long as the topic is partitioned by key, the aggregation function can always return the latest value in this fashion:

(key, value, older_value) -> value

Is this a guarantee I can expect from Kafka Streams? Should I roll my own processing method that checks the timestamp?
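For reference, a minimal sketch of the topology I have in mind (the topic name and the 5-minute window size are placeholders, and it uses the newer StreamsBuilder in place of KStreamBuilder):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("device-status")          // placeholder topic name
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))   // placeholder window size
    .aggregate(
        () -> (String) null,                             // initializer: no status seen yet
        (key, value, older_value) -> value,              // always keep the newest record
        Materialized.with(Serdes.String(), Serdes.String()));
```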
Kafka Streams guarantees ordering by offset, but not by timestamp. Thus, by default the "last update wins" policy is based on offsets, not on timestamps. Late-arriving records ("late" as defined by timestamps) are out of order with respect to timestamps, and they will not be reordered: the original offset order is kept.
If you want your window to contain the latest value based on timestamps, you will need to use the Processor API (PAPI) to make this work.
Within Kafka Streams' DSL, you cannot access the record timestamp that is required to get the correct result. An easy way might be to put a .transform() before the .groupBy() and add the timestamp to the record (ie, to its value) itself. Thus, you can use the timestamp within your Aggregator (btw: a .reduce(), which is simpler to use, might also work instead of .aggregate()). Finally, you need to do a .mapValues() after your .aggregate() to remove the timestamp from the value again; a sketch of the whole recipe follows below.
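A minimal sketch of that recipe, with a few assumptions: String keys and values, placeholder topic names and window size, the timestamp encoded as a "<timestamp>:" prefix on the value so that the default String serde suffices, and transformValues() instead of transform() since only the value changes (plus reduce() instead of aggregate(), as suggested above):

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class LatestStatusPerWindow {

    // Parse the timestamp prefix that transformValues() prepends below.
    private static long ts(String value) {
        return Long.parseLong(value.substring(0, value.indexOf(':')));
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("device-status")               // placeholder input topic
            // Prepend the record timestamp to the value: transformValues()
            // exposes the ProcessorContext, which plain DSL operators do not.
            .transformValues(() -> new ValueTransformer<String, String>() {
                private ProcessorContext context;
                @Override public void init(ProcessorContext context) { this.context = context; }
                @Override public String transform(String value) {
                    return context.timestamp() + ":" + value;
                }
                @Override public void close() {}
            })
            .groupByKey()
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))        // placeholder window size
            // "Last update wins" by timestamp instead of by offset: keep
            // whichever value carries the larger embedded timestamp.
            .reduce((current, next) -> ts(next) >= ts(current) ? next : current)
            // Strip the timestamp prefix from the value again.
            .mapValues(value -> value.substring(value.indexOf(':') + 1))
            .toStream((windowedKey, value) -> windowedKey.key())      // drop the window from the key
            .to("latest-device-status");                              // placeholder output topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-status-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```

Using >= in the reducer means that on equal timestamps the record with the larger offset wins, which matches the DSL's default last-update-wins behavior.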
Using this mix-and-match approach of the DSL and the PAPI should simplify your code, as you can use the DSL's windowing support and KTable and do not need to do low-level time-window and state management yourself.
Of course, you could also do all of this in a single low-level stateful processor, but I would not recommend it.