What I'd like to do is count incoming records by key on a tumbling window and send only the final count of each window to a downstream topic. My code looks like this:
    KStream<String, Long> longs = builder.stream(
            Serdes.String(), Serdes.Long(), "longs");

    // In one KTable, count by key on a five-second tumbling window.
    KTable<Windowed<String>, Long> longCounts =
            longs.countByKey(TimeWindows.of("longCounts", 5000L));

    // Finally, sink to the long-counts topic.
    longCounts.toStream((wk, v) -> wk.key())
              .to("long-counts");
It looks like everything works as expected, but the aggregations are sent to the destination topic for each incoming record. My question is how can I send only the final aggregation result of each window?
Let's say there are 8,000 records in the KStream and 14 records in the KTable, and assume that for each key in the KStream there is a matching record in the KTable. The expected output would then be 8,000 records.
KTable is an abstraction of a changelog stream from a primary-keyed table. Each record in this changelog stream is an update on the primary-keyed table with the record key as the primary key.
Kafka Streams natively supports "incremental" aggregation functions, in which the aggregation result is updated based on the values captured by each window. Incremental functions include `count()`, `sum()`, `min()`, and `max()`. An average aggregation cannot be computed incrementally.
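It can, however, be assembled from two quantities that are incremental: a running sum and a running count. A minimal sketch of that idea, reusing the `longs` stream from the question and assuming a hypothetical `sumCountSerde` you would have to provide for the pair type:

```java
// Maintain (sum, count) incrementally per key, then derive the average from them.
// KeyValue is reused here as a simple (sum, count) pair; sumCountSerde is a
// hypothetical Serde<KeyValue<Long, Long>> you would need to supply.
KTable<String, Double> averages = longs
    .groupByKey()
    .aggregate(
        () -> new KeyValue<>(0L, 0L),                                   // initial (sum, count)
        (key, value, agg) -> new KeyValue<>(agg.key + value, agg.value + 1),
        Materialized.with(Serdes.String(), sumCountSerde))
    .mapValues(agg -> (double) agg.key / agg.value);                    // average = sum / count
```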
To aggregate a stream in Kafka Streams, two operations are needed. First, group the stream: `groupBy((k, v) -> ...)` when the records need to be (re-)keyed, or `groupByKey()` when they are already keyed; either way the data must be partitioned by key. Second, apply the aggregation itself, for example `count()`, `reduce()`, or `aggregate()`.
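As a rough sketch of those two steps on the question's stream, using the newer DSL methods (`windowedBy` and the `Duration`-based `TimeWindows` come from later releases than the code in the question):

```java
// Step 1: group the records. groupByKey() keeps the existing key (the data is
// already partitioned by it); groupBy((k, v) -> newKey) would re-key and repartition.
KGroupedStream<String, Long> grouped = longs.groupByKey();

// Step 2: apply the aggregation, here a count per key over 5-second tumbling windows.
KTable<Windowed<String>, Long> counts = grouped
    .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
    .count();
```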
In Kafka Streams there is no such thing as a "final aggregation". Windows are kept open all the time to handle out-of-order records that arrive after the window end time has passed. However, windows are not kept forever; they are discarded once their retention time expires, and no special action is triggered when that happens.
See Confluent documentation for more details: http://docs.confluent.io/current/streams/
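How long a window is kept is controlled by its retention time (and, since version 2.1, a grace period for out-of-order records). A minimal sketch with illustrative values, reusing the question's 5-second windows with the newer `Duration`-based API and a made-up store name:

```java
// Count per key over 5-second tumbling windows; window state is retained for 2 minutes
// and out-of-order records are accepted for up to 1 minute after the window ends.
KTable<Windowed<String>, Long> longCounts = longs
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(5))
        .grace(Duration.ofMinutes(1)))
    .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("long-counts-store")
        .withRetention(Duration.ofMinutes(2)));
```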
Thus, for each update to an aggregation, a result record is produced (because Kafka Streams also updates the aggregation result on out-of-order records). Your "final result" would be the latest result record before a window gets discarded. Depending on your use case, manual de-duplication would be a way to resolve the issue, using the lower-level APIs transform() or process().
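A rough sketch of such a manual de-duplication with transform(), under the assumption that the windowed key is first flattened to a "key@windowEnd" String and that the newer StreamsBuilder/Duration APIs are used; the store name, punctuation interval, and lateness margin below are all illustrative. The latest count per window is buffered in a state store and only forwarded once the window end plus the margin has passed on the wall clock:

```java
// Register an illustrative state store for the de-duplicator.
builder.addStateStore(Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("window-dedup-store"),
    Serdes.String(), Serdes.Long()));

longCounts
    .toStream((wk, v) -> wk.key() + "@" + wk.window().end())   // flatten the windowed key
    .transform(() -> new Transformer<String, Long, KeyValue<String, Long>>() {
        private ProcessorContext context;
        private KeyValueStore<String, Long> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            this.context = context;
            this.store = (KeyValueStore<String, Long>) context.getStateStore("window-dedup-store");
            // Periodically forward and drop entries whose window should be closed by now.
            context.schedule(Duration.ofSeconds(10), PunctuationType.WALL_CLOCK_TIME, now -> {
                try (KeyValueIterator<String, Long> it = store.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, Long> entry = it.next();
                        long windowEnd = Long.parseLong(
                            entry.key.substring(entry.key.lastIndexOf('@') + 1));
                        if (windowEnd + 60_000L < now) {        // illustrative allowance for late records
                            context.forward(entry.key, entry.value);
                            store.delete(entry.key);
                        }
                    }
                }
            });
        }

        @Override
        public KeyValue<String, Long> transform(String key, Long count) {
            store.put(key, count);   // remember only the latest update per windowed key
            return null;             // emit nothing on the update path
        }

        @Override
        public void close() {}
    }, "window-dedup-store")
    .to("long-counts", Produced.with(Serdes.String(), Serdes.Long()));
```

With wall-clock punctuation, a window's result is emitted a fixed time after its end regardless of event time; the suppress() approach described in the update below is the event-time-based alternative.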
This blog post might help, too: https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
Another blog post addressing this issue without using punctuations: http://blog.inovatrend.com/2018/03/making-of-message-gateway-with-kafka.html
Update
With KIP-328, a KTable#suppress() operator was added that allows suppressing consecutive updates in a strict manner and emitting a single result record per window; the tradeoff is increased latency.
From Kafka Streams version 2.1, you can achieve this using suppress().
There is an example in the Apache Kafka Streams documentation that sends an alert when a user has had fewer than three events in an hour:
    KGroupedStream<UserId, Event> grouped = ...;

    grouped
        .windowedBy(TimeWindows.of(Duration.ofHours(1)).grace(ofMinutes(10)))
        .count()
        .suppress(Suppressed.untilWindowCloses(unbounded()))
        .filter((windowedUserId, count) -> count < 3)
        .toStream()
        .foreach((windowedUserId, count) ->
            sendAlert(windowedUserId.window(), windowedUserId.key(), count));
As mentioned in the update of this answer, you should be aware of the tradeoff. Moreover, note that suppress() is based on event-time.