Kafka Streams Sort Within Processing Time Window

Tags: apache-kafka, apache-kafka-streams, stream-processing

I wonder if there's any way to sort records within a window using Kafka Streams DSL or Processor API.

Imagine the following situation as an example (arbitrary one, but similar to what I need):

1. There is a Kafka topic of some events, let's say user clicks, with 10 partitions. Messages are partitioned by key, but each key is unique, so the partitioning is effectively random. Each record contains a user id, which is used later to repartition the stream.
2. We consume the stream and publish each message to another topic, partitioning the record by its user id (that is, we repartition the original stream by user id).
3. Then we consume this repartitioned stream and store the consumed records in a local state store, windowed by 10 minutes. All clicks of a particular user are always in the same partition, but their order is not guaranteed, because the original topic had 10 partitions.
4. I understand the windowing model of Kafka Streams, and that time advances only when new records come in. However, I need this window to use processing time rather than event time, and when the window expires I need to sort the buffered events and emit them in that order to another topic.

Notes:

1. We need to be able to flush/process records within the window using processing time, not event time. We can't wait for the next click to advance the time, because it may never happen.
2. We need to remove all the records from the store as soon as the window is sorted and flushed.
3. If the application crashes, we need to recover (in the same or another instance of the application) and process all the windows that were not processed, without waiting for new records to arrive for a particular user.

I know Kafka Streams 1.0.0 allows using wall-clock time in the Processor API, but I'm not sure what the right way to implement this would be, especially given the recovery requirement described above.

asked Mar 13 '18 15:03

burdiyan

1 Answer

You can see my answer to a similar question here: https://stackoverflow.com/a/44345374/7897191

Since your message keys are already unique you can ignore my comments about de-duplication.

Now that KIP-138 (wall-clock punctuation semantics) has been released in 1.0.0 you should be able to implement the outlined algorithm without issues. It uses the Processor API. I don't know of a way of doing this with only the DSL.
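To make the idea more concrete, here is a minimal plain-Java sketch of the buffer-then-sort logic, deliberately written without any Kafka dependency so it can be run on its own. It is an illustration under assumptions, not the implementation from the linked answer: the names `Click`, `ProcessingTimeSortBuffer`, `add` and `onWallClockTick` are hypothetical, and the `TreeMap` stands in for a persistent state store. In a real topology, `add` would roughly correspond to `Processor#process`, and `onWallClockTick` to a `Punctuator` registered with `context.schedule(intervalMs, PunctuationType.WALL_CLOCK_TIME, punctuator)`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** A click event. Field names are illustrative and not taken from the question. */
final class Click {
    final String userId;
    final long eventTimeMs;
    final String payload;

    Click(String userId, long eventTimeMs, String payload) {
        this.userId = userId;
        this.eventTimeMs = eventTimeMs;
        this.payload = payload;
    }
}

/**
 * Buffers clicks in processing-time windows and releases each window,
 * sorted, once the wall clock has passed the end of that window.
 */
final class ProcessingTimeSortBuffer {
    private final long windowSizeMs;

    // Stand-in for a state store. Key: window start in wall-clock ms,
    // aligned to windowSizeMs. Value: clicks that arrived in that window.
    private final TreeMap<Long, List<Click>> windows = new TreeMap<>();

    ProcessingTimeSortBuffer(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    /** Called per incoming record; nowMs is the wall-clock arrival time. */
    void add(Click click, long nowMs) {
        long windowStart = nowMs - (nowMs % windowSizeMs);
        windows.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(click);
    }

    /**
     * Called periodically on wall-clock time. Returns the clicks of every
     * expired window, sorted by user id and then event time, and removes
     * those windows from the buffer.
     */
    List<Click> onWallClockTick(long nowMs) {
        List<Click> out = new ArrayList<>();
        // A window [start, start + size) has expired when start <= now - size.
        Iterator<Map.Entry<Long, List<Click>>> it =
                windows.headMap(nowMs - windowSizeMs, true).entrySet().iterator();
        while (it.hasNext()) {
            List<Click> clicks = it.next().getValue();
            clicks.sort(Comparator.comparing((Click c) -> c.userId)
                    .thenComparingLong(c -> c.eventTimeMs));
            out.addAll(clicks);
            it.remove(); // also removes the window from the backing map
        }
        return out;
    }

    /** Number of windows still buffered (not yet flushed). */
    int bufferedWindows() {
        return windows.size();
    }
}
```

Because the window start is stored as part of the key, nothing about expiry depends on in-memory state. Assuming the buffer is a persistent, changelog-backed store, it should be restored after a crash, and the first wall-clock punctuation after restart can scan it and flush any windows that expired in the meantime, without waiting for new records.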

answered Oct 21 '22 06:10

Michal Borowiecki
