Assume that I have a topic with numerous partitions. I'm writing K/V data into it and want to aggregate that data in tumbling windows by key. Assume that I've launched as many worker instances as I have partitions and that each worker instance is running on a separate machine. How would I go about ensuring that the resultant aggregations include all values for each key? I.e., I don't want each worker instance to have some subset of the values. Is this something that a StateStore would be used for? Does Kafka manage this on its own, or do I need to come up with a method?


Kafka KTable - shared aggregation across machines

2 Answers

How would I go about ensuring that the resultant aggregations include all values for each key? I.e., I don't want each worker instance to have some subset of the values.

In general, Kafka Streams ensures that all values for the same key will be processed by the same (and only one) stream task, which also means only one application instance (what you described as "worker instance") will process the values for that key. Note that an app instance may run one or more stream tasks, but these tasks are isolated from each other.

This behavior is achieved through the partitioning of the data: Kafka Streams ensures that a partition is always processed by one and the same stream task. The logical link to keys/values is that, in Kafka and Kafka Streams, a key is always sent to the same partition (there is a gotcha here, but I'm not sure whether it makes sense to go into details for the scope of this question). Hence one particular partition, among possibly many partitions, contains all the values for the same key.
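The key-to-partition link can be illustrated with a small, self-contained sketch. It is not Kafka's actual code: the class name `KeyPartitioningSketch` and the method `partitionFor` are made up for illustration, and `Arrays.hashCode` stands in for the murmur2 hash that Kafka's default partitioner applies to the serialized key. The point is only that the mapping is a deterministic function of the key and the partition count.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: Kafka's default partitioner hashes the serialized key
// with murmur2; Arrays.hashCode is a stand-in so the sketch needs no Kafka jars.
final class KeyPartitioningSketch {

    // Deterministically maps a non-null key to one of numPartitions partitions.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // Math.floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // "alice" appears twice and is mapped to the same partition both times.
        for (String key : new String[] {"alice", "bob", "alice"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

In this sketch the result depends on `numPartitions` as well as on the key, so the same key is only guaranteed to land in the same partition as long as the partition count and the hashing scheme stay the same.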

In some situations, such as when joining two streams A and B, you must make sure that the join operates on the same key in both streams, so that data from both streams is co-located in the same stream task. Again, this is all about ensuring that the relevant input stream partitions, and thus the matching keys from A and B, are made available in the same stream task. A typical method you'd use here is selectKey(). Once that is done, Kafka Streams ensures that, for joining the two streams A and B as well as for creating the joined output stream, all values for the same key will be processed by the same stream task and thus the same application instance.

Example:

Stream A has key userId with value { georegion }.
Stream B has key georegion with value { continent, description }.

Joining two streams only works (as of Kafka 0.10.0) when both streams use the same key. In this example, this means that you must re-key (and thus re-partition) stream A so that its key changes from userId to georegion. Otherwise, as of Kafka 0.10, you can't join A and B because the data is not co-located in the stream task that is responsible for actually performing the join.

In this example, you could re-key/re-partition stream A via:

// Kafka 0.10.0.x (latest stable release as of Sep 2016)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId)).through("rekeyed-topic")

// Upcoming versions of Kafka (not released yet)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId))

The through() call is only required in Kafka 0.10.0 to actually trigger re-partitioning; later versions of Kafka will do this automatically for you (this upcoming functionality is already completed and available in Kafka trunk).
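To see why re-keying matters for co-location, here is a plain-Java simulation of the userId/georegion example. It uses no Kafka classes: `RekeySketch`, `rekey`, `repartition`, and `partitionFor` are hypothetical names, `String.hashCode` stands in for Kafka's real key hashing, and the sample records are invented. It is a sketch of the idea, not of how Kafka Streams implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RekeySketch {

    // Stand-in for Kafka's key hashing (the real default partitioner uses murmur2).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Mirrors A.map((userId, georegion) -> KeyValue.pair(georegion, userId)):
    // the old value becomes the new key and the old key becomes the new value.
    static List<Map.Entry<String, String>> rekey(List<Map.Entry<String, String>> streamA) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> r : streamA) {
            out.add(Map.entry(r.getValue(), r.getKey()));
        }
        return out;
    }

    // Groups records by the partition their (new) key hashes to, which is
    // roughly what writing them to an intermediate topic achieves.
    static Map<Integer, List<Map.Entry<String, String>>> repartition(
            List<Map.Entry<String, String>> records, int numPartitions) {
        Map<Integer, List<Map.Entry<String, String>>> byPartition = new TreeMap<>();
        for (Map.Entry<String, String> r : records) {
            byPartition
                    .computeIfAbsent(partitionFor(r.getKey(), numPartitions), p -> new ArrayList<>())
                    .add(r);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Invented sample data: userId -> georegion.
        List<Map.Entry<String, String>> streamA = List.of(
                Map.entry("user-1", "emea"),
                Map.entry("user-2", "apac"),
                Map.entry("user-3", "emea"));
        // After re-keying, both "emea" records end up in the same partition.
        System.out.println(repartition(rekey(streamA), 3));
    }
}
```

Once stream A is keyed by georegion and both topics use the same partitioning scheme and partition count, records from A and B with the same georegion land in partitions with the same number, which is what allows a single stream task to perform the join.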

Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?

In general, no. The behavior above is achieved through partitioning, not through state stores.

Sometimes state stores are involved because of the operations you have defined for a stream, which might explain why you were asking this question. For example, a windowing operation will require state to be managed, and thus a state store will be created behind the scenes. But your actual question, "ensuring that the resultant aggregations include all values for each key", has nothing to do with state stores; it's about the partitioning behavior.
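As a rough picture of why a windowed aggregation needs state, the following sketch counts records per key in tumbling windows. It is a simplification under stated assumptions: `TumblingWindowSketch` is a made-up class, a `HashMap` plays the role of the state store, the `key@windowStart` composite key is only an illustration, and timestamps are assumed to be non-negative milliseconds. Kafka Streams manages the real store for you.

```java
import java.util.HashMap;
import java.util.Map;

final class TumblingWindowSketch {

    private final long windowSizeMs;
    // Plays the role of the state store: "key@windowStart" -> running count.
    private final Map<String, Long> counts = new HashMap<>();

    TumblingWindowSketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Tumbling windows are fixed-size and non-overlapping, so every timestamp
    // belongs to exactly one window, identified by its aligned start time.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    // Adds one record to the aggregate for its key and window; returns the new count.
    long add(String key, long timestampMs) {
        String stateKey = key + "@" + windowStart(timestampMs, windowSizeMs);
        return counts.merge(stateKey, 1L, Long::sum);
    }

    public static void main(String[] args) {
        TumblingWindowSketch agg = new TumblingWindowSketch(60_000L);
        System.out.println(agg.add("alice", 5_000L));   // falls into window [0, 60000)
        System.out.println(agg.add("alice", 59_999L));  // same window as above
        System.out.println(agg.add("alice", 60_000L));  // first record of the next window
    }
}
```

The map only produces complete counts if every record for a given key reaches the same instance of this class, which is exactly what partitioning by key provides; the state itself does nothing to gather records from other partitions.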


answered Oct 17 '22 22:10

Michael G. Noll

By "worker instance", I assume you mean a Kafka Streams application instance, right? (Kafka Streams is a library, not a framework, so there is no master/worker pattern and we do not use the term "worker".)

If you want to co-locate data per key, you need to partition the data by key. Either your external producer partitions the data by key from the start, when it writes the data into the topic, or you explicitly set a new key within your Kafka Streams application (using, for example, selectKey() or map()) and re-distribute the data via a call to through(). (In future releases, i.e., 0.10.1, the explicit call to through() will no longer be necessary, and Kafka Streams will re-distribute records automatically if necessary.)

If messages/records should be partitioned by key, the key must not be null. You can also change the partitioning scheme via the producer configuration partitioner.class (see https://kafka.apache.org/documentation.html#producerconfigs).
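For illustration, setting partitioner.class on a producer might look roughly like the sketch below. The broker address and the class name `com.example.GeoRegionPartitioner` are assumptions made up for this example; such a class would have to implement `org.apache.kafka.clients.producer.Partitioner`. The sketch only builds the `Properties` object and does not create a producer.

```java
import java.util.Properties;

final class ProducerConfigSketch {

    // Builds producer settings; the partitioner class name below is hypothetical.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Hypothetical class; it would have to implement
        // org.apache.kafka.clients.producer.Partitioner.
        props.put("partitioner.class", "com.example.GeoRegionPartitioner");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings that would be handed to a KafkaProducer.
        producerProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If you go down this route, keep in mind that every producer writing to topics that are later joined or aggregated together should use the same partitioning scheme; otherwise records with the same key may end up in differently numbered partitions.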

Partitioning is completely independent from StateStores, even if StateStores are usually used on top of partitioned data.

answered Oct 17 '22 22:10

Matthias J. Sax


Tags: java, apache-kafka, apache-kafka-streams

ethrbunny
