
Kafka Streams - updating aggregations on KTable

I have a KTable with data that looks like this (key => value), where keys are customer IDs, and values are small JSON objects containing some customer data:

1 => { "name" : "John", "age_group":  "25-30"}
2 => { "name" : "Alice", "age_group": "18-24"}
3 => { "name" : "Susie", "age_group": "18-24" }
4 => { "name" : "Jerry", "age_group": "18-24" }

I'd like to do some aggregation on this KTable, basically keeping a count of the number of records for each age_group. The desired KTable data would look like this:

"18-24" => 3
"25-30" => 1

Let's say Alice, who is in the 18-24 group above, has a birthday that puts her into the next age group. The state store backing the first KTable should now look like this:

1 => { "name" : "John", "age_group":  "25-30"}
2 => { "name" : "Alice", "age_group": "25-30"} # Happy Cake Day
3 => { "name" : "Susie", "age_group": "18-24" }
4 => { "name" : "Jerry", "age_group": "18-24" }

And I'd like the resulting aggregated KTable results to reflect this. e.g.

"18-24" => 2
"25-30" => 2

I may be overgeneralizing the issue described here:

"In Kafka Streams there is no such thing as a final aggregation... Depending on your use case, manual de-duplication would be a way to resolve the issue"

But so far I have only been able to calculate a running total, e.g. Alice's birthday would be interpreted as:

"18-24" => 3 # Old Alice record still gets counted here
"25-30" => 2 # New Alice record gets counted here as well

Edit: here is some additional behavior I noticed that seems unexpected.

The topology I'm using looks like:

KTable<String, String> dataKTable = builder.table("compacted-topic-1", "users-json");

KTable<String, Long> ageRangeCounts = dataKTable
    .groupBy((key, value) -> KeyValue.pair(getAgeRange(value), key))
    .count("age-range-counts");
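(The getAgeRange helper is not shown in the question; a minimal stand-in, assuming the value is a flat JSON string so naive string scanning is enough — a real app would use a JSON library — might look like:)

```java
public class AgeRange {
    // Hypothetical stand-in for the question's getAgeRange helper.
    // Assumes a flat JSON value like { "name" : "Alice", "age_group": "18-24" }.
    static String getAgeRange(String json) {
        String marker = "\"age_group\":";
        int start = json.indexOf(marker) + marker.length();
        int open = json.indexOf('"', start);      // opening quote of the field value
        int close = json.indexOf('"', open + 1);  // closing quote
        return json.substring(open + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(getAgeRange("{ \"name\" : \"Alice\", \"age_group\": \"18-24\" }")); // 18-24
    }
}
```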

1) Empty State

Now, from the initial, empty state, everything looks like this:

compacted-topic-1
(empty)


dataKTable
(empty)


// groupBy()
Repartition topic: $APP_ID-age-range-counts-repartition
(empty)

// count()
age-range-counts state store
(empty)

2) Send a couple of messages

Now, let's send a couple of messages to compacted-topic-1, which is consumed as the KTable above. Here is what happens:

compacted-topic-1
3 => { "name" : "Susie", "age_group": "18-24" }
4 => { "name" : "Jerry", "age_group": "18-24" }

dataKTable
3 => { "name" : "Susie", "age_group": "18-24" }
4 => { "name" : "Jerry", "age_group": "18-24" }


// groupBy()
// why does this generate 4 events???
Repartition topic: $APP_ID-age-range-counts-repartition
18-24 => 3
18-24 => 3
18-24 => 4
18-24 => 4

// count()
age-range-counts state store
18-24 => 0

So I'm wondering:

  • Is what I'm trying to do even possible using Kafka Streams 0.10.1 or 0.10.2? I've tried using groupBy and count in the DSL, but maybe I need to use something like reduce?
  • Also, I'm having a little trouble understanding the circumstances that lead to the add reducer and the subtract reducer being called, so any clarification on these points would be greatly appreciated.
asked Mar 09 '17 by foxygen



1 Answer

If you have your original KTable containing id -> JSON data (let's call it dataKTable), you should be able to get what you want via

KTable<String, Long> countKTablePerRange
    = dataKTable.groupBy(/* map your age-range to be the key */)
                .count("someStoreName");

This should work for all versions of Kafka's Streams API.

Update

About the 4 values in the re-partitioning topic: that's correct. Each update to the "base KTable" writes a record for its "old value" and its "new value". This is required to update the downstream KTable correctly: the old value must be removed from one count and the new value must be added to another count. Because your (count) KTable is potentially distributed (i.e., sharded over multiple parallel running app instances), the two records (old and new) might end up at different instances, because they might have different keys, and thus they must be sent as two independent records. (The record format is more complex than what you show in your question, though.)

This also explains why you need a subtractor and an adder: the subtractor removes the old record from the aggregation result, while the adder adds the new record to it.
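As an illustration (plain Java, not the Streams API; the group names are taken from the question), the count store is conceptually maintained like this when Alice's record changes groups:

```java
import java.util.Map;
import java.util.TreeMap;

public class CountMaintenance {
    // Simulates how the downstream count KTable is maintained:
    // an update to the base KTable emits the old value (routed to the
    // subtractor) and the new value (routed to the adder).
    static Map<String, Long> applyBirthday() {
        Map<String, Long> counts = new TreeMap<>();
        // Initial inserts: there is no old value, so only the adder fires.
        counts.merge("25-30", 1L, Long::sum); // John
        counts.merge("18-24", 1L, Long::sum); // Alice
        counts.merge("18-24", 1L, Long::sum); // Susie
        counts.merge("18-24", 1L, Long::sum); // Jerry

        // Alice's update produces TWO repartition records:
        counts.merge("18-24", -1L, Long::sum); // subtractor: remove old value
        counts.merge("25-30", 1L, Long::sum);  // adder: add new value
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(applyBirthday()); // {18-24=2, 25-30=2}
    }
}
```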

Still not sure why you don't see the correct count in the result. How many instances do you run? Maybe try disabling the KTable cache by setting cache.max.bytes.buffering=0 in StreamsConfig.
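A sketch of that config change (the constant CACHE_MAX_BYTES_BUFFERING_CONFIG exists as of 0.10.1; the application id below is made up):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "age-range-counter"); // hypothetical app id
// Disable the record cache so every single update is forwarded downstream
// immediately instead of being compacted in the cache first.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
```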

answered Oct 08 '22 by Matthias J. Sax