I have a use case like the following. For each incoming event, I want to look at a certain field to see if it's status changed from A to B and if so, send that to an output topic. The flow is like this: An event with key "xyz" comes in with status A, and some time later another event comes in with key "xyz" with status B. I have this code using the high level DSL. <pre class="prettyprint"><code>final KStream<String, DomainEvent> inputStream.... final KStream<String, DomainEvent> outputStream = inputStream .map((k, v) -> new KeyValue<>(v.getId(), v)) .groupByKey(Serialized.with(Serdes.String(), jsonSerde)) .aggregate(DomainStatusMonitor::new, (k, v, aggregate) -> { aggregate.updateStatusMonitor(v); return aggregate; }, Materialized.with(Serdes.String(), jsonSerde)) .toStream() .filter((k, v) -> v.isStatusChangedFromAtoB()) .map((k,v) -> new KeyValue<>(k, v.getDomainEvent())); </code></pre> Is there a better way to write this logic using the DSL? Couple of questions regarding the state store created by the aggregation in the code above. <ol> <li>Is it creating an in-memory state store by default?</li> <li>What will happen if I have an unbounded number of unique incoming keys? If it is using an in-memory store by default, don't I need to switch to a persistent store? How do we handle situations like that in the DSL?</li> <li>If the state store is very large (either in-memory or persistent), how does it affect the startup time? How can I make the stream processing to wait so that the store gets fully initialized? Or will Kafka Streams ensure that the state store is fully initialized before any incoming events are processed?</li> </ol> Thanks in advance!

<ol> <li>By default, a persistent RocksDB store will be used. If you want to use an in-memory store, you would pass in <code>Materialized.as(Stores.inMemoryKeyValueStore(...))</code></li> <li>If you have an infinite number of unique keys, you will eventually run out of main-memory or disk and your application will die. Depending on your semantics, you can get a "TTL" by using a session windowed aggregation with a large "gap" parameter instead to expire old keys.</li> <li>The state will always be restored before processing new data happens. If you use in-memory store, this will happen by consuming the underlying changelog topic. Depending on the size of your state, this can take a while. If you use persistent RocksDB store, the state will be loaded from disk and thus no restore will be required and processing should happen immediately. Only if you loose the state on local disk, a restore from the changelog topic will happen for this case.</li> </ol>

Aggregration and state store retention in kafka streams

Tags:

apache-kafka-streams

I have a use case like the following. For each incoming event, I want to look at a certain field to see if it's status changed from A to B and if so, send that to an output topic. The flow is like this: An event with key "xyz" comes in with status A, and some time later another event comes in with key "xyz" with status B. I have this code using the high level DSL.

final KStream<String, DomainEvent> inputStream....

final KStream<String, DomainEvent> outputStream = inputStream
          .map((k, v) -> new KeyValue<>(v.getId(), v))
                    .groupByKey(Serialized.with(Serdes.String(), jsonSerde))
                    .aggregate(DomainStatusMonitor::new,
                            (k, v, aggregate) -> {
                                aggregate.updateStatusMonitor(v);
                                return aggregate;
                            }, Materialized.with(Serdes.String(), jsonSerde))
                    .toStream()
                    .filter((k, v) -> v.isStatusChangedFromAtoB())
                    .map((k,v) -> new KeyValue<>(k, v.getDomainEvent()));

Is there a better way to write this logic using the DSL?

Couple of questions regarding the state store created by the aggregation in the code above.

Is it creating an in-memory state store by default?
What will happen if I have an unbounded number of unique incoming keys? If it is using an in-memory store by default, don't I need to switch to a persistent store? How do we handle situations like that in the DSL?
If the state store is very large (either in-memory or persistent), how does it affect the startup time? How can I make the stream processing to wait so that the store gets fully initialized? Or will Kafka Streams ensure that the state store is fully initialized before any incoming events are processed?

Thanks in advance!

616

asked Jul 22 '18 01:07

sobychacko

1 Answers

By default, a persistent RocksDB store will be used. If you want to use an in-memory store, you would pass in Materialized.as(Stores.inMemoryKeyValueStore(...))
If you have an infinite number of unique keys, you will eventually run out of main-memory or disk and your application will die. Depending on your semantics, you can get a "TTL" by using a session windowed aggregation with a large "gap" parameter instead to expire old keys.
The state will always be restored before processing new data happens. If you use in-memory store, this will happen by consuming the underlying changelog topic. Depending on the size of your state, this can take a while. If you use persistent RocksDB store, the state will be loaded from disk and thus no restore will be required and processing should happen immediately. Only if you loose the state on local disk, a restore from the changelog topic will happen for this case.

answered Oct 16 '22 20:10

Matthias J. Sax

Related questions
                            
                                How to always consume from latest offset in kafka-streams
                            
                                Stop a Kafka Streams app
                            
                                Kafka KStreams - processing timeouts
                            
                                How to output result of windowed aggregation only when window is finished? [duplicate]
                            
                                Kafka streams.allMetadata() method returns empty list
                            
                                KafkaStreams - InconsistentGroupProtocolException
                            
                                Can we disable log4j logs only for kafka
                            
                                External system queries during Kafka Stream processing
                            
                                Kafka multiple partition ordering
                            
                                How to slow down or set given speed on the Kafka stream consumer?
                            
                                kafka streams - how to set a new key for KTable
                            
                                java.lang.NoSuchMethodError: org.apache.kafka.common.metrics.Sensor.add error in kafka streams
                            
                                how to process data in chunks/batches with kafka streams?
                            
                                Kafka Streams - updating aggregations on KTable
                            
                                Kafka Streams processors - state store and input topic partitioning
                            
                                How to filter keys and value with a Processor using Kafka Stream DSL
                            
                                max.poll.intervals.ms set to int.Max by default
                            
                                RecordTooLargeException in Kafka streams join
                            
                                How to manage Kafka KStream to Kstream windowed join?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With