Kafka Streams processors - state store and input topic partitioning

I would like to fully understand the rules that Kafka Streams processors must obey with respect to the partitioning of a processor's input and its state(s). Specifically, I would like to understand:

  1. Whether it is possible to use a key for the state store(s) that is not the same as the key of the input topic, and what the potential consequences are
  2. Whether state store keys are shared across partitions, i.e. whether I will get the same value if I access the same key in a processor while it is processing records belonging to two different partitions

I have been doing some research on this, and the answers I found seem unclear and sometimes contradictory: e.g. this one seems to suggest that the stores are totally independent and you can use any key, while this one says that you should never use a store with a key different from the one in the input topic.

Thanks for any clarification.

asked Oct 10 '18 by Aldo Stracquadanio

People also ask

What is a state store in Kafka Streams?

Kafka Streams provides so-called state stores, which stream processing applications can use to store and query data, an important capability when implementing stateful operations.
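For illustration, a minimal DSL sketch (the topic and store names here are made up): a per-key count whose result is backed by a named state store that can later be queried via interactive queries.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;

StreamsBuilder builder = new StreamsBuilder();

// Count page views per user; the counts live in a state store named
// "views-per-user" (hypothetical topic and store names).
builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
       .groupByKey()
       .count(Materialized.as("views-per-user"));
```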

What is the difference between a topic and a partition in Kafka?

Kafka's topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit, holding a subset of the records owned by a topic. Each partition is a single log file to which records are written in an append-only fashion.

What is the relationship between topics and partitions in Kafka?

Partitioning takes the single topic log and breaks it into multiple logs, each of which can live on a separate node in the Kafka cluster. This way, the work of storing messages, writing new messages, and processing existing messages can be split among many nodes in the cluster.

Which processor consumes records from one or more Kafka topics and forwards them to downstream processors?

Source Processor: A source processor is a special type of stream processor that does not have any upstream processors. It produces an input stream to its topology from one or multiple Kafka topics by consuming records from these topics and forwarding them to its downstream processors.


1 Answer

You have to distinguish between input partitions and store shards (and the corresponding changelog topic partitions) to get the complete picture. It also depends on whether you use the DSL or the Processor API, because the DSL does some auto-repartitioning while the Processor API doesn't. Because the DSL compiles down to the Processor API, I'll start with the Processor API.

If you have a topic with, say, 4 partitions and you create a stateful processor that consumes this topic, you will get 4 tasks, each task running a processor instance that maintains one shard of the store. Note that the overall state is split into 4 shards and each shard is basically isolated from the other shards.

From a Processor API runtime point of view, an input topic partition and the corresponding state store shard (including its changelog topic partition) form a unit of parallelism. Hence, the changelog topic for the store is created with 4 partitions, and changelog-topic-partition-X is mapped to input-topic-partition-X. Note that Kafka Streams does not use hash-based partitioning when writing into a changelog topic, but provides the partition number explicitly, to ensure that "processor instance X", which processes input-topic-partition-X, only reads/writes from/into changelog-topic-partition-X.
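As a concrete sketch of such a topology (all names are hypothetical; this uses the classic pre-2.7 Processor API), a persistent store is attached to a processor; MyProcessor is sketched further below:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.state.Stores;

Topology topology = new Topology();
topology.addSource("Source", "input-topic")
        .addProcessor("Process", MyProcessor::new, "Source")
        .addStateStore(
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("my-store"),
                Serdes.String(),
                Serdes.Long()),
            "Process");
// With 4 partitions on "input-topic" there are 4 tasks: the task reading
// input-topic-partition-X owns one shard of "my-store", which is backed
// by changelog-topic-partition-X.
```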

Thus, the runtime itself is agnostic to keys.

If your input topic is not partitioned by key, messages with the same key will be processed by different tasks. Depending on the program, this might be ok (e.g., filtering) or not (e.g., counting per key).

Similarly for state: you can put any key into a state store, but that key is "local" to the corresponding shard. Other tasks will never see it. Thus, if you use the same key in a store on different tasks, the entries will be completely independent of each other (as if they were two different keys).
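To make this concrete, here is a sketch of the MyProcessor referenced above (store and key names are made up). It writes a key of its own choosing into the store; because each task owns its own shard, each of the 4 tasks maintains its own, completely independent "total-count" entry, so this is not a global counter:

```java
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class MyProcessor extends AbstractProcessor<String, String> {
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        super.init(context);
        store = (KeyValueStore<String, Long>) context.getStateStore("my-store");
    }

    @Override
    public void process(final String key, final String value) {
        // The store key need not match the record key. However, "total-count"
        // is local to this task's shard: the other 3 tasks each keep their
        // own, completely independent "total-count" entry.
        final Long count = store.get("total-count");
        store.put("total-count", count == null ? 1L : count + 1L);
    }
}
```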

Using the Processor API, it's your responsibility to partition the input data correctly and to use the stores correctly, depending on the operator semantics you need.

At the DSL level, Kafka Streams makes sure that data is partitioned correctly to guarantee correct operator semantics. First, it's assumed that input topics are partitioned by key. If the key is modified, for example via selectKey(), and a downstream operator is an aggregation, Kafka Streams first repartitions the data to ensure that records with the same key end up in the same topic partition. This ensures that each key is used in a single store shard. Thus, the DSL always partitions the data such that one key is never processed on different shards.
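For example (hypothetical topic and key semantics), the DSL inserts an internal repartition topic between selectKey() and the downstream aggregation, so that all records for one customer end up in the same partition, task, and store shard:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;

StreamsBuilder builder = new StreamsBuilder();

// Input is keyed by order id; we re-key by customer id (assumed to be the
// record value here) and count per customer. Because the key changed,
// Kafka Streams writes the stream to an internal repartition topic before
// counting, so equal keys land in the same partition, task, and shard.
builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
       .selectKey((orderId, customerId) -> customerId)
       .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
       .count();
```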

answered Nov 02 '22 by Matthias J. Sax