Why do Kafka Streams threads die when the number of source topic partitions changes? Can anyone point to reading material on this?

We increased the number of partitions to process messages in parallel, because message throughput was high. As soon as we increased the number of partitions, all the stream threads subscribed to that topic died. After we changed the consumer group ID and restarted the application, it worked fine.

I know that the application's changelog topic must have the same number of partitions as the source topic. I would like to know the reason behind this.

I saw this link - https://issues.apache.org/jira/browse/KAFKA-6063?jql=project%20%3D%20KAFKA%20AND%20component%20%3D%20streams%20AND%20text%20~%20%22partition%22

I couldn't find the reason there.

https://github.com/apache/kafka/blob/fdc742b1ade420682911b3e336ae04827639cc04/streams/src/main/java/org/apache/kafka/streams/processor/internals/InternalTopicManager.java#L122

Basically, I'd like to know the reason behind this `if` condition.

kartik7153 asked Feb 12 '19 12:02


1 Answer

Input topic partitions define the level of parallelism, and if you have stateful operations like aggregations or joins, the state of those operations is sharded. If you have X input topic partitions, you get X tasks, each with one state shard. Furthermore, state is backed by a changelog topic in Kafka with X partitions, and each shard uses exactly one of those partitions.

If you change the number of input topic partitions to X+1, Kafka Streams tries to create X+1 tasks with X+1 store shards; however, the existing changelog topic has only X partitions. Thus, the whole partitioning of your application breaks, Kafka Streams cannot guarantee correct processing, and it shuts down with an error.
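As a rough illustration of the kind of check involved, here is a minimal sketch in plain Java. It is not the actual Kafka Streams code: the class `ChangelogCheck`, the method `validateChangelog`, and the topic name are hypothetical, and the only assumption is the rule described above (one changelog partition per task).

```java
public class ChangelogCheck {

    // Hypothetical stand-in for the validation: each task needs its own
    // changelog partition, so the counts must match exactly.
    public static void validateChangelog(String topic, int expectedPartitions, int existingPartitions) {
        if (existingPartitions != expectedPartitions) {
            throw new IllegalStateException(
                "Existing changelog topic " + topic + " has " + existingPartitions
                    + " partitions, but " + expectedPartitions + " are required");
        }
    }

    public static void main(String[] args) {
        // Changelog created when the input topic had 4 partitions.
        int existingChangelogPartitions = 4;

        // Input topic still has 4 partitions: the check passes.
        validateChangelog("my-app-store-changelog", 4, existingChangelogPartitions);

        // Input topic grown to 5 partitions: 5 tasks, but only 4 changelog partitions.
        try {
            validateChangelog("my-app-store-changelog", 5, existingChangelogPartitions);
        } catch (IllegalStateException e) {
            System.out.println("Startup would fail: " + e.getMessage());
        }
    }
}
```

Failing fast here is preferable to continuing, because a task without a matching changelog partition would have no consistent place to restore its state from.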

Also note that Kafka Streams assumes that input data is partitioned by key. If you change the number of input topic partitions, the hash-based partitioning changes, which may result in incorrect output, too.
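To see why the key-to-partition mapping shifts, consider this small sketch. It assumes a simplified partitioner, `hash(key) mod numPartitions`, using `String.hashCode()` as a stand-in; Kafka's default partitioner hashes the serialized key bytes with murmur2 instead, but the modulo step has the same effect. The class and method names are made up for illustration.

```java
public class PartitionDemo {

    // Simplified stand-in for a hash-based partitioner (not Kafka's murmur2).
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c", "d"};
        for (String key : keys) {
            int before = partitionFor(key, 4); // original partition count
            int after = partitionFor(key, 5);  // after adding one partition
            System.out.println(key + ": partition " + before + " -> " + after
                + (before == after ? "" : " (moved)"));
        }
    }
}
```

Records for a key that moved now land on a different partition, and therefore on a different task, than the one holding that key's earlier state, so an aggregation or join would see only part of the key's history.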

In general, it's recommended to over-partition topics in the beginning to avoid this issue. If you really need to scale out, it is best to create a new topic with the new number of partitions and start a copy of the application (with a new application ID) in parallel. Afterwards, you update your upstream producer applications to write into the new topic, and finally shut down the old application.
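The "new application ID" part of that migration might look like the sketch below. It uses only `java.util.Properties` with the `application.id` and `bootstrap.servers` config keys; the IDs `my-app` and `my-app-v2`, the broker address, and the class name are placeholders, not values from the question.

```java
import java.util.Properties;

public class MigrationConfig {

    // Builds the basic Streams configuration for one application copy.
    public static Properties streamsConfig(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        // application.id is also used as the consumer group ID and as the
        // prefix of internal topics such as changelogs.
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        return props;
    }

    public static void main(String[] args) {
        // Old copy keeps reading the old topic until producers are switched over.
        Properties oldApp = streamsConfig("my-app", "localhost:9092");
        // New copy reads the new, larger topic and gets fresh internal topics.
        Properties newApp = streamsConfig("my-app-v2", "localhost:9092");

        System.out.println("old: " + oldApp.getProperty("application.id"));
        System.out.println("new: " + newApp.getProperty("application.id"));
    }
}
```

Because the changelog topics are named after the application ID, the new copy creates its own changelogs with the new partition count instead of colliding with the existing ones.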

Matthias J. Sax answered Sep 30 '22 17:09