
How does Kafka Streams allocate partitions?

I have a Kafka Streams application that is receiving data from topic-1 as KStream and topic-2 as KTable. Both topics have 4 partitions each. Let's say that I have 4 instances of the application running, then each instance will receive data from a single partition for topic-1. How about topic-2 which is received as KTable? Are all instances going to receive data from all 4 partitions in that case? If both the topics are keyed the same, then I guess Kafka Streams will ensure that the same partitions are allocated for an application. If topic-2 doesn't have any keys, but rather the application is going to infer that from the value itself, then that means that all the instances need to get all partitions from topic-2. How does Kafka Streams handle this situation?

Thank you!

asked Apr 27 '18 by sobychacko

People also ask

How do partitions get assigned in Kafka?

Whenever a consumer enters or leaves a consumer group, the brokers rebalance the partitions across consumers, meaning Kafka handles load balancing with respect to the number of partitions per application instance for you. This is great; it's a major feature of Kafka.

How are Kafka partitions distributed?

Partitions are the way that Kafka provides scalability. A Kafka cluster is made up of one or more servers, called brokers in the Kafka universe. Each broker holds a subset of the records that belong to the entire cluster. Kafka distributes the partitions of a particular topic across multiple brokers.

How does Kafka producer choose partition?

By default, Kafka producer relies on the key of the record to decide to which partition to write the record. For two records with the same key, the producer will always choose the same partition.
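To make the "same key, same partition" property concrete, here is a minimal pure-Java sketch. Note that Kafka's real default partitioner uses murmur2 over the serialized key bytes, not `hashCode()`; any deterministic hash demonstrates the same property.

```java
public class PartitionDemo {
    // Illustrative stand-in for the producer's default partitioner:
    // the real one computes murmur2(keyBytes) % numPartitions, but the
    // key property is the same -- equal keys always map to one partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 4;
        int p1 = partitionFor("user-42", partitions);
        int p2 = partitionFor("user-42", partitions);
        System.out.println("same partition: " + (p1 == p2));
    }
}
```

Because the mapping is deterministic, two topics written with the same keys, the same partitioner, and the same partition count end up co-partitioned, which is what joins rely on.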

How Kafka stream works internally?

Kafka Streams partitions data for processing. This partitioning is what enables data locality, elasticity, scalability, high performance, and fault tolerance. Kafka Streams uses the concepts of stream partitions and stream tasks as logical units of its parallelism model.


1 Answer

KTables are sharded according to the input topic's partitions. Thus, just like with a KStream, each instance gets one topic partition assigned and materializes that partition as its shard of the KTable. Kafka Streams makes sure that partitions of different topics are co-located, i.e., one instance will be assigned topic-1 partition-0 and topic-2 partition-0 (and so forth).
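The co-location described above can be sketched in plain Java (a hypothetical model, not Streams' actual assignor code): task i is responsible for partition i of every co-partitioned input topic, so a join on the same key sees both sides locally.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CoPartitionAssignment {
    // Model of Kafka Streams' co-partitioned task assignment:
    // task p processes topic-1 partition p AND topic-2 partition p,
    // so records with the same key from both topics land in one task.
    static Map<Integer, List<String>> assign(int numPartitions) {
        Map<Integer, List<String>> tasks = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            tasks.put(p, List.of("topic-1-" + p, "topic-2-" + p));
        }
        return tasks;
    }

    public static void main(String[] args) {
        // With 4 partitions and 4 instances, each instance runs one task.
        assign(4).forEach((task, parts) ->
            System.out.println("task " + task + " -> " + parts));
    }
}
```

With 4 instances and 4 partitions, each instance ends up running exactly one such task, which is why each instance reads a single partition of each topic.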

If topic-2 has no key set, data will be randomly distributed across the topic's partitions. For this case, you can use a GlobalKTable instead. A GlobalKTable is a full replication of all partitions on every instance. If you do a KStream-GlobalKTable join, you can specify a "mapper" that derives the table lookup key from the stream record (i.e., you can extract the join attribute from the value).
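Here is a pure-Java sketch of that lookup (hypothetical names; the GlobalKTable is modeled as an in-memory map, and the mapper plays the role of the KeyValueMapper passed to the real KStream-GlobalKTable join):

```java
import java.util.Map;
import java.util.function.Function;

public class GlobalTableJoinSketch {
    // Model of a KStream-GlobalKTable join: every instance holds a FULL
    // copy of the table data, and the mapper extracts the join attribute
    // from the stream record's VALUE to look up the table-side record.
    static String join(String streamValue,
                       Function<String, String> mapper,
                       Map<String, String> globalTable) {
        String joinKey = mapper.apply(streamValue); // key derived from value
        return streamValue + "|" + globalTable.get(joinKey);
    }

    public static void main(String[] args) {
        // Example: stream value "order:widget-7"; the join attribute is
        // the product id embedded in the value, not the record key.
        Map<String, String> products = Map.of("widget-7", "Widget (7mm)");
        Function<String, String> mapper = v -> v.split(":")[1];
        System.out.println(join("order:widget-7", mapper, products));
    }
}
```

Because the table is fully replicated, the lookup works on every instance regardless of which partition the stream record came from.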

Note: a KStream-GlobalKTable join has different semantics than a KStream-KTable join. In contrast to the latter, it is not time-synchronized, and thus the join is non-deterministic by design with regard to GlobalKTable updates; i.e., there is no guarantee which KStream record will be the first to "see" a GlobalKTable update and thus join with the updated GlobalKTable record.

answered Sep 28 '22 by Matthias J. Sax