Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

KafKa partitioner class, assign message to partition within topic using key

Tags:

apache-kafka

I am new to kafka so apology if I sound stupid but what I understood so far is .. A stream of message can be defined as a topic, like a category. And every topic is divided into one or more partitions (each partition can have multiple replicas). so they act in parallel

From the Kafka main site they say

The producer is able to chose which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message).

Does this mean while consuming I will be able to choose the message offset from particular partition? While running multiple partitions is it possible to choose from one specific partition i.e partition 0?

In Kafka 0.7 quick start they say

Send a message with a partition key. Messages with the same key are sent to the same partition.

And the key can be provided while creating the producer as below

    ProducerData<String, String> data = new ProducerData<String, String>("test-topic", "test-key", "test-message");     producer.send(data); 

Now how do I consume message based on this key? what is the actual impact of using this key while producing in Kafka ?

While creating producer in 0.8beta we can provide the partitioner class attribute through the config file. The custom partitioner class can be perhaps created implementing the kafka partitioner interface. But m little confused how exactly it works. 0.8 doc also does not explain much. Any advice or m i missing something ?

like image 357
Hild Avatar asked Aug 13 '13 07:08

Hild


People also ask

Does Kafka partition by key?

As I mentioned in the part about the Kafka record, the key is used for partitioning. By default, Kafka producer relies on the key of the record to decide to which partition to write the record. For two records with the same key, the producer will always choose the same partition.

How are messages stored in topic partitions in Kafka?

Topic partitions contain an ordered set of messages and each message in the partition has a unique offset. Kafka does not track which messages were read by a task or consumer. Consumers must track their own location within each log; the Datastax Connector task store the offsets in config. offset.

How does Kafka partition key work?

As per the defined key, the Kafka producer will elect the specific partition and pushing the Kafka messages or data into the partition. If we haven't provided it, Kafka will use the default hash key partition and push the messages into it. Majorly the Kafka partition is deal with parallelism.

How does Kafka partitioning work?

As per the defined key, the Kafka producer will elect the specific partition and pushing the Kafka messages or data into the partition. If we haven’t provided it, Kafka will use the default hash key partition and push the messages into it.

How to assign all messages with the same key to same partition?

All messages with the same key will go to the same partition. Run Kafka server as described here. Using Kafka Admin API to create the example topic with 4 partitions. As seen above key-0 is always assigned partition 1, key-1 is always assigned partition 0, key-2 is always assigned partition 2 and key-3 is always assigned partition 3.

What is the key of a process message in Kafka?

The key of the message would be the ProcessId. This way i could be sure that all process messages would be partitioned and kafka would guarantee the order. As i am new to Kafka what i managed to figure out that partitions has to be created in advance and that makes everything to difficult. so my questions are:

How to define broker ID in Kafka partition?

In the Kafka partition, we need to define the broker id by the non-negative integer id. The broker’s name will include the combination of the hostname as well as the port name. We have used single or multiple brokers as per the requirement.


Video Answer


1 Answers

Each topic in Kafka is split into many partitions. Partition allows for parallel consumption increasing throughput.

Producer publishes the message to a topic using the Kafka producer client library which balances the messages across the available partitions using a Partitioner. The broker to which the producer connects to takes care of sending the message to the broker which is the leader of that partition using the partition owner information in zookeeper. Consumers use Kafka’s High-level consumer library (which handles broker leader changes, managing offset info in zookeeper and figuring out partition owner info etc implicitly) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.

For example, if there are 10 partitions for a topic and 3 consumer instances (C1,C2,C3 started in that order) all belonging to the same Consumer Group, we can have different consumption models that allow read parallelism as below

Each consumer uses a single stream. In this model, when C1 starts all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams. So, each stream will be assigned to 5 partitions(depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.

Each consumer uses more than one stream (say C1 uses 3, C2 uses 3 and C3 uses 4). In this model, when C1 starts, all the 10 partitions are assigned to the 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. In case the number of streams exceed the partitions, some streams will not get any messages as they will not be assigned any partitions.

like image 84
java_geek Avatar answered Sep 19 '22 21:09

java_geek