Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In Apache Kafka why can't there be more consumer instances than partitions?

I'm learning about Kafka, reading the introduction section here

https://kafka.apache.org/documentation.html#introduction

specifically the portion about Consumers. In the second to last paragraph in the Introduction it reads

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

My confusion stems from that last sentence, because in the image right above that paragraph where the author depicts two consumer groups and a 4-partition topic, there are more consumer instances than partitions!

It also doesn't make sense that there can't be more consumer instances than partitions, because then partitions would be incredibly small and it seems like the overhead in creating a new partition for each consumer instance would bog down Kafka. I understand that partitions are used for fault-tolerance and reducing the load on any one server, but the sentence above does not make sense in the context of a distributed system that's supposed to be able to handle thousands of consumers at a time.

like image 683
almel Avatar asked Sep 17 '14 16:09

almel


People also ask

Can there be more consumers than partitions in Kafka?

A consumer can be assigned to consume multiple partitions. So the rule in Kafka is only one consumer in a consumer group can be assigned to consume messages from a partition in a topic and hence multiple Kafka consumers from a consumer group can not read the same message from a partition.

What happens if there are more consumers than partitions in Kafka?

More consumers in a group than partitions means idle consumers. The main way we scale data consumption from a Kafka topic is by adding more consumers to a consumer group. It is common for Kafka consumers to do high-latency operations such as write to a database or a time-consuming computation on the data.

What happens if there are more partitions than consumers?

You can have fewer consumers than partitions (in which case consumers get messages from multiple partitions), but if you have more consumers than partitions some of the consumers will be “starved” and not receive any messages until the number of consumers drops to (or below) the number of partitions.

Can Kafka topic have multiple consumers?

Optimum Number of Kafka Consumers You can have as many consumers as you want. However, the only limitation is that the number of consumers within a consumer group should always be less than or equal to the number of partitions of a topic.


2 Answers

Ok, to understand it, one needs to understand several parts.

  1. In order to provide ordering total order, the message can be sent only to one consumer. Otherwise it would be extremely inefficient, because it would need to wait for all consumers to recieve the message before sending the next one:

However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

Kafka only provides a total order over messages within a partition, not between different partitions in a topic.

Also what you think is a performance penalty (multiple partitions) is actually a performance gain, as Kafka can perform actions of different partitions completely in parallel, while waiting for other partitions to finish.

  1. The picture show different consumer groups, but the limitation of maximum one consumer per partition is only within a group. You still can have multiple consumer groups.

In the beginning the two scenarios are described:

If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

So, the more subscriber groups you have, the lower the performance is, as kafka needs to replicate the messages to all those groups and guarantee the total order.

On the other hand, the less group, and more partitions you have the more you gain from parallizing the message processing.

like image 181
peter Avatar answered Sep 28 '22 11:09

peter


It is important to recall that Kafka keeps one offset per [consumer-group, topic, partition]. That is the reason.

I guess the sentence

Note however that there cannot be more consumer instances than partitions.

is referring to the "automatic consumer group re-balance" mode, the default consumer mode when you just subscribe() some number of consumers to a list of topics.

I assume that because, at least with Kafka 0.9.x, nothing prevents having several consumer instances, members of the same group, reading from the same partition.

You can do something like this in two or more different threads

Properties props = new Properties(); props.put(ConsumerConfig.GROUP_ID_CONFIG, "MyConsumerGroup"); props.put("enable.auto.commit", "false"); consumer = new KafkaConsumer<>(props); TopicPartition partition0 = new TopicPartition("mytopic", 0); consumer.assign(Arrays.asList(partition0)); ConsumerRecords<Integer, String> records = consumer.poll(1000); 

and you will have two (or more) consumers reading from the same partition.

Now, the "issue" is that both consumers will be sharing the same offset, you don't have other option since there is only one group, topic and partition into play.

If both consumers read the current offset at the same time, then both of them will read the same value, and both of them will get the same messages.

If you want each consumer to read different messages you will have to sync them so only one can fetch and commit the offset at at time.

like image 34
Luciano Afranllie Avatar answered Sep 28 '22 11:09

Luciano Afranllie