
Can I have 100s of thousands of topics in a Kafka Cluster?

Tags:

apache-kafka

I have a data-flow use case where I want topics defined per customer repository (of which there might be on the order of 100,000s). Each data flow would be a topic with partitions (on the order of a few tens) defining the different stages of the flow.

Is Kafka a good fit for a scenario like this? If not, how would I remodel my use case to handle such scenarios? Note also that one customer repository's data cannot be mingled with another's, even during processing.

asked Oct 05 '15 by Swami PR



1 Answer

Update March 2021: With Kafka's new KRaft mode*, which entirely removes ZooKeeper from Kafka's architecture, a Kafka cluster can handle millions of topics/partitions. See https://www.confluent.io/blog/kafka-without-zookeeper-a-sneak-peek/ for details.

*short for "Kafka Raft Metadata mode"; in Early Access as of Kafka v2.8


Update September 2018: As of Kafka v2.0, a Kafka cluster can have hundreds of thousands of topics. See https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions.


Initial answer below for posterity:

The rule of thumb is that the number of Kafka topics can be in the thousands.

Jun Rao (Kafka committer; now at Confluent, formerly on LinkedIn's Kafka team) wrote:

At LinkedIn, our largest cluster has more than 2K topics. 5K topics should be fine. [...]

With more topics, you may hit one of those limits: (1) # dirs allowed in a FS; (2) open file handlers (we keep all log segments open in the broker); (3) ZK nodes.

The Kafka FAQ gives the following abstract guideline:

Kafka FAQ: How many topics can I have?

Unlike many messaging systems Kafka topics are meant to scale up arbitrarily. Hence we encourage fewer large topics rather than many small topics. So for example if we were storing notifications for users we would encourage a design with a single notifications topic partitioned by user id rather than a separate topic per user.

The actual scalability is for the most part determined by the total number of partitions across all topics, not by the number of topics itself (see the article linked below for details).

The article http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ (written by the aforementioned Jun Rao) adds further details, and particularly focuses on the impact of the number of partitions.

IMHO your use case / model is a bit of a stretch for a single Kafka cluster, though not necessarily for Kafka in general. With the little information you shared (I understand that a public forum is not the best place for sensitive discussions :-P), the only off-the-hip comment I can offer is to consider using more than one Kafka cluster, since you mentioned that customer data must be strictly isolated anyway (including during the processing steps).

I hope this helps a bit!

answered Sep 19 '22 by Michael G. Noll