I am working on a data ingestion use case where data arrives on multiple topics and has to be pushed to multiple tables based on the topic name. I am trying to understand whether having one consumer for all the topics makes any performance difference compared with having one consumer per topic/partition.
The performance difference between these 2 scenarios depends on the number of brokers, the number of partitions and the expected throughput.
When the number of brokers, partitions and the throughput are all high, a single consumer for all partitions very likely won't be able to cope with the traffic.
For example, if you have 5 brokers with 5 partitions each and each partition receives 5MB/s of traffic:
if you have a single consumer: it will need a connection to each broker and will have to handle 5 x 5 x 5 MB/s = 125MB/s. Depending on your consumer config this might not be feasible. Even if you can handle 125MB/s, this does not scale well if you end up adding more partitions.
if you have multiple consumers: each consumer will grab a subset of the partitions. With 5 consumers, each will only have to handle 25MB/s, which is easily feasible on a standard VM.
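The arithmetic behind the two scenarios above can be sketched as a small helper. This is purely illustrative (the function name and the assumption that partitions are split evenly, with the busiest consumer taking the ceiling, are mine), but it shows why the single-consumer load grows with every partition while the per-consumer load stays flat when you scale consumers with partitions:

```python
import math

def per_consumer_throughput(brokers, partitions_per_broker, mb_per_partition, consumers):
    """Return (total cluster throughput, load on the busiest consumer) in MB/s.

    Assumes partitions are spread as evenly as possible across consumers,
    so the busiest consumer owns ceil(partitions / consumers) partitions.
    """
    partitions = brokers * partitions_per_broker
    total = partitions * mb_per_partition
    busiest = math.ceil(partitions / consumers) * mb_per_partition
    return total, busiest

# The example from the answer: 5 brokers x 5 partitions x 5 MB/s.
print(per_consumer_throughput(5, 5, 5, consumers=1))  # single consumer takes all 125 MB/s
print(per_consumer_throughput(5, 5, 5, consumers=5))  # each of 5 consumers takes 25 MB/s
```

The same function also shows the scaling problem: doubling the partitions doubles the single consumer's load, but with one consumer per 5 partitions each consumer's load is unchanged.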
Kafka's consumer group feature makes it very easy to add consumers on the fly. So you can start with only a single consumer and add more if/when the throughput increases.
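To make the "add consumers on the fly" behavior concrete, here is a rough simulation of what a round-robin-style assignor does during a rebalance. The function and the topic/consumer names are illustrative, not Kafka's actual implementation (in reality the group coordinator and the configured partition assignor handle this): when a new consumer with the same group id joins, the same set of partitions is simply redistributed over the larger group.

```python
def assign_round_robin(partitions, consumers):
    """Toy model of a round-robin assignor: partition i goes to consumer i % n."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

partitions = [f"ingest-{n}" for n in range(6)]

# Start with a single consumer: it owns every partition.
print(assign_round_robin(partitions, ["c1"]))

# Throughput grows, so two more consumers join the same group;
# a rebalance spreads the same 6 partitions across all 3.
print(assign_round_robin(partitions, ["c1", "c2", "c3"]))
```

In real code this is just multiple consumers subscribing with the same `group.id`; no partition bookkeeping is needed on your side.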