How can i calculate how much memory and cpu my Kafka cluster needs? My cluster consists from 3 nodes, with throughput of ~800 messages per second.
Currently they have (each) 6 GB ram, 2 CPU, 1T disk, and it seems to be not enough. How much would you allocate?
You would need to provide some more details regarding your use-case like average size of messages etc. but here's my 2 cents anyway:
Confluent's documentation might shed some light:
CPUs
Most Kafka deployments tend to be rather light on CPU requirements. As such, the exact processor setup matters less than the other resources. Note that if SSL is enabled, the CPU requirements can be significantly higher (the exact details depend on the CPU type and JVM implementation).You should choose a modern processor with multiple cores. Common clusters utilize 24 core machines.
If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offers will far outweigh a slightly faster clock speed.
How to compute your throughput
It might also be helpful to compute the throughput. For example, if you have 800 messages per second, of 500 bytes each then your throughput is 800*500/(1024*1024) = ~0.4MB/s. Now if your topic is partitioned and you have 3 brokers up and running with 3 replicas that would lead to 0.4/3*3=0.4MB/s per broker. 
More details regarding your architecture can be found in Confluent's whitepaper Apache Kafka and Confluent Reference Architecture. Here's the section for memory usage,
ZooKeeper uses the JVM heap, and 4GB RAM is typically sufficient. Too small of a heap will result in high CPU due to constant garbage collection while too large heap may result in long garbage collection pauses and loss of connectivity within the ZooKeeper cluster.
Kafka brokers use both the JVM heap and the OS page cache. The JVM heap is used for replication of partitions between brokers and for log compaction. Replication requires 1MB (default replica.max.fetch.size) for each partition on the broker. In Apache Kafka 0.10.1 (Confluent Platform 3.1), we added a new configuration (replica.fetch.response.max.bytes) that limits the total RAM used for replication to 10MB, to avoid memory and garbage collection issues when the number of partitions on a broker is high. For log compaction, calculating the required memory is more complicated and we recommend referring to the Kafka documentation if you are using this feature. For small to medium-sized deployments, 4GB heap size is usually sufficient. In addition, it is highly recommended that consumers always read from memory, i.e. from data that was written to Kafka and is still stored in the OS page cache. The amount of memory this requires depends on the rate at this data is written and how far behind you expect consumers to get. If you write 20GB per hour per broker and you allow brokers to fall 3 hours behind in normal scenario, you will want to reserve 60GB to the OS page cache. In cases where consumers are forced to read from disk, performance will drop significantly
Kafka Connect itself does not use much memory, but some connectors buffer data internally for efficiency. If you run multiple connectors that use buffering, you will want to increase the JVM heap size to 1GB or higher.
Consumers use at least 2MB per consumer and up to 64MB in cases of large responses from brokers (typical for bursty traffic). Producers will have a buffer of 64MB each. Start by allocating 1GB RAM and add 64MB for each producer and 16MB for each consumer planned.
There are many different factors that need to be taken into consideration when it comes to tune the configuration of your architecture. I would suggest to go through the aforementioned documentation, monitor your existing cluster and resources and finally tune them accordingly.
I think you want to start by profiling your kafka cluster.
See the answer to this post: CPU Profiling kafka brokers.
It basically recommends that you use a prometheus and grafana stack to visualize your load on a timeline - from this you should be able to determine your bottleneck. And links to an article that describes how.
Also, you may find the post interresting, because the poster seems to have about the same workload as you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With