
Kafka Capacity Planning

My employer has a Kafka cluster handling valuable data. Is there any way we can get an idea of what percent capacity our cluster is running at? Can our cluster handle larger volumes of traffic? Can we survive for an hour or a day if a single node goes down?

asked Nov 28 '22 by clay

1 Answer

I'm unsure exactly what you mean, so I'm going to take a broad approach.

If by capacity you mean "will my Kafka cluster hold all my logs?", that is a function of:

  • the topic's retention period
  • your log compaction strategy
  • the average size of your Kafka messages
  • the number of messages you expect to push through the system
  • your replication factor
  • whether or not you have compression turned on. See also: Cloudflare's Squeezing The Firehose article

If you have a 2-week retention period with no log compaction (when a message is gone, it's gone), no compression, and during those two weeks you expect to push 10,000 messages that are 1 KB each and replicated 3 times... you'll need 10,000 × 1 KB × 3 = 30,000 KB, or roughly 30 MB, of storage.
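
To make that arithmetic concrete, here is a minimal sketch of the back-of-the-envelope math in Python; every number is just the assumption from the example above, so swap in your own:

    # Back-of-the-envelope storage estimate; every number here is an assumption.
    retention_messages = 10_000   # messages expected during the retention window
    avg_message_kb = 1            # average message size, in KB
    replication_factor = 3        # copies of each message across the cluster
    compression_ratio = 1.0       # 1.0 = no compression; < 1.0 if compression is on

    storage_kb = retention_messages * avg_message_kb * replication_factor * compression_ratio
    print(f"~{storage_kb:,.0f} KB, roughly {storage_kb / 1000:.0f} MB across the cluster")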

As for further computations around the size of your cluster, how many machines can go down before you have problems, disk space, IO, and other operational questions like that, here are some good-looking links on the topic:

  • SO: Kafka Topology Best Practice (answer)
  • SO: How to decide Kafka Cluster size (answer)
  • Hortonworks: Kafka 0.9 Configuration Best Practices (I don't think much has changed on that front in the intervening couple of years).

If by capacity you mean, "How much Kafka traffic can my Kafka cluster, aka the "physical" boxes in my Kafka cluster handle?": ie how fast can Kafka store data on your boxes, then that's another question. If you are wondering (for example) what kind of AWS instance types are fastest for handling Kafka data, or how much memory to give the JVM / what else you can run on that broker, then that's a good thing.

It's worth noting here that from a Unix perspective, the more free memory you have on the box, the more the kernel can use for its file cache (so don't just naively give it all to the JVM ;) ). The type and capacity of your network card matter very much too.

There are a couple of interesting things to read here:

  • Jay Kreps: Benchmarking Apache Kafka: 2M writes per second on cheap machines
  • Load testing Kafka with Ranger

With that theoretical maximum in mind ("more than you'll ever need"), it is probably worth testing your individual brokers / installation, either with Ranger or a similar tool, or by just dumping a ton of real data at it (maybe testing your data pipeline at the same time, which transitions into my next point...).
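
As a rough example of that kind of brute-force test, here is a sketch using the kafka-python client; the broker address, topic name, message size, and producer settings are all assumptions, so point it at a non-production topic and match your real configuration. (Kafka also ships with bin/kafka-producer-perf-test.sh, which does this more thoroughly.)

    # Rough producer throughput test with kafka-python (pip install kafka-python).
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        acks="all",                          # match your real durability settings
        compression_type="gzip",             # match whatever compression you actually run
    )

    payload = b"x" * 1024        # 1 KB dummy message
    num_messages = 100_000

    start = time.time()
    for _ in range(num_messages):
        producer.send("capacity-test", payload)   # hypothetical test topic
    producer.flush()
    elapsed = time.time() - start

    print(f"{num_messages / elapsed:,.0f} msgs/sec, "
          f"{num_messages * len(payload) / elapsed / 1_048_576:.1f} MB/sec")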

If by capacity you mean, "How long, mean or median time, does it take for a message to pass through my data pipeline, getting produced into Kafka, consumed by a microservice, transformed, produced into a new topic, consumed again... and eventually landing at the end of the microservice cluster / data pipeline?"

That's a function of:

  • how much you can partition the data
  • if you have enough consumers in your consumer group to handle all the partitions
  • how long each microservice takes to process a message

Assuming you have a good strategy around partition-level concurrency, I would add tracing information to every message. If you wanted to Keep It Simple, Silly, maybe add a "time of initial ingest" field to your messages. For more complex tracing you could pass a tracing ID along with each message (the initial producer creates this; all other consumers just pass it along, or use it for parentage if you split the message into pieces, etc). If you have the time of initial ingest, then your last microservice can check the current time and compute your end-to-end latency metric.
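
As a sketch of the simple version, assuming a JSON envelope and the kafka-python client (the topic and field names here are made up): the first producer stamps each message, and the last service in the pipeline compares the stamp against the current time.

    import json
    import time
    from kafka import KafkaProducer, KafkaConsumer

    # First producer in the pipeline: stamp each message with its ingest time.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("raw-events", {"ingest_ts": time.time(), "payload": "..."})
    producer.flush()

    # Last consumer in the pipeline: compute the end-to-end latency metric.
    consumer = KafkaConsumer(
        "final-topic",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for record in consumer:
        latency = time.time() - record.value["ingest_ts"]
        print(f"end-to-end latency: {latency:.3f}s")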

Different microservices will take different amounts of time to process their messages. If you had a tracing ID, you could do something interesting like have each microservice write to a Kafka topic recording how long the current service took to process the current message. (Apply more Kafka to your Kafka problem!) Or have every topic write to a search data store with a small TTL on the data: using Elasticsearch to query recent Kafka data so you can get search results across topics is a neat trick I've seen, for example. Then you could see that microservice 5 is slow and spend some time performance tuning it.
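
One possible shape for that per-service timing, again with kafka-python and invented topic / field names: each service wraps its processing step and reports its own duration to a shared tracing topic.

    import json
    import time
    from kafka import KafkaProducer

    tracing = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def do_work(message):
        # stand-in for whatever this particular service actually does
        return message

    def process_with_timing(message, service_name="microservice-5"):
        start = time.time()
        result = do_work(message)
        tracing.send("pipeline-tracing", {
            "trace_id": message["trace_id"],   # passed along from the initial producer
            "service": service_name,
            "duration_ms": (time.time() - start) * 1000,
        })
        return result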

Edit: You may also have luck monitoring your production pipeline with LinkedIn's Burrow tool for Kafka (which, as of 2017, looks like it's still actively getting love). It will monitor whether your consumers are falling behind, along with other things.

I hope this helps. This is, unfortunately, a broader question than it appears on the surface. Ultimately it's a function of disk space, CPU, and what your SLAs are around the data pipeline... and it sometimes comes down to unique factors like your message size, what kind of machines you are running or want to run, and how fast your microservices are. Kafka the technology can handle an amazing amount of traffic: LinkedIn is not a small site, and Kafka is used by some of the most trafficked sites on the internet. A well constructed broker cluster should be able to handle whatever you throw at it, theoretically. The practical part comes down to your workflow, what your demands are, what you're actually doing with it, etc.

answered Mar 11 '23 by RyanWilcox