Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

During rolling upgrade/restart, how to detect when a kafka broker is "done"?

I need to automate a rolling restart of a kafka cluster (3 kafka brokers). I can easily do it manually - restart one after the other, while checking the log to see when it's fine (e.g., when the new process has joined the cluster).

What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, all topics up-to-date and such? In my restart script, I have access to the metrics, but to be frank, I did not really see one there which gives me a clear picture.

Another way would be to ask what a good "readyness" probe would be that does not simply check some TCP/IP port, but looks at the actual server...

like image 708
AnoE Avatar asked Jan 18 '19 08:01

AnoE


People also ask

What happens when Kafka broker restarts?

Restart Kafka CFK automatically restarts Kafka clusters when required. It restarts one broker at a time, starting with the highest numbered broker to 0, i.e., broker-n to broker-0, and checks that there are no under replicated partitions on the broker before proceeding to the next broker.

What is rolling restart in Kafka?

A rolling restart means that only one Kafka broker is restarted at a time. The rolling restart doesn't proceed to restart another broker until the first one has been started again and is in sync with the cluster. This keeps your cluster online all the time, and with no message lost.

How do I check a broker in Kafka?

There are 2 ways to get the list of available brokers in a Kafka cluster. Both with the help of scripts from zookeeper. Zookeeper manages the leader election and other coordination things for a Kafka cluster. So Zookeeper has a list of all the Kafka brokers in the cluster.

What happens when a Kafka broker dies?

if a broker dies, then kafka divides up leadership of its topic partitions to the remaining brokers in the cluster.


1 Answers

I would suggest exposing JMX metrics and tracking the following for cluster health

  • the controller count (must be 1 over the whole cluster)
  • under replicated partitions (should be zero for healthy cluster)
  • unclean leader elections (if you don't disable this in server.properties make sure there are none in the metric counts)
  • ISR shrinks within a reasonable time period, like 10 minute window (should be none)

Also, Yelp has tooling for rolling restarts implemented in Python, which requires Jolokia JMX Agents installed on the brokers, and it polls the metrics to make sure some of the above conditions are true

like image 183
OneCricketeer Avatar answered Nov 15 '22 08:11

OneCricketeer