
Kafka new producer is not able to update metadata after one of the brokers is down

I have a Kafka environment with 2 brokers and 1 ZooKeeper.

While I am producing messages to Kafka, if I stop broker 1 (which is the leader), the client stops producing messages and gives me the error below, even though broker 2 has been elected as the new leader for the topic and its partitions.

org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.

After 10 minutes, since broker 2 is the new leader, I expected the producer to send data to broker 2, but it kept failing with the exception above. lastRefreshMs and lastSuccessfulRefreshMs are still the same, although metadataExpireMs is 300000 for the producer.
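For reference, this is roughly how the failure shows up on my side; a minimal sketch (the topic name and the wrapper method are illustrative, not my actual code):

    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.TimeoutException;

    public class SendProbe {
        // Sends one record and reports the metadata timeout, if any.
        static void sendAndCheck(Producer<byte[], byte[]> producer, String topic, byte[] value) {
            try {
                // get() blocks until the broker acks the record or the send fails
                producer.send(new ProducerRecord<>(topic, value)).get();
            } catch (ExecutionException e) {
                if (e.getCause() instanceof TimeoutException) {
                    // "Failed to update metadata after 60000 ms." is reported here
                    System.err.println("Metadata timeout: " + e.getCause().getMessage());
                }
            } catch (TimeoutException e) {
                // depending on the client version, the metadata timeout can also be
                // thrown directly from send() instead of through the returned future
                System.err.println("Metadata timeout: " + e.getMessage());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }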

I am using the new Kafka producer implementation on the producer side.

It seems that when the producer is initialized, it binds to one broker, and if that broker goes down it does not even try to connect to the other brokers in the cluster.

But my expectation is that if a broker goes down, the producer should immediately check the metadata for the other available brokers and send data to them.

By the way, my topic has 4 partitions and a replication factor of 2, in case that is relevant.

Configuration params:

{request.timeout.ms=30000, retry.backoff.ms=100, buffer.memory=33554432, ssl.truststore.password=null, batch.size=16384, ssl.keymanager.algorithm=SunX509, receive.buffer.bytes=32768, ssl.cipher.suites=null, ssl.key.password=null, sasl.kerberos.ticket.renew.jitter=0.05, ssl.provider=null, sasl.kerberos.service.name=null, max.in.flight.requests.per.connection=5, sasl.kerberos.ticket.renew.window.factor=0.8, bootstrap.servers=[10.201.83.166:9500, 10.201.83.167:9500], client.id=rest-interface, max.request.size=1048576, acks=1, linger.ms=0, sasl.kerberos.kinit.cmd=/usr/bin/kinit, ssl.enabled.protocols=[TLSv1.2, TLSv1.1, TLSv1], metadata.fetch.timeout.ms=60000, ssl.endpoint.identification.algorithm=null, ssl.keystore.location=null, value.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer, ssl.truststore.location=null, ssl.keystore.password=null, key.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer, block.on.buffer.full=false, metrics.sample.window.ms=30000, metadata.max.age.ms=300000, security.protocol=PLAINTEXT, ssl.protocol=TLS, sasl.kerberos.min.time.before.relogin=60000, timeout.ms=30000, connections.max.idle.ms=540000, ssl.trustmanager.algorithm=PKIX, metric.reporters=[], compression.type=none, ssl.truststore.type=JKS, max.block.ms=60000, retries=0, send.buffer.bytes=131072, partitioner.class=class org.apache.kafka.clients.producer.internals.DefaultPartitioner, reconnect.backoff.ms=50, metrics.num.samples=2, ssl.keystore.type=JKS}
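For completeness, the producer is created roughly like this; a minimal sketch, assuming a topic named my-topic (the broker list, client id, acks and serializers match the dump above; everything else is left at its default):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    public class ProducerSetup {
        public static void main(String[] args) {
            Properties props = new Properties();
            // both brokers are listed so the client can bootstrap from either one
            props.put("bootstrap.servers", "10.201.83.166:9500,10.201.83.167:9500");
            props.put("client.id", "rest-interface");
            props.put("acks", "1");
            props.put("retries", 0);                  // as in the dump above
            props.put("metadata.max.age.ms", 300000); // forced metadata refresh interval
            props.put("key.serializer", ByteArraySerializer.class.getName());
            props.put("value.serializer", ByteArraySerializer.class.getName());

            Producer<byte[], byte[]> producer = new KafkaProducer<>(props);
            producer.send(new ProducerRecord<>("my-topic", "hello".getBytes()));
            producer.close();
        }
    }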

Use Case:

1- Start BR1 and BR2, produce data (leader is BR1)

2- Stop BR2, produce data (fine)

3- Stop BR1 (so there is no active broker in the cluster at this point), then start BR2 and produce data (fails although the leader is now BR2)

4- Start BR1, produce data (leader is still BR2, but data is produced fine)

5- Stop BR2 (now BR1 is the leader)

6- Stop BR1 (BR1 is still the leader)

7- Start BR1, produce data (message is produced fine again)

If the producer last sent data successfully to BR1 and then all brokers go down, the producer expects BR1 to come back up, even though BR2 is up and is the new leader. Is this expected behaviour?

asked Feb 29 '16 by jit


2 Answers

After spending hours on this, I figured out Kafka's behaviour in my situation. Maybe this is a bug, or maybe it has to be done this way for reasons that lie under the hood, but if I were writing such an implementation I wouldn't do it this way :)

When all the brokers go down and you are able to bring only one of them back up, it must be the broker that went down last in order to produce messages successfully.

Let's say you have 5 brokers: BR1, BR2, BR3, BR4 and BR5. If they all go down and the last one to die is BR3 (which was the last leader), then starting BR1, BR2, BR4 and BR5 will not help; messages are not produced until you start BR3.

answered Sep 20 '22 by jit


You need to increase the number of retries. In your case you need to set it to >=5.

That is the only way for your producer to know that your cluster has a new leader.

Besides that, make sure that all your brokers have a copy of your partition(s). Else you aren't going to get a new leader.
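A minimal sketch of that change, assuming everything else stays as in the question's config (only the retry settings differ from the original dump):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    public class RetryingProducer {
        static Producer<byte[], byte[]> create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "10.201.83.166:9500,10.201.83.167:9500");
            props.put("acks", "1");
            props.put("retries", 5);            // was 0: retry while the cluster elects a new leader
            props.put("retry.backoff.ms", 100); // pause between retries so metadata can refresh
            props.put("key.serializer", ByteArraySerializer.class.getName());
            props.put("value.serializer", ByteArraySerializer.class.getName());
            return new KafkaProducer<>(props);
        }
    }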

answered Sep 21 '22 by kiran.gilvaz