 

Spark Streaming + Kafka throughput

In my Spark application I'm reading from a Kafka topic. The topic has 10 partitions, so I've created 10 receivers with one thread per receiver. With this configuration I observe some odd behavior from the receivers. Their median rates are:

Receiver-0  node-1  10K msgs/s
Receiver-1  node-2  2.5K msgs/s
Receiver-2  node-3  2.5K msgs/s
Receiver-3  node-4  2.5K msgs/s
Receiver-4  node-5  2.5K msgs/s
Receiver-5  node-1  10K msgs/s
Receiver-6  node-2  2.6K msgs/s
Receiver-7  node-3  2.5K msgs/s
Receiver-8  node-4  2.5K msgs/s
Receiver-9  node-5  2.5K msgs/s
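A receiver-per-partition setup like the one described above typically looks something like the following sketch, which uses the old receiver-based KafkaUtils.createStream API (the ZooKeeper address, consumer group id, and topic name are placeholders, not taken from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TenReceivers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-ten-receivers")
    val ssc  = new StreamingContext(conf, Seconds(30)) // 30-second batches, as in the question

    // One receiver per Kafka partition, each with a single consumer thread.
    val streams = (0 until 10).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
    }

    // Union the ten input streams into one DStream before processing.
    val messages = ssc.union(streams)
    messages.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each createStream call allocates one executor core to a receiver, so with 10 receivers the cluster must have more than 10 cores available or no cores remain for processing.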

Problem 1: node-1 is receiving as many messages as the other four nodes combined.

Problem 2: The app is not reaching its batch performance limit (30-second batches are computed in a median time of 17 seconds). I would like it to consume enough messages to push computation time to at least 25 seconds.

Where should I look for the bottleneck?

To be clear: there are more messages available to be consumed.

Edit: I had lag on only two partitions, so the first problem is solved. Still, reading 10K messages per second is not very much.

Krever asked Feb 27 '26 07:02

1 Answer

Use Spark's built-in backpressure (available since Spark 1.5, which hadn't been released at the time of your question): https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-streaming-backpressure.adoc

Just set:

spark.streaming.backpressure.enabled=true
spark.streaming.kafka.maxRatePerPartition=X   (set X really high in your case)
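When launching with spark-submit, the two settings can be passed as --conf flags, as in this sketch (the class name, jar name, and the 10000 cap are placeholder values, not from the question):

```shell
# Enable backpressure and cap the per-partition ingest rate.
# spark.streaming.kafka.maxRatePerPartition applies to the direct Kafka stream;
# the cap of 10000 records/sec/partition is an illustrative value.
spark-submit \
  --class com.example.MyStreamingApp \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.kafka.maxRatePerPartition=10000 \
  my-streaming-app.jar
```

The same properties can also be set on the SparkConf in code or in spark-defaults.conf; backpressure then adjusts the actual ingest rate dynamically based on batch processing times, up to the configured cap.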

To find the bottleneck, use the Spark Streaming web UI and look at the DAG of the stage that takes the most time.

Timomo answered Mar 02 '26 08:03


