We are facing a performance issue while integrating Spark and Kafka streams.
Project setup: we are using a Kafka topic with 3 partitions, producing 3000 messages into each partition, and processing them with Spark direct streaming.
Problem we are facing: on the processing side we use the Spark direct stream approach, per the documentation linked below. Spark should create as many parallel direct streams as there are partitions in the topic (3 in this case). But while reading, we can see that all the messages from partition 1 are processed first, then partition 2, then partition 3. Why is it not processing in parallel? As per my understanding, if it were reading from all the partitions at the same time, the message output should be interleaved.
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers
Did you try setting the spark.streaming.concurrentJobs parameter? In your case, it could be set to three:

sparkConf.set("spark.streaming.concurrentJobs", "3")
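For context, here is a minimal sketch of how that setting might fit into a direct-stream job using the 0-8 integration API from the linked guide. The app name, broker address, and topic name are assumptions for illustration only:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf()
  .setAppName("DirectKafkaExample") // hypothetical app name
  .set("spark.streaming.concurrentJobs", "3") // allow up to 3 jobs to run concurrently

val ssc = new StreamingContext(sparkConf, Seconds(5))

// Assumed broker address and topic name for illustration.
val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val topics = Set("my-topic") // topic with 3 partitions

// The direct stream creates one RDD partition per Kafka partition,
// so each batch RDD here should have 3 partitions.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  println(s"Partitions in this batch: ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()
```

Note that within a single job, the three partitions are already processed as parallel tasks when enough executor cores are available; spark.streaming.concurrentJobs additionally allows multiple jobs (batches) to run at the same time.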
Thanks.