
Spark Direct Stream is not creating parallel streams per Kafka partition

We are facing a performance issue while integrating Spark and Kafka streams.

Project setup: We are using a Kafka topic with 3 partitions, producing 3000 messages in each partition, and processing them with Spark direct streaming.

Problem we are facing: On the processing side we use the Spark direct stream approach. According to the documentation below, Spark should create as many parallel direct streams as there are partitions in the topic (3 in this case). But while reading, we can see that all the messages from partition 1 are processed first, then partition 2, then partition 3. Any idea why it is not processing in parallel? As I understand it, if it were reading from all the partitions at the same time, the message output should be interleaved.

http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers
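(As an aside, the "interleaved output" expectation can be sketched with plain JVM threads outside Spark; the partition and message counts below are illustrative, not the actual setup:)

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch}
import scala.jdk.CollectionConverters._

object ParallelPartitionsSketch {
  // Simulate one consumer thread per partition (3 partitions, 5 messages each),
  // which is roughly what the question expects Spark to do for a 3-partition topic.
  def run(): Seq[String] = {
    val processed = new ConcurrentLinkedQueue[String]()
    val done = new CountDownLatch(3)
    (0 until 3).foreach { p =>
      new Thread(() => {
        (0 until 5).foreach { i =>
          processed.add(s"partition-$p:msg-$i")
          Thread.sleep(1) // yield so the partition threads get a chance to interleave
        }
        done.countDown()
      }).start()
    }
    done.await()
    processed.asScala.toSeq
  }

  def main(args: Array[String]): Unit = {
    // With truly parallel reads, the order generally mixes partitions instead of
    // draining partition 1 completely before partition 2.
    run().foreach(println)
  }
}
```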

Aru asked Nov 09 '22 05:11

1 Answer

Did you try setting the spark.streaming.concurrentJobs parameter? In your case it could be set to three:

sparkConf.set("spark.streaming.concurrentJobs", "3")
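For context, this setting would typically be applied when building the streaming context. A minimal sketch, assuming a Spark Streaming application (the app name and batch interval here are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Allow up to 3 streaming jobs to run concurrently instead of the default of 1.
// Note: this is an undocumented setting that parallelizes independent jobs
// (output operations), not the reading of partitions within a single batch.
val sparkConf = new SparkConf()
  .setAppName("DirectKafkaExample") // illustrative name
  .set("spark.streaming.concurrentJobs", "3")
val ssc = new StreamingContext(sparkConf, Seconds(5))
```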

Thanks.

Saravanan Subramanian answered Dec 21 '22 07:12