We are facing a performance issue while integrating Spark and Kafka streams.
Project setup: we are using a Kafka topic with 3 partitions, producing 3000 messages into each partition, and processing them with Spark direct streaming.
Problem we are facing: on the processing side we use the Spark direct stream approach, per the documentation linked below. Spark should create as many parallel direct streams as there are partitions in the topic (3 in this case). But while reading, we can see that all the messages from partition 1 are processed first, then partition 2, then partition 3. Why is it not processing in parallel? As per my understanding, if it were reading from all the partitions at the same time, the message output should be interleaved.
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers
Did you try setting the spark.streaming.concurrentJobs parameter? In your case, it could be set to three:

sparkConf.set("spark.streaming.concurrentJobs", "3")
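For context, here is a minimal sketch of how that setting might fit into a direct-stream job using the 0-8 integration API from the linked guide. The app name, broker address, and topic name are assumptions for illustration only:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf()
  .setAppName("DirectKafkaExample") // hypothetical app name
  .set("spark.streaming.concurrentJobs", "3") // allow up to 3 jobs to run concurrently

val ssc = new StreamingContext(sparkConf, Seconds(5))

// Assumed broker address and topic name for illustration.
val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val topics = Set("my-topic") // topic with 3 partitions

// The direct stream creates one RDD partition per Kafka partition,
// so each batch RDD here should have 3 partitions.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  println(s"Partitions in this batch: ${rdd.getNumPartitions}")
}

ssc.start()
ssc.awaitTermination()
```

Note that within a single job, the three partitions are already processed as parallel tasks when enough executor cores are available; spark.streaming.concurrentJobs additionally allows multiple jobs (batches) to run at the same time.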
Thanks.