Concurrent transformations on RDD in foreachDD function of Spark DStream

In the following code, it appears that the functions fn1 and fn2 are applied to inRDD sequentially, as I can see in the Stages section of the Spark Web UI.

 DstreamRDD1.foreachRDD(new VoidFunction<JavaRDD<String>>()
 {
     public void call(JavaRDD<String> inRDD)
        {
          inRDD.foreach(fn1);
          inRDD.foreach(fn2);
        }
 });

How is it different when the streaming job is run this way? Are the functions below run in parallel on the input DStreams?

DStreamRDD1.foreachRDD(fn1)
DStreamRDD2.foreachRDD(fn2)
asked Oct 29 '22 16:10 by darkknight444

1 Answer

Both foreach on RDD and foreachRDD on DStream will run sequentially because they are output operations (actions), meaning they cause the materialization of the execution graph. This would not be the case for any general lazy transformation in Spark, which can run in parallel when the execution graph diverges into multiple separate stages.

For example:

val dStream: DStream[String] = ???
val first = dStream.filter(x => x.contains("h"))
val second = dStream.filter(x => !x.contains("h"))

first.print()
second.print()

The two filter transformations need not execute sequentially when you have sufficient cluster resources to run the underlying stages in parallel. Calling print, which again is an output operation, will then cause the results to be printed one after the other.
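As a side note: Spark can run jobs concurrently when they are submitted from separate threads in the driver (the FAIR scheduler makes this sharing more even). A minimal sketch of that submission pattern in plain Java, with Runnable tasks standing in for the two Spark actions; the class ParallelJobs and method runBoth are illustrative names, not Spark API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelJobs {
    // Submit two independent "jobs" from separate threads and wait for both,
    // mirroring how a Spark driver could launch inRDD.foreach(fn1) and
    // inRDD.foreach(fn2) concurrently instead of one after the other.
    static int runBoth(Runnable fn1, Runnable fn2) throws Exception {
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<?> job1 = pool.submit(() -> { fn1.run(); completed.incrementAndGet(); });
        Future<?> job2 = pool.submit(() -> { fn2.run(); completed.incrementAndGet(); });
        job1.get();  // block until both "jobs" finish, as the driver would
        job2.get();
        pool.shutdown();
        return completed.get();
    }

    public static void main(String[] args) throws Exception {
        int done = runBoth(
            () -> System.out.println("fn1 applied"),
            () -> System.out.println("fn2 applied"));
        System.out.println("jobs finished: " + done);
    }
}
```

The same idea applied to the question's code would mean wrapping each foreach call in its own thread inside foreachRDD; whether the stages actually overlap still depends on available cluster resources.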

answered Nov 09 '22 05:11 by Yuval Itzchakov