Concurrent transformations on RDD in foreachDD function of Spark DStream

In the following code, it appears that the functions fn1 and fn2 are applied to inRDD sequentially, as I can see in the Stages section of the Spark Web UI.

 DstreamRDD1.foreachRDD(new VoidFunction<JavaRDD<String>>()
 {
     public void call(JavaRDD<String> inRDD)
        {
          inRDD.foreach(fn1);
          inRDD.foreach(fn2);
        }
 });

How is it different when the streaming job is run this way? Are the functions below run in parallel on the input DStreams?

DStreamRDD1.foreachRDD(fn1)
DStreamRDD2.foreachRDD(fn2)
asked Oct 29 '22 16:10 by darkknight444

1 Answer

Both foreach on RDD and foreachRDD on DStream will run sequentially because they are output operations (actions), meaning they cause the materialization of the execution graph. This would not be the case for any general lazy transformation in Spark, which can run in parallel when the execution graph diverges into multiple separate stages.

For example:

val dStream: DStream[String] = ???
val first = dStream.filter(x => x.contains("h"))
val second = dStream.filter(x => !x.contains("h"))

first.print()
second.print()

The two filter transformations need not execute sequentially when you have sufficient cluster resources to run the underlying stages in parallel. Calling print, which again is an output operation, will then cause the results to be printed one after the other.
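As a side note: Spark can run jobs concurrently when they are submitted from separate threads in the driver (the FAIR scheduler makes this sharing more even). A minimal sketch of that submission pattern in plain Java, with Runnable tasks standing in for the two Spark actions; the class ParallelJobs and method runBoth are illustrative names, not Spark API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelJobs {
    // Submit two independent "jobs" from separate threads and wait for both,
    // mirroring how a Spark driver could launch inRDD.foreach(fn1) and
    // inRDD.foreach(fn2) concurrently instead of one after the other.
    static int runBoth(Runnable fn1, Runnable fn2) throws Exception {
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<?> job1 = pool.submit(() -> { fn1.run(); completed.incrementAndGet(); });
        Future<?> job2 = pool.submit(() -> { fn2.run(); completed.incrementAndGet(); });
        job1.get();  // block until both "jobs" finish, as the driver would
        job2.get();
        pool.shutdown();
        return completed.get();
    }

    public static void main(String[] args) throws Exception {
        int done = runBoth(
            () -> System.out.println("fn1 applied"),
            () -> System.out.println("fn2 applied"));
        System.out.println("jobs finished: " + done);
    }
}
```

The same idea applied to the question's code would mean wrapping each foreach call in its own thread inside foreachRDD; whether the stages actually overlap still depends on available cluster resources.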

answered Nov 09 '22 05:11 by Yuval Itzchakov