In the following code, the functions fn1 and fn2 appear to be applied to inRDD sequentially, as I can see in the Stages section of the Spark Web UI.
DstreamRDD1.foreachRDD(new VoidFunction<JavaRDD<String>>()
{
    public void call(JavaRDD<String> inRDD)
    {
        inRDD.foreach(fn1);
        inRDD.foreach(fn2);
    }
});
How is it different when the streaming job is run this way? Are the functions below run in parallel on the input DStreams?
DStreamRDD1.foreachRDD(fn1)
DStreamRDD2.foreachRDD(fn2)
Both foreach on RDD and foreachRDD on DStream will run sequentially because they are output operations, meaning they cause the materialization of the execution graph. This would not be the case for any general lazy transformation in Spark, which can run in parallel when the execution graph diverges into multiple separate stages.
For example:
val dStream: DStream[String] = ???
val first = dStream.filter(x => x.contains("h"))
val second = dStream.filter(x => !x.contains("h"))
first.print()
second.print()
The filter transformations need not execute sequentially when you have sufficient cluster resources to run the underlying stages in parallel. Then calling print, which again is an output operation, will cause the print statements to execute one after the other.
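To make the scheduling difference concrete without a running cluster, here is a plain-Java sketch (no Spark involved; SchedulingSketch, fn1, and fn2 are illustrative names) that mimics the two patterns: two foreach calls inside one foreachRDD run strictly one after the other, while two independent output operations can be submitted as separate jobs that may interleave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SchedulingSketch {

    // Mimics inRDD.foreach(fn1); inRDD.foreach(fn2): fn2 does not touch a
    // single element until fn1 has finished the whole dataset.
    static List<String> sequential(List<String> data) {
        List<String> log = new ArrayList<>();
        data.forEach(x -> log.add("fn1:" + x));
        data.forEach(x -> log.add("fn2:" + x));
        return log;
    }

    // Mimics two separate DStreams, each with its own output operation:
    // the two "jobs" are handed to independent workers and may interleave.
    static List<String> parallel(List<String> data) throws Exception {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<?> f1 = pool.submit(() -> data.forEach(x -> log.add("fn1:" + x)));
        Future<?> f2 = pool.submit(() -> data.forEach(x -> log.add("fn2:" + x)));
        f1.get();
        f2.get();
        pool.shutdown();
        return log;
    }

    public static void main(String[] args) throws Exception {
        List<String> data = Arrays.asList("a", "b");
        // Sequential is deterministic: [fn1:a, fn1:b, fn2:a, fn2:b]
        System.out.println(sequential(data));
        // Parallel ordering of fn1/fn2 entries is not guaranteed
        System.out.println(parallel(data));
    }
}
```

In real Spark Streaming the analogue is the JobScheduler: by default it executes the output operations of each batch one at a time. The experimental spark.streaming.concurrentJobs setting (default 1) is sometimes used to allow several jobs per batch to run concurrently, but it is not officially documented and weakens ordering guarantees, so treat it with care.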