I am curious what exactly passing an RDD to a function does in Spark.
def my_func(x: RDD[String]): RDD[String] = {
  do_something_here
}
Suppose we define a function as above. When we call the function and pass an existing RDD[String] object as the argument, does my_func make a "copy" of this RDD for the function parameter? In other words, is it call-by-reference or call-by-value?
In Scala nothing gets copied (in the sense of pass-by-value in C/C++) when passed around. Most of the basic types Int, String, Double, etc. are immutable, so passing them around by reference is very safe. (Note: if you pass a mutable object and change it, then anyone holding a reference to that object will see the change.)
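As a minimal (non-Spark) sketch of that caveat, here is what happens with a mutable collection; the object and function names are just illustrative:

import scala.collection.mutable.ArrayBuffer

object ReferenceDemo {
  // The parameter is a reference to the same ArrayBuffer the caller holds,
  // so mutating it here is visible to the caller.
  def appendItem(buf: ArrayBuffer[String]): Unit = {
    buf += "added inside the function"
  }

  def main(args: Array[String]): Unit = {
    val data = ArrayBuffer("original")
    appendItem(data)
    println(data)  // ArrayBuffer(original, added inside the function)
  }
}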
On top of that, RDDs are lazy, distributed, immutable collections. Passing RDDs to functions and applying transformations to them (map, filter, etc.) doesn't really transfer any data or trigger any computation.
All chained transformations are "remembered" and will automatically get triggered in the right order when you invoke an action on the RDD, such as persisting it to storage or collecting it locally at the driver (through collect(), take(n), etc.), as sketched below.
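To illustrate with a shape close to your my_func (a minimal sketch; the map(_.toUpperCase) body and the local SparkSession setup are assumptions, not from your question):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object LazyRddDemo {
  // Passing the RDD just hands over a reference to its lineage; the map below
  // is only recorded as a transformation, not executed here.
  def my_func(x: RDD[String]): RDD[String] = {
    x.map(_.toUpperCase)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LazyRddDemo").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))

    val transformed = my_func(rdd)  // nothing runs yet: no copy, no computation

    // The action below triggers the whole chain (parallelize -> map) at once.
    println(transformed.collect().mkString(", "))  // A, B, C

    spark.stop()
  }
}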