Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark / Scala: Passing RDD to Function

I am curious what exactly passing a RDD to a function does in Spark.

def my_func(x : RDD[String]) : RDD[String] = {
  do_something_here
}

Suppose we define a function as above. When we call the function and pass an existing RDD[String] object as the input parameter, does this my_function make a "copy" for this RDD as the function parameter? In other words, is it being called-by-reference or called-by-value?

like image 303
Jes Avatar asked Jun 25 '15 02:06

Jes


1 Answers

In Scala nothing get's copied (in the sense of pass-by-value you have in C/C++) when passed around. Most of the basic types Int, String, Double, etc. are immutable, so passing them by reference is very safe. (Note: If you are passing a mutable object and you change it, then anyone with a reference to that object will see the change).

On top of that, RDDs are lazy, distributed, immutable collections. Passing RDDs through functions and applying transformation to them (map, filter, etc.) doesn't really transfer any data or triggers any computation.

All chained transformations are "remembered" and will automatically get triggered in the right order when you enforce and action on the RDD, such as persisting it, or collecting it locally at the driver (through collect(), take(n), etc.)

like image 105
marios Avatar answered Oct 20 '22 20:10

marios