Transformations create a new RDD based on an existing RDD. RDDs are immutable, and all transformations in Spark are lazy: data in an RDD is not processed until an action is performed. But without processing the data, how is the new RDD created? For example, in a filter operation, how is a new RDD created without actually loading the RDD into memory and processing it?
Question: For example, in a filter operation, how is a new RDD created without actually loading the RDD into memory and processing it?
For example:
val firstRDD = spark.textFile("hdfs://...")
val secondRDD = firstRDD.filter(someFunction)
val thirdRDD = secondRDD.map(someFunction)
val result = thirdRDD.count()
Since an RDD is defined by a set of transformations, Spark logs those transformations rather than the actual data (this is like an action plan of what needs to be done, e.g. filter with this particular predicate). The graph of transformations that produces an RDD is called its lineage graph.
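To make this concrete, here is a minimal sketch (assuming local mode; parallelize stands in for the HDFS text file from the question, and the predicate is illustrative) that prints the lineage graph with RDD.toDebugString:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

val firstRDD  = sc.parallelize(Seq("a", "", "bb", "ccc")) // stand-in for textFile("hdfs://...")
val secondRDD = firstRDD.filter(_.nonEmpty)               // example predicate
val thirdRDD  = secondRDD.map(_.length)

// No data has been processed yet; toDebugString just renders the recorded plan:
println(thirdRDD.toDebugString)
// Prints something like:
// (N) MapPartitionsRDD[2] at map ...
//  |  MapPartitionsRDD[1] at filter ...
//  |  ParallelCollectionRDD[0] at parallelize ...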
Please see RDD.scala: filter does not evaluate anything. It only builds a new RDD that records your predicate, which is like an action plan. That plan is executed only when you call an action such as count.
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
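Notice that filter only wraps the parent RDD and the cleaned predicate in a MapPartitionsRDD; nothing runs. A minimal sketch that makes the timing visible (assuming local mode; the accumulator name and data are illustrative) counts predicate evaluations with a LongAccumulator:

import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("lazy-filter").setMaster("local[*]"))
val acc = sc.longAccumulator("predicate evaluations")

val numbers = sc.parallelize(1 to 10)
val evens   = numbers.filter { n => acc.add(1); n % 2 == 0 }

println(acc.value)        // 0 -- filter only built a MapPartitionsRDD, ran nothing
val count = evens.count() // the action finally executes the recorded plan
println(acc.value)        // 10 -- the predicate ran once per element
println(count)            // 5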
Lazy evaluation (correcting your quote "all transformations in Spark are lazy" to "all transformations in Spark are lazily evaluated"):
Spark computes RDDs lazily, the first time they are used in an action, so that it can pipeline transformations. So in the example above, the RDDs are evaluated only when the count() action is invoked.
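Pipelining here means filter and map are fused into a single pass over each partition's iterator instead of materializing an intermediate RDD. A rough analogy (plain Scala iterators, not Spark, just to illustrate the idea):

val it        = Iterator(1, 2, 3, 4, 5)
val pipelined = it.filter(_ % 2 == 0).map(_ * 10) // nothing computed yet

// Only consuming the iterator (the "action") does the work, in one pass:
println(pipelined.toList) // List(20, 40)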
Hope that helps...
Spark transformations are lazy in operation. These operations are not computed right away; Spark just remembers the transformation applied to the RDD and returns a pointer to the operation's output. The transformations are computed only when an action is applied. Once an action is applied, Spark breaks the operations into tasks and distributes them to the nodes for execution.
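As a rough sketch of that last point (local mode; the numbers are illustrative): for a shuffle-free chain like this, the partition count of the final RDD determines how many tasks the single stage of count() launches, which you can inspect with getNumPartitions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("task-demo").setMaster("local[*]"))

val data   = sc.parallelize(1 to 100, numSlices = 4) // explicitly 4 partitions
val result = data.filter(_ % 2 == 0).map(_ * 2)

println(result.getNumPartitions) // 4 -- count() below runs as 4 tasks in one stage
println(result.count())          // 50; only now are the tasks scheduled and run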