 

Spark Transformation - Why is it lazy and what is the advantage?


Spark transformations are lazily evaluated: when we call an action, Spark executes all the transformations based on the lineage graph.

What is the advantage of having transformations lazily evaluated?

Does it improve performance and reduce memory consumption compared to eager evaluation?

Is there any disadvantage to having transformations lazily evaluated?

asked Jun 25 '16 by Shankar


People also ask

Why is transformation lazy in Spark?

Whenever a transformation operation is performed in Apache Spark, it is lazily evaluated: it won't be executed until an action is performed.

What are the advantages of lazy evaluation in Spark?

With lazy evaluation, the number of round trips between the driver program and the cluster is reduced, which saves time and memory and increases the speed of computation.

What is an RDD transformation in Spark?

Transformations are lazy in nature, meaning that when we call an operation on an RDD, it does not execute immediately. Spark keeps a record of which operations have been called (through the DAG). We can think of a Spark RDD as the data that we build up through transformations.
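A minimal sketch of that point (the dataset is just made-up numbers): the map call below only records a step in the lineage, and nothing actually runs until the reduce action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyTransformationDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyTransformationDemo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000)   // RDD definition only

    // Transformation: recorded in the lineage (DAG), nothing is computed yet.
    val doubled = numbers.map(_ * 2)

    // Action: this is the point where Spark actually schedules and runs the job.
    val total = doubled.reduce(_ + _)
    println(s"total = $total")

    sc.stop()
  }
}
```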




2 Answers

For transformations, Spark adds them to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.

One advantage of this is that Spark can make many optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed everything as soon as it got it.

For example -- if you executed every transformation eagerly, what would that mean? It means you would have to materialize that many intermediate datasets in memory. This is evidently not efficient -- for one, it would increase your GC costs. (You're really not interested in those intermediate results as such; they are just convenient abstractions for you while writing the program.) So, what you do instead is tell Spark what eventual answer you're interested in, and it figures out the best way to get there.
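A rough sketch of that idea (the HDFS path and record format below are made up for illustration): three chained transformations are only planned, and the single count() action lets Spark pipeline them over each partition instead of materializing an intermediate dataset for every step.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipelinedJobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PipelinedJobSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Hypothetical input path; nothing is read at this point.
    val lines  = sc.textFile("hdfs:///data/events.txt")
    val fields = lines.map(_.split(","))        // transformation: planned only
    val wide   = fields.filter(_.length > 3)    // transformation: planned only

    // No intermediate dataset has been materialized so far. The action below
    // triggers one job in which the read, map and filter are pipelined
    // per partition.
    println(s"rows with more than 3 fields: ${wide.count()}")

    sc.stop()
  }
}
```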

answered Sep 19 '22 by Sachin Tyagi


Consider a 1 GB log file containing error, warning and info messages, stored in HDFS as blocks of 64 or 128 MB (the block size doesn't matter in this context). You first create an RDD called "input" from this text file. Then you create another RDD called "errors" by applying a filter on the "input" RDD to fetch only the lines containing error messages, and finally you call the action first() on the "errors" RDD. Spark will optimize the processing of the log file by stopping as soon as it finds the first occurrence of an error message in any of the partitions. If the same scenario were evaluated eagerly, Spark would have filtered all the partitions of the log file even though you were only interested in the first error message.
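A minimal sketch of that scenario (the log path and the "ERROR" marker are assumptions): the filter is only recorded in the lineage, and first() lets Spark stop as soon as it has one matching line.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstErrorSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FirstErrorSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Hypothetical 1 GB log file split into HDFS blocks / RDD partitions.
    val input  = sc.textFile("hdfs:///logs/app.log")
    val errors = input.filter(_.contains("ERROR"))   // transformation: recorded only

    // Action: Spark runs just enough of the filter to return a single line,
    // rather than filtering every partition up front.
    println(errors.first())

    sc.stop()
  }
}
```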

answered Sep 19 '22 by Aniketh Jain