 

Why is persist() lazily evaluated in Spark?

I understand that in Spark there are two types of operations:

  1. Transformations
  2. Actions

Transformations like map() and filter() are evaluated lazily, so optimizations can be applied when an action is executed. For example, if I execute the action first(), Spark can optimize the job to read only the first line.
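A minimal sketch of what I mean (the file path is hypothetical and the local SparkContext setup is just for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyFirstExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-first").setMaster("local[*]"))

    // Transformations: nothing is read or computed here, only the lineage is recorded.
    val lines    = sc.textFile("/tmp/big-file.txt")   // hypothetical path
    val trimmed  = lines.map(_.trim)
    val nonEmpty = trimmed.filter(_.nonEmpty)

    // Action: triggers evaluation. Spark only reads enough of the input
    // to produce a single element, not the whole file.
    println(nonEmpty.first())

    sc.stop()
  }
}
```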

But why is the persist() operation evaluated lazily? Either way I go, eagerly or lazily, it is going to persist the entire RDD as per the storage level.

Can you please explain why persist() is a transformation instead of an action?

asked Dec 23 '15 by dinesh028


People also ask

Is Spark persist lazy?

Caching or persisting a Spark DataFrame or Dataset is a lazy operation, meaning the DataFrame will not be cached until you trigger an action.

Why is Spark lazily evaluated?

In Spark, lazy evaluation means that you can apply as many transformations as you want, but Spark will not start executing them until an action is called. So transformations are lazy but actions are eager.

What does persist() do in Spark?

The cache() and persist() functions are used to cache the intermediate results of an RDD, DataFrame, or Dataset. You can mark an RDD, DataFrame, or Dataset to be persisted by calling persist() or cache() on it.
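For example (a sketch assuming the spark-shell, where sc is predefined; the path and the chosen storage level are illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Mark the RDD for persistence with an explicit storage level.
// This only records the request; nothing is cached yet.
val words = sc.textFile("/tmp/big-file.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .persist(StorageLevel.MEMORY_AND_DISK)

// The first action both computes and caches the partitions.
println(words.count())

// Release the cached blocks once they are no longer needed.
words.unpersist()
```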

Which operation will not be lazily evaluated in Spark?

Spark is written in Scala, and Scala supports lazy evaluation, but in Spark execution is lazy by default only for transformations: operations over an RDD/DataFrame/Dataset are not computed until an action is called. Actions are therefore the operations that are not lazily evaluated.


1 Answer

For starters, eager persistence would pollute the whole pipeline. cache or persist only expresses intent. It doesn't mean we'll ever get to the point where the RDD is materialized and can actually be cached. Moreover, there are contexts where data is cached automatically.
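A quick sketch of what "expresses intent" means in practice (assuming the spark-shell's predefined sc):

```scala
// Assuming the spark-shell's predefined `sc`.
val base = sc.parallelize(1 to 1000000)

// Marked for caching, but this only records the intent.
val expensive = base.map(x => x * x).cache()   // stand-in for an expensive transformation

// This job never touches `expensive`, so it is never computed, let alone cached.
base.filter(_ % 2 == 0).count()

// Only if an action eventually uses `expensive` (or a descendant of it)
// will its partitions be materialized and stored.
```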

Either way I go, eagerly or lazily, it is going to persist the entire RDD as per the storage level.

This is not exactly true. The thing is, persist is not persistent. As clearly stated in the documentation for the MEMORY_ONLY persistence level:

If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.

With MEMORY_AND_DISK, the remaining data is stored to disk, but it can still be evicted if there is not enough memory for subsequent caching. What is even more important:

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.

You can also argue that cache / persist is semantically different from Spark actions, which are executed for their specific IO side effects. cache is more of a hint to the Spark engine that we may want to reuse this part of the computation later.
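In other words, the typical pattern looks something like this (a sketch, again assuming the spark-shell; the path and parsing are hypothetical):

```scala
// Expensive lineage we expect to reuse across several actions.
val features = sc.textFile("/tmp/training-data.csv")   // hypothetical path
  .map(_.split(",").map(_.toDouble))
  .cache()                                              // hint: we intend to reuse this

// Both actions reuse the cached partitions (where they fit in memory)
// instead of re-reading and re-parsing the input.
val rows   = features.count()
val sample = features.take(5)
```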

answered Sep 21 '22 by zero323