In Spark, each time we perform an action on an RDD, the RDD is re-computed from its lineage. So if we know that an RDD is going to be reused, we should cache it explicitly.
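For example, something like this (a rough sketch; the SparkContext variable sc and the file name are just placeholders):

    // Without cache(), each action rebuilds `words` from the text file.
    val words = sc.textFile("data.txt").flatMap(_.split(" "))
    words.count()            // first action: reads the file and computes `words`
    words.distinct().count() // second action: reads the file and recomputes `words`

    // With an explicit cache, later actions reuse the in-memory partitions.
    val cached = words.cache()
    cached.count()            // computes `words` once and materializes it in memory
    cached.distinct().count() // served from the cached partitions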
Suppose Spark decided to lazily cache all RDDs and used LRU to automatically keep the most relevant ones in memory (which is how most caching works anyway). That would be a great help to developers, who would no longer have to think about caching and could concentrate on the application. I also do not see how it could negatively impact performance: since it is difficult to keep track of how many times an RDD is used inside a program, most programmers will decide to cache most of their RDDs anyway.
Caching usually happens automatically; think of an OS or platform, a framework, or a tool. But given the complexities of caching in distributed computing, I may be missing why caching cannot be automatic here, or what the performance implications would be.
So I fail to understand why I have to cache explicitly.
cache() is a lazy Apache Spark operation that can be called on a DataFrame, Dataset, or RDD when you want to perform more than one action on it. It marks the DataFrame, Dataset, or RDD to be stored in the memory of your cluster's workers once it has been computed.
When to cache? If you're executing multiple actions on the same DataFrame, cache it. Without caching, every action on the DataFrame (three of them in the sketch below) makes Spark read the Parquet file and execute the query again. With cache(), Spark reads the Parquet file and executes the query only once, and the subsequent actions reuse the cached result.
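For example (a sketch only; the SparkSession variable spark, the Parquet path, and the column name are assumptions made for illustration):

    import org.apache.spark.sql.functions.col

    val df = spark.read.parquet("/data/events.parquet")
      .filter(col("status") === "active")

    // Without cache(), each of these three actions re-reads the Parquet file
    // and re-runs the filter.
    df.count()
    df.show()
    df.write.json("/tmp/active-events")

    // With cache(), the file is read and the query executed only once;
    // the later actions reuse the cached result.
    val cachedDf = df.cache()
    cachedDf.count()
    cachedDf.show()
    cachedDf.write.json("/tmp/active-events-cached")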
unpersist(): if the caching layer becomes full, Spark starts evicting data from memory using an LRU (least recently used) strategy. It is therefore good practice to call unpersist() on data you no longer need, so you stay in control of what gets evicted.
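Continuing the sketch above:

    // Release the cached data explicitly once it is no longer needed,
    // instead of waiting for Spark's LRU eviction to reclaim the memory.
    cachedDf.unpersist()
    // cachedDf.unpersist(blocking = true) // optionally wait until the blocks are removed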
A subjective list of reasons: it is impossible to cache automatically without making assumptions about application semantics. In particular, Spark cannot know which RDDs will be reused, or whether recomputing one would be cheaper than holding it in memory. It is also worth noting that caching is not free: it consumes executor memory, may spill to disk or evict other cached data, and adds serialization overhead, so caching everything by default could easily hurt performance instead of helping it, as the sketch below illustrates.
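For instance (a contrived sketch; the data, the parser, and the relative sizes are made up purely to illustrate the point):

    case class LogLine(level: String, message: String)

    // Hypothetical parser, just for illustration.
    def parseLine(s: String): LogLine = LogLine(s.split(" ")(0), s)

    val parsed = sc.textFile("huge-logs.txt").map(parseLine)  // very large, used exactly once
    val errors = parsed.filter(_.level == "ERROR")            // small, reused many times

    // Only the programmer knows that `errors` is worth caching and `parsed` is not;
    // caching everything automatically would hold the huge `parsed` RDD in memory
    // for no benefit, potentially evicting data that actually gets reused.
    errors.cache()
    errors.count()
    errors.take(10)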