Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do I have to explicitly tell Spark what to cache?

In Spark, each time we do any action on an RDD, the RDD is re-computed. So if we know that the RDD is going to be reused, we should cache the RDD explicitly.

Let's say, Spark decides to lazily cache all the RDDs and uses LRU to keep the most relevant RDDs in memory automatically (which is how most caching works any way). It will be of great help for the developer as he does not have to think about caching and concentrate on the application. Also I do not see how can it negatively impact performance, as it is difficult to keep track of, how many times a variable (RDD) is used inside the program, most programmers will decide to cache most of the RDDs any way.

Caching usually happens automatically. Take the examples of either an OS/platform or a framework or a tool. But with the complexities of caching in distributed computing, I might be missing why caching cannot be automatic or the performance implications.

So I fail to understand, why I have to explicitly cache as,

  1. It looks ugly
  2. It can easily be missed
  3. It can easily be over/under used
like image 940
rakesh Avatar asked Dec 06 '15 12:12

rakesh


People also ask

What does cache () do in Spark?

cache() is an Apache Spark transformation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or RDD in the memory of your cluster's workers.

When should I cache my Spark data frame?

When to cache? If you're executing multiple actions on the same DataFrame then cache it. Every time the following line is executed (in this case 3 times), spark reads the Parquet file, and executes the query. Now, Spark will read the Parquet, execute the query only once and then cache it.

What happens when cache memory is full in Spark?

unpersist() . If the caching layer becomes full, Spark will start evicting the data from memory using the LRU (least recently used) strategy. So it is good practice to use unpersist to stay more in control about what should be evicted.


1 Answers

A subjective list of reasons:

  • in practice caching is rarely needed and is useful mostly for iterative algorithms, breaking long lineages. For example typical ETL pipelines may not require caching at all. Cache most of the RDDs is definitely not the right choice.
  • there is no universal caching strategy. Actual choice depends on available recourses like amount of memory, disks (local, remote, storage service), file system (in-memory, on-disk) and particular application.
  • on-disk persistence is expensive, in memory persistence puts more stress on a JVM and is using the most valuable resource in Spark
  • it is impossible to cache automatically without making assumptions about application semantics. In particular:

    • expected behavior when data source changes. There is no universal answer and in many situations it can be impossible to automatically track changes
    • differentiating between deterministic and non-deterministic transformations and choosing between caching and re-computing
  • comparing Spark caching to OS level caching doesn't make sense. The main goal of OS caching is to reduce latency. In Spark latency is usually not the most important factor and caching is used for other purposes like consistency, correctness and reducing stress on different parts of the system.
  • if cache doesn't use off-heap storage than caching introduces additional pressure on the garbage collector. GC cost can be actually higher than a cost of recomputing the data.
  • depending on the data and caching method reading data from cache can be significantly less efficient memory-wise.
  • Caching interferes with more advanced optimizations available in Spark SQL, effectively disabling partition pruning or predicate and projection pushdown.

It is also worth to note that:

  • removing cached data is handled automatically using LRU
  • some data (like intermediate shuffle data) is persisted automatically. I acknowledge that it makes some of the previous arguments at least partially invalid.
  • Spark caching doesn't affect system level or JVM level mechanisms
like image 180
zero323 Avatar answered Sep 18 '22 01:09

zero323