In Spark, each time we perform an action on an RDD, the RDD is re-computed from its lineage. So if we know that an RDD is going to be reused, we should cache it explicitly.
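For example, something like this (a rough sketch; the SparkContext variable sc and the file name are just placeholders):

    // Without cache(), each action rebuilds `words` from the text file.
    val words = sc.textFile("data.txt").flatMap(_.split(" "))
    words.count()            // first action: reads the file and computes `words`
    words.distinct().count() // second action: reads the file and recomputes `words`

    // With an explicit cache, later actions reuse the in-memory partitions.
    val cached = words.cache()
    cached.count()            // computes `words` once and materializes it in memory
    cached.distinct().count() // served from the cached partitions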
Suppose Spark decided to lazily cache all RDDs and used LRU to automatically keep the most relevant ones in memory (which is how most caching works anyway). That would be a great help to developers, who would no longer have to think about caching and could concentrate on the application. I also do not see how it could negatively impact performance: since it is difficult to keep track of how many times an RDD is used inside a program, most programmers will decide to cache most of their RDDs anyway.
Caching usually happens automatically; think of an OS or platform, a framework, or a tool. But given the complexities of caching in distributed computing, I may be missing why caching cannot be automatic here, or what the performance implications would be.
So I fail to understand why I have to cache explicitly.
cache() is a lazy Apache Spark operation that can be called on a DataFrame, Dataset, or RDD when you want to perform more than one action on it. It marks the DataFrame, Dataset, or RDD to be stored in the memory of your cluster's workers once it has been computed.
When to cache? If you're executing multiple actions on the same DataFrame, cache it. Without caching, every action on the DataFrame (three of them in the sketch below) makes Spark read the Parquet file and execute the query again. With cache(), Spark reads the Parquet file and executes the query only once, and the subsequent actions reuse the cached result.
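For example (a sketch only; the SparkSession variable spark, the Parquet path, and the column name are assumptions made for illustration):

    import org.apache.spark.sql.functions.col

    val df = spark.read.parquet("/data/events.parquet")
      .filter(col("status") === "active")

    // Without cache(), each of these three actions re-reads the Parquet file
    // and re-runs the filter.
    df.count()
    df.show()
    df.write.json("/tmp/active-events")

    // With cache(), the file is read and the query executed only once;
    // the later actions reuse the cached result.
    val cachedDf = df.cache()
    cachedDf.count()
    cachedDf.show()
    cachedDf.write.json("/tmp/active-events-cached")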
unpersist(): if the caching layer becomes full, Spark starts evicting data from memory using an LRU (least recently used) strategy. It is therefore good practice to call unpersist() on data you no longer need, so you stay in control of what gets evicted.
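Continuing the sketch above:

    // Release the cached data explicitly once it is no longer needed,
    // instead of waiting for Spark's LRU eviction to reclaim the memory.
    cachedDf.unpersist()
    // cachedDf.unpersist(blocking = true) // optionally wait until the blocks are removed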
A subjective list of reasons: it is impossible to cache automatically without making assumptions about application semantics. In particular, Spark cannot know which RDDs will be reused, or whether recomputing one would be cheaper than holding it in memory. It is also worth noting that caching is not free: it consumes executor memory, may spill to disk or evict other cached data, and adds serialization overhead, so caching everything by default could easily hurt performance instead of helping it, as the sketch below illustrates.
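For instance (a contrived sketch; the data, the parser, and the relative sizes are made up purely to illustrate the point):

    case class LogLine(level: String, message: String)

    // Hypothetical parser, just for illustration.
    def parseLine(s: String): LogLine = LogLine(s.split(" ")(0), s)

    val parsed = sc.textFile("huge-logs.txt").map(parseLine)  // very large, used exactly once
    val errors = parsed.filter(_.level == "ERROR")            // small, reused many times

    // Only the programmer knows that `errors` is worth caching and `parsed` is not;
    // caching everything automatically would hold the huge `parsed` RDD in memory
    // for no benefit, potentially evicting data that actually gets reused.
    errors.cache()
    errors.count()
    errors.take(10)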