Understanding Spark's caching

Tags:

apache-spark

I'm trying to understand how Spark's cache works.

Here is my naive understanding, please let me know if I'm missing something:

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.saveAsTextFile("...")
    rdd3.saveAsTextFile("...")

In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when rdd2 is saved, I assume), and then read from the cache (assuming there is enough RAM) when rdd3 is saved.

Now here is my question. Let's say I want to cache rdd2 and rdd3 as they will both be used later on, but I don't need rdd1 after creating them.

Basically there is duplication, isn't there? Since I don't need rdd1 once rdd2 and rdd3 are calculated, I should probably unpersist it, right? The question is when.

Will this work? (Option A)

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.cache()
    rdd3.cache()
    rdd1.unpersist()

Does Spark add the unpersist call to the DAG, or is it done immediately? If it's done immediately, then rdd1 will basically not be cached when I read from rdd2 and rdd3, right?

Should I do it this way instead (Option B)?

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)

    rdd2.cache()
    rdd3.cache()

    rdd2.saveAsTextFile("...")
    rdd3.saveAsTextFile("...")

    rdd1.unpersist()

So the question is this: Is Option A good enough? i.e. will rdd1 still load the file only once? Or do I need to go with Option B?


asked Apr 27 '15 18:04

2 Answers

It would seem that Option B is required. The reason is related to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without executing anything, in Option A, by the time you call unpersist, you still only have job descriptions and not a running execution.

This is relevant because a cache or persist call just adds the RDD to a Map of RDDs that have marked themselves to be persisted during job execution. However, unpersist directly tells the blockManager to evict the RDD from storage and removes the reference from the Map of persistent RDDs.

persist function

unpersist function

So you would need to call unpersist after Spark actually executed and stored the RDD with the block manager.
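To make the ordering argument concrete, here is a small toy model in plain Scala. It is not Spark code and does not reflect Spark's internals: `Lazy`, `source`, `sourceReads`, `optionA` and `optionB` are all hypothetical names invented for this sketch. It only assumes the two behaviours described above, namely that `cache()` merely marks a dataset and that `unpersist()` takes effect immediately, and it counts how often the "file" is read under each option.

```scala
// Toy model only: these are NOT Spark's classes, just a sketch of the idea.
object CacheModel {
  var sourceReads = 0 // how many times the pretend "file" has been read

  // A lazily evaluated dataset with an optional cache, loosely mimicking an RDD.
  final class Lazy[A](compute: () => Seq[A]) {
    private var marked = false                 // set by cache(); nothing is stored yet
    private var stored: Option[Seq[A]] = None  // filled in when an action runs

    def cache(): this.type = { marked = true; this }
    // Takes effect immediately: clears the mark and drops any stored data.
    def unpersist(): this.type = { marked = false; stored = None; this }

    // Transformations only describe work; nothing is computed here.
    def map[B](f: A => B): Lazy[B] = new Lazy(() => collect().map(f))
    def filter(p: A => Boolean): Lazy[A] = new Lazy(() => collect().filter(p))

    // The "action": evaluates the lineage and stores the result only if marked.
    def collect(): Seq[A] = stored.getOrElse {
      val data = compute()
      if (marked) stored = Some(data)
      data
    }
  }

  // Stand-in for sc.textFile: every evaluation counts as one read of the source.
  def source(): Lazy[Int] = new Lazy(() => { sourceReads += 1; Seq(1, 2, 3, 4) })

  // Option A: unpersist before any action has run.
  def optionA(): Int = {
    sourceReads = 0
    val rdd1 = source().cache()
    val rdd2 = rdd1.filter(_ % 2 == 0).cache()
    val rdd3 = rdd1.map(_ * 10).cache()
    rdd1.unpersist()
    rdd2.collect(); rdd3.collect()
    sourceReads
  }

  // Option B: run the actions first, then unpersist.
  def optionB(): Int = {
    sourceReads = 0
    val rdd1 = source().cache()
    val rdd2 = rdd1.filter(_ % 2 == 0).cache()
    val rdd3 = rdd1.map(_ * 10).cache()
    rdd2.collect(); rdd3.collect()
    rdd1.unpersist()
    sourceReads
  }
}
```

In this model, `CacheModel.optionA()` reads the source once per action because rdd1 is no longer marked when the actions run, while `CacheModel.optionB()` reads it a single time. Whether real Spark behaves the same way in your version is worth verifying, for example by checking the Storage tab of the Spark UI.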

The comments for the RDD.persist method hint towards this: rdd.persist

answered Oct 02 '22 18:10

Rich

In Option A, you have not shown when you are calling the action (the call to save).

    val rdd1 = sc.textFile("some data")
    rdd1.cache() // marks rdd1 as cached
    val rdd2 = rdd1.filter(...)
    val rdd3 = rdd1.map(...)
    rdd2.cache()
    rdd3.cache()
    rdd1.unpersist()
    rdd2.saveAsTextFile("...")
    rdd3.saveAsTextFile("...")

If the sequence is as above, Option A should use the cached version of rdd1 for computing both rdd2 and rdd3.

answered Oct 02 '22 17:10

ayan guha
