We can persist an RDD into memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that if I call the unpersist function myself, I get slower performance.
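For reference, a minimal sketch of the pattern in question; the SparkContext sc, the input path, and the parsing logic are assumed purely for illustration:

import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext sc; the file path and parsing are made up.
val parsed = sc.textFile("hdfs:///data/events.log")
  .map(_.split(","))
  .persist(StorageLevel.MEMORY_AND_DISK)

val total = parsed.count()                               // first action materializes the cache
val errorCount = parsed.filter(_(0) == "ERROR").count()  // second action reuses the cached blocks

// Explicit release; otherwise the blocks stay until the RDD object is garbage collected.
parsed.unpersist()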
Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures. A second abstraction in Spark is shared variables that can be used in parallel operations.
RDDs can also be stored in memory by calling the persist() method, so that they can be reused across parallel operations. The only difference between cache() and persist() is that cache() always uses the default storage level, MEMORY_ONLY, while persist() lets you choose the storage level.
Caching RDDs in Spark is one mechanism to speed up applications that access the same RDD multiple times. An RDD that is neither cached nor checkpointed is re-evaluated each time an action is invoked on it. There are two function calls for caching an RDD: cache() and persist(level: StorageLevel).
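A hedged sketch of the difference between the two calls, assuming an existing SparkContext sc and made-up data:

import org.apache.spark.storage.StorageLevel

// Illustrative only: the same kind of data cached two ways.
val nums = sc.parallelize(1 to 1000000)
nums.cache()                                 // equivalent to persist(StorageLevel.MEMORY_ONLY)

val pairs = nums.map(n => (n % 10, n))
pairs.persist(StorageLevel.MEMORY_AND_DISK)  // explicit level; spills partitions to disk if memory is short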
Caching is recommended in the following situations: for RDD re-use in iterative machine learning applications; for RDD re-use in standalone Spark applications; and when RDD computation is expensive, since caching reduces the cost of recovery in case an executor fails.
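A sketch of the iterative case, assuming an existing SparkContext sc and a hypothetical input file:

import org.apache.spark.storage.StorageLevel

// Hypothetical iterative job: the input is parsed once, then reused on every pass.
val points = sc.textFile("hdfs:///data/points.csv")
  .map(_.split(",").map(_.toDouble))
  .persist(StorageLevel.MEMORY_ONLY)

var weight = 0.0
for (_ <- 1 to 20) {
  // Without persist() each pass would re-read and re-parse the input file.
  weight += 0.01 * points.map(p => p(0) * p(1)).sum()
}
points.unpersist()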
Yes, Apache Spark will unpersist the RDD when it's garbage collected.
In RDD.persist you can see:
sc.cleaner.foreach(_.registerRDDForCleanup(this))
This puts a WeakReference to the RDD in a ReferenceQueue leading to ContextCleaner.doCleanupRDD
when the RDD is garbage collected. And there:
sc.unpersistRDD(rddId, blocking)
For more context see ContextCleaner in general and the commit that added it.
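As a rough, simplified illustration of that mechanism (not Spark's actual code), the general weak-reference cleanup pattern it builds on looks like this:

import java.lang.ref.{ReferenceQueue, WeakReference}
import scala.collection.mutable

// Simplified stand-in type; Spark's real cleaner tracks RDDs, shuffles, broadcasts, etc.
class FakeRDD(val id: Int)

val queue = new ReferenceQueue[FakeRDD]()
val tracked = mutable.Map[WeakReference[FakeRDD], Int]()

def registerForCleanup(rdd: FakeRDD): Unit = {
  // Holding only a WeakReference means the registration itself never keeps the RDD alive.
  tracked(new WeakReference(rdd, queue)) = rdd.id
}

def cleanupLoop(): Unit = {
  // Spark runs the equivalent of this in a daemon thread: once the JVM garbage-collects
  // the RDD object, its WeakReference is enqueued and the cleaner unpersists the RDD.
  Iterator.continually(queue.poll()).takeWhile(_ != null).foreach { ref =>
    tracked.remove(ref.asInstanceOf[WeakReference[FakeRDD]]).foreach { id =>
      println(s"unpersisting RDD $id")  // stand-in for sc.unpersistRDD(rddId, blocking)
    }
  }
}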
A few things to be aware of when relying on garbage collection for unpersisting RDDs: