I am using Spark 1.3.0 with the Python API. While transforming huge DataFrames, I cache many DFs for faster execution:

df1.cache()
df2.cache()

Once a certain DataFrame is no longer needed, how can I drop it from memory (i.e. un-cache it)?

For example, df1 is used throughout the code, while df2 is only used for a few transformations and is never needed afterwards. I want to forcefully drop df2 to release more memory.
Note that DataFrame.drop() removes a column (or several columns) from a DataFrame or Dataset and returns a new DataFrame; it has nothing to do with releasing a cached DataFrame from memory. You also do not need to restart anything to clear the cache; Spark provides APIs for that, shown below.
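For completeness, a minimal sketch of column dropping (assuming a Spark 2.x+ session; the DataFrame and column names are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a", 10), (2, "b", 20)], ["id", "label", "value"])

# drop() returns a new DataFrame without the named column(s);
# it does not free any cached data of the original DataFrame.
slim = df.drop("label")               # drop a single column
slimmer = df.drop("label", "value")   # drop several columns at once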
cache() can be called on a DataFrame, Dataset, or RDD when you intend to run more than one action on it. It is lazy: it only marks the data for caching, and the data is actually stored in the memory of your cluster's workers the first time an action is computed on it.
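A minimal sketch of the intended usage (the input path and column name here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("events.json")   # hypothetical input file
df.cache()                            # lazy: nothing is stored yet

total = df.count()                                  # first action fills the cache
errors = df.filter(df["level"] == "ERROR").count()  # second action reuses the cached data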
You can check getStorageLevel().useMemory on an RDD (or, in newer Spark versions, the storageLevel property on a DataFrame) to find out whether the dataset is held in memory.
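A small sketch of how that check looks in PySpark (the DataFrame property is only available in newer releases, roughly 2.1+):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# On an RDD, getStorageLevel() reports the storage level currently set.
rdd = spark.sparkContext.parallelize(range(100)).cache()
print(rdd.getStorageLevel().useMemory)   # True while marked as cached

rdd.unpersist()
print(rdd.getStorageLevel().useMemory)   # False after unpersisting

# On a DataFrame (Spark 2.1+), the storageLevel property carries the same information.
df = spark.range(100).cache()
print(df.storageLevel.useMemory)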
Just do the following:

df1.unpersist()
df2.unpersist()
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
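A short self-contained sketch of the whole lifecycle (df2 here is just a stand-in for the DataFrame that is no longer needed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df2 = spark.range(1000)   # stand-in for the real df2
df2.cache()
df2.count()               # materialize the cache

# ... further work that reuses df2 ...

# Release df2 from executor memory once it is no longer needed.
# blocking=True waits until the blocks are actually removed.
df2.unpersist(blocking=True)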
If the DataFrame is registered as a table for SQL operations, for example:

df.createGlobalTempView(tableName)  # or some other way, depending on the Spark version

then the cache can be dropped with the following commands (of course, Spark also evicts it automatically when needed). Here spark is a SparkSession object.
With Spark 2.x and later, using the SparkSession catalog:

Drop a specific table/DF from the cache:
spark.catalog.uncacheTable(tableName)

Drop all tables/DFs from the cache:
spark.catalog.clearCache()
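Putting the Spark 2.x variant together (the table name is hypothetical; a plain temp view is used instead of a global one for brevity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df2 = spark.range(100)
df2.createOrReplaceTempView("df2_table")   # hypothetical table name
spark.catalog.cacheTable("df2_table")

spark.sql("SELECT COUNT(*) FROM df2_table").show()

spark.catalog.uncacheTable("df2_table")    # drop just this table from the cache
# spark.catalog.clearCache()               # or drop everything that is cached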
With Spark 1.x (e.g. 1.3.0), using the SQLContext:

Drop a specific table/DF from the cache:
sqlContext.uncacheTable(tableName)

Drop all tables/DFs from the cache:
sqlContext.clearCache()
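And a similar sketch for Spark 1.x, where sqlContext is an SQLContext created from the SparkContext (the names are again hypothetical):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="uncache-demo")   # hypothetical app name
sqlContext = SQLContext(sc)

df2 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df2.registerTempTable("df2_table")          # 1.x way to register a temp table
sqlContext.cacheTable("df2_table")

sqlContext.sql("SELECT COUNT(*) FROM df2_table").show()

sqlContext.uncacheTable("df2_table")        # drop just this table from the cache
# sqlContext.clearCache()                   # or drop everything that is cached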