 

Drop spark dataframe from cache

I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution:

df1.cache()
df2.cache()

Once a certain dataframe is no longer needed, how can I drop it from memory (i.e., un-cache it)?

For example, df1 is used throughout the code, while df2 is utilized only for a few transformations and is never needed after that. I want to forcefully drop df2 to release more memory.

asked Aug 26 '15 by ankit patel

People also ask

How do I drop a DataFrame in spark?

The Spark DataFrame provides the drop() method to drop a column or field from a DataFrame or Dataset; it can also remove multiple columns at once. Note that drop() removes columns, not cached data — use unpersist() to release a cached DataFrame.

How do I clear my spark cache?

You can clear all cached tables with spark.catalog.clearCache() (or sqlContext.clearCache() on Spark 1.x), and un-cache a single DataFrame with its unpersist() method; rebooting the machine is not required.

What does cache () do in PySpark?

cache() is a lazy Apache Spark operation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action on it. cache() stores the specified DataFrame, Dataset, or RDD in the memory of your cluster's workers the first time it is computed.

How do I know if a data frame is cached?

You can check df.storageLevel.useMemory on a DataFrame (or rdd.getStorageLevel().useMemory on an RDD) to find out whether the dataset is held in memory.


2 Answers

just do the following:

df1.unpersist()
df2.unpersist()

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

answered Sep 25 '22 by Alexander


If the dataframe is registered as a table for SQL operations, like

df.createGlobalTempView(tableName)  # or some other way, depending on your Spark version

then the cache can be dropped with the following commands (of course, Spark also does this automatically):

Spark >= 2.x

Here, spark refers to the SparkSession object

  • Drop a specific table/df from cache

     spark.catalog.uncacheTable(tableName) 
  • Drop all tables/dfs from cache

     spark.catalog.clearCache() 

Spark <= 1.6.x

  • Drop a specific table/df from cache

     sqlContext.uncacheTable(tableName) 
  • Drop all tables/dfs from cache

     sqlContext.clearCache() 
answered Sep 23 '22 by mrsrinivas