 

Un-persisting all dataframes in (py)spark

I have a Spark application with several points where I would like to persist the current state, usually after a large step or when caching a state that I would like to use multiple times. It appears that when I call cache on my dataframe a second time, a new copy is cached to memory. In my application, this leads to memory issues when scaling up: even though any given dataframe is at most about 100 MB in my current tests, the cumulative size of the intermediate results grows beyond the allotted memory on the executor. See below for a small example that shows this behavior.

cache_test.py:

from pyspark import SparkContext
from pyspark.sql import HiveContext

spark_context = SparkContext(appName='cache_test')
hive_context = HiveContext(spark_context)

df = (hive_context.read
      .format('com.databricks.spark.csv')
      .load('simple_data.csv')
     )
df.cache()
df.show()

df = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.cache()
df.show()

spark_context.stop()

simple_data.csv:

1,2,3
4,5,6
7,8,9

Looking at the application UI, there is a copy of the original dataframe in addition to the one with the new column. I can remove the original copy by calling df.unpersist() before the withColumn line. Is this the recommended way to remove cached intermediate results (i.e. call unpersist before every cache())?
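
For reference, that workaround looks roughly like this (a minimal sketch, reusing the df from cache_test.py above):

df.cache()
df.show()

df.unpersist()  # drop the original cached copy before deriving the new dataframe
df = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.cache()      # only the new dataframe remains cached
df.show()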

Also, is it possible to purge all cached objects? In my application, there are natural breakpoints where I can simply purge all memory and move on to the next file. I would like to do this without creating a new Spark application for each input file.

Thank you in advance!

asked Apr 28 '16 by bjack3


3 Answers

Spark 2.x

You can use Catalog.clearCache:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
...
spark.catalog.clearCache()

Spark 1.x

You can use the SQLContext.clearCache method, which

Removes all cached tables from the in-memory cache.

from pyspark.sql import SQLContext
from pyspark import SparkContext

sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
...
sqlContext.clearCache()
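
Applied to the per-file breakpoints described in the question, a minimal Spark 2.x sketch could look like the following (the file names and the intermediate steps are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('cache_test').getOrCreate()

for path in ['file_1.csv', 'file_2.csv']:  # hypothetical input files
    df = spark.read.format('csv').load(path)
    df.cache()
    df.show()
    # ... further steps that cache intermediate dataframes ...
    spark.catalog.clearCache()  # purge everything cached for this file before the next one

spark.stop()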
answered Oct 16 '22 by zero323

We use this quite often

for (id, rdd) in sc._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(id))

where sc is a SparkContext variable.
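
If you want to reuse it, a small helper along these lines (the function name is just illustrative) can wrap the loop:

def unpersist_all(sc):
    """Unpersist every RDD the given SparkContext is still tracking."""
    for (rdd_id, rdd) in sc._jsc.getPersistentRDDs().items():
        rdd.unpersist()
        print("Unpersisted rdd {}".format(rdd_id))

unpersist_all(spark_context)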

answered Oct 17 '22 by Tagar


Calling cache() on a dataframe is lazy: the data is only materialized in memory when you perform an action on it, such as count() or show().

In your case, the show() after the first cache() is what actually caches the dataframe in memory. You then transform the dataframe to add the extra column, cache the new dataframe, and call show() again, which caches the second dataframe as well. If there is only enough memory to hold one dataframe, caching the second one will evict the first from memory, since there is not enough space to hold both.

Thing to keep in mind: do not cache a dataframe unless you use it in more than one action; otherwise caching only adds overhead, since caching is itself a fairly expensive operation.
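
A minimal sketch of that advice, assuming a dataframe df that is reused by more than one action:

df.cache()            # worth caching: df is reused by the two actions below
row_count = df.count()
df.show()
df.unpersist()        # release the cached blocks once df is no longer needed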

answered Oct 16 '22 by Nikunj Kakadiya