I am not caching or persisting the Spark DataFrame. If I have to do many additional operations in the same session, aggregating and modifying the DataFrame's contents as part of the process, when and how is the initial DataFrame released from memory?
Example:
I load a DataFrame, DF1, with 10 million records. Then I apply a transformation to it, which creates a new DataFrame, DF2. After that there is a series of 10 steps I perform on DF2. Throughout all of this, I no longer need DF1. How can I be sure that DF1 no longer exists in memory and isn't hampering performance? Is there any way to remove DF1 from memory directly, or does DF1 get removed automatically on a Least Recently Used (LRU) basis?
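For illustration, the workflow looks roughly like this (a hypothetical PySpark sketch; the paths and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Hypothetical source with ~10 million records
df1 = spark.read.parquet("/data/source")

# Transformation that derives DF2 from DF1
df2 = df1.withColumn("amount", F.col("amount") * 2)

# ... a series of further steps performed only on df2, e.g. aggregations
result = df2.groupBy("category").agg(F.sum("amount").alias("total"))
result.write.parquet("/data/output")

# df1 is never referenced again after df2 is derived from it
```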
That's not how Spark works. DataFrames are lazy: the only things kept in memory are the structure (the schema) and the list of transformations you have applied to your DataFrames. The data themselves are not stored in memory (unless you cache a DataFrame and then trigger an action on it).
Therefore, the scenario you describe is not actually a problem: DF1 does not occupy executor memory just because you created it.
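If you do choose to cache a DataFrame at some point, you can release it explicitly with `unpersist()`. A minimal sketch of both points, lazy evaluation and explicit release of a cached DataFrame (the path and filter are illustrative assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

df1 = spark.read.parquet("/data/source")   # lazy: nothing is read yet
df2 = df1.filter("amount > 0")             # still lazy: only the plan is recorded

# Data only occupies memory if you cache AND trigger an action:
df1.cache()
df1.count()        # action that materializes df1 in the storage layer

# When the cached data is no longer needed, release it explicitly:
df1.unpersist()    # drops df1's cached blocks from executor memory/disk
```

Without the `cache()`/`count()` pair, `df1` is just a query plan, so there is nothing to remove from memory in the first place.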