I am not caching or persisting the Spark DataFrame. If I have to do many additional operations in the same session, aggregating and modifying the DataFrame's contents as part of the process, when and how is the initial DataFrame released from memory?
Example:
I load a DataFrame, DF1, with 10 million records. Then I apply a transformation to it, which creates a new DataFrame, DF2. After that there is a series of 10 steps I perform on DF2. Throughout all of this, I no longer need DF1. How can I be sure that DF1 no longer exists in memory and isn't hampering performance? Is there any way to remove DF1 from memory directly, or does DF1 get removed automatically on a Least Recently Used (LRU) basis?
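For illustration, the workflow looks roughly like this (a hypothetical PySpark sketch; the paths and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Hypothetical source with ~10 million records
df1 = spark.read.parquet("/data/source")

# Transformation that derives DF2 from DF1
df2 = df1.withColumn("amount", F.col("amount") * 2)

# ... a series of further steps performed only on df2, e.g. aggregations
result = df2.groupBy("category").agg(F.sum("amount").alias("total"))
result.write.parquet("/data/output")

# df1 is never referenced again after df2 is derived from it
```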
That's not how Spark works. DataFrames are lazy: the only things kept in memory are the structure (the schema) and the list of transformations you have applied to your DataFrames. The data themselves are not stored in memory (unless you cache a DataFrame and then trigger an action on it).
Therefore, the scenario you describe is not actually a problem: DF1 does not occupy executor memory just because you created it.
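If you do choose to cache a DataFrame at some point, you can release it explicitly with `unpersist()`. A minimal sketch of both points, lazy evaluation and explicit release of a cached DataFrame (the path and filter are illustrative assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

df1 = spark.read.parquet("/data/source")   # lazy: nothing is read yet
df2 = df1.filter("amount > 0")             # still lazy: only the plan is recorded

# Data only occupies memory if you cache AND trigger an action:
df1.cache()
df1.count()        # action that materializes df1 in the storage layer

# When the cached data is no longer needed, release it explicitly:
df1.unpersist()    # drops df1's cached blocks from executor memory/disk
```

Without the `cache()`/`count()` pair, `df1` is just a query plan, so there is nothing to remove from memory in the first place.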