I've read a lot of questions and answers here about unpersist() on dataframes. So far I haven't found an answer to this question:
In Spark, once I am done with a dataframe, is it a good idea to call .unpersist() to manually force that dataframe to be unpersisted from memory, as opposed to waiting for GC (which is an expensive task)? In my case I am loading many dataframes so that I can perform joins and other transformations.
So, for example, if I wish to load and join 3 dataframes A, B and C: I load dataframes A and B, join these two to create X, and then .unpersist() B because I don't need it any more (but I will need A) and could use the memory to load C (which is big). Then I load C, join C to X, and .unpersist() C so I have more memory for the operations I will now perform on X and A.
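Here is a minimal sketch of that workflow in Scala (the parquet paths and the join key "id" are illustrative assumptions; note that joins are lazy, so a cached input should only be unpersisted once the results that depend on it have been materialised, e.g. by an action):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("unpersist-sketch").getOrCreate()

    val a = spark.read.parquet("/data/A").cache()
    val b = spark.read.parquet("/data/B").cache()

    // Join A and B to create X; cache and materialise it so it no longer
    // depends on B's cached blocks.
    val x = a.join(b, Seq("id")).cache()
    x.count()

    // B is no longer needed; drop its blocks to make room for the large C.
    b.unpersist()

    val c = spark.read.parquet("/data/C").cache()
    val xc = x.join(c, Seq("id")).cache()
    xc.count()

    // Done with C; free its memory for the remaining work on X and A.
    c.unpersist()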
My understanding (please correct me if I'm wrong): GC will unpersist for me eventually, but GC is an expensive task that should be avoided if possible. So, to re-phrase my question: is this an appropriate method of manually managing memory to optimise my Spark jobs, or should I just let Spark and GC do their job?
Thanks in advance :)
There seems to be some misconception here. While using unpersist is a valid approach to get better control over the storage, it doesn't avoid garbage collection: all the on-heap objects associated with the cached data will still be left to the garbage collector.

So while the operation itself is relatively cheap, the chain of events it triggers might not be. Luckily, an explicit unpersist is no worse than waiting for the automatic cleaner or a GC-triggered cleanup, so if you want to clean up specific objects, go ahead and do it.
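As a side note, Dataset.unpersist also takes a blocking flag if you want the call to wait until the blocks are actually removed. A small sketch (the DataFrame is just an illustration, and spark is an existing SparkSession):

    // A previously cached DataFrame:
    val df = spark.range(1000000L).toDF("id").cache()
    df.count()  // materialise the cached blocks

    // Non-blocking (the default in recent Spark versions): returns while
    // blocks are removed asynchronously.
    df.unpersist()

    // Blocking variant: waits until all blocks are deleted before returning.
    df.unpersist(blocking = true)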
To limit GC on unpersist, it might be worth taking a look at the OFF_HEAP StorageLevel.
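For example, a sketch of off-heap caching (the path is hypothetical, and off-heap storage must first be enabled in the Spark config):

    import org.apache.spark.storage.StorageLevel

    // Off-heap storage requires these settings (values are illustrative):
    //   spark.memory.offHeap.enabled=true
    //   spark.memory.offHeap.size=4g
    val big = spark.read.parquet("/data/C")
    big.persist(StorageLevel.OFF_HEAP)
    big.count()  // materialise the off-heap cache

    // ... work with big ...

    // Releasing off-heap blocks leaves less for the GC to do, since the
    // cached data itself was never stored as on-heap objects.
    big.unpersist()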