Is manually managing memory with .unpersist() a good idea?

I've read a lot of questions and answers here about unpersist() on dataframes, but so far I haven't found an answer to this question:

In Spark, once I am done with a dataframe, is it a good idea to call .unpersist() to manually force that dataframe to be unpersisted from memory, as opposed to waiting for GC (which is an expensive task)? In my case I am loading many dataframes so that I can perform joins and other transformations.

So, for example, if I wish to load and join 3 dataframes A, B and C: I load dataframes A and B, join these two to create X, and then call .unpersist() on B because I don't need it any more (but I will need A), and could use that memory to load C (which is big). So then I load C, join C to X, and call .unpersist() on C so I have more memory for the operations I will now perform on X and A.
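Roughly, in PySpark terms, the pattern I have in mind looks like this (the paths, join key and count() calls are just illustrative):

    # Hypothetical sketch of the load / join / unpersist workflow described above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unpersist-example").getOrCreate()

    a = spark.read.parquet("/data/a").cache()
    b = spark.read.parquet("/data/b").cache()

    x = a.join(b, "id").cache()
    x.count()          # materialise X so it no longer needs B's cached blocks
    b.unpersist()      # done with B; free its memory before loading C

    c = spark.read.parquet("/data/c").cache()
    x2 = x.join(c, "id").cache()
    x2.count()         # materialise before dropping C
    c.unpersist()      # done with C; carry on with X2 and A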

I understand that GC will unpersist for me eventually, but I also understand that GC is an expensive task that should be avoided if possible. To re-phrase my question: is this an appropriate method of manually managing memory, to optimise my Spark jobs?

My understanding (please correct if wrong):

  • I understand that .unpersist() is a very cheap operation.
  • I understand that GC calls .unpersist() on my dataframes eventually anyway.
  • I understand that Spark monitors the cache and drops entries based on a Least Recently Used (LRU) policy. But in some cases I do not want the least recently used DF to be dropped, so instead I can call .unpersist() on the dataframes I know I will not need in future, so that I don't drop the DFs I will need and have to reload them later.

To re-phrase my question again if unclear: is this an appropriate use of .unpersist(), or should I just let Spark and GC do their job?

Thanks in advance :)

asked Nov 16 '17 by Dan Carter

1 Answer

There seems to be some misconception here. While using unpersist is a valid approach to get better control over storage, it doesn't avoid garbage collection. In fact, all the on-heap objects associated with the cached data will be left to the garbage collector.

So while the operation itself is relatively cheap, the chain of events it triggers might not be. Luckily, an explicit unpersist is no worse than waiting for the automatic cleaner or a GC-triggered cleaner, so if you want to clean up specific objects, go ahead and do it.
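For what it's worth, DataFrame.unpersist also takes a blocking flag, so you can choose whether the call should wait for the cached blocks to actually be dropped. A minimal sketch (df is whatever dataframe you cached; the default for blocking has varied between Spark versions, so treat that as an assumption):

    # Sketch only: non-blocking vs. blocking unpersist on an already cached DataFrame.
    df.unpersist()               # returns immediately; cached blocks are removed asynchronously
    df.unpersist(blocking=True)  # waits until the cached blocks have actually been removed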

To limit the garbage collection triggered by unpersist, it might be worth taking a look at the OFF_HEAP StorageLevel.
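A minimal sketch of what that could look like, assuming off-heap memory has been enabled in the Spark configuration (the size and path below are placeholders):

    # Sketch: cache a DataFrame off-heap instead of on the JVM heap.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("off-heap-cache")
        .config("spark.memory.offHeap.enabled", "true")   # off-heap must be enabled explicitly
        .config("spark.memory.offHeap.size", "2g")        # illustrative size
        .getOrCreate()
    )

    df = spark.read.parquet("/data/a")      # hypothetical input
    df.persist(StorageLevel.OFF_HEAP)       # cached blocks live outside the JVM heap
    df.count()                              # materialise the cache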

answered Sep 24 '22 by user8952110