
How to make sure my DataFrame frees its memory?

I have a Spark/Scala job in which I do this:

  • 1: Compute a big DataFrame df1 and cache it in memory
  • 2: Use df1 to compute dfA
  • 3: Read raw data into df2 (again, it's big) and cache it

When performing (3), I no longer need df1, and I want to make sure its space gets freed. I cached it at (1) because it gets used in (2), and caching is the only way to make sure it is computed only once rather than recomputed on every use.
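For reference, the job looks roughly like this (paths, column names, and the transformations are placeholders, not my actual code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("cache-example").getOrCreate()

// 1: compute a big DataFrame and cache it in memory
val df1 = spark.read.parquet("/data/raw1")   // placeholder path
  .groupBy("key").count()
  .cache()

// 2: df1 is reused here; the cache ensures it is computed only once
val dfA = df1.filter(col("count") > 10)
dfA.write.parquet("/data/out/dfA")           // placeholder path

// 3: read more raw data and cache it -- df1 is no longer needed
val df2 = spark.read.parquet("/data/raw2").cache()
```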

I need to free its space and make sure it gets freed. What are my options?

I thought of these, but they don't seem to be sufficient:

  • df=null
  • df.unpersist()

Can you document your answer with a proper Spark documentation link?

asked Mar 02 '18 17:03 by belka

1 Answer

df.unpersist should be sufficient, but it won't necessarily free the memory right away. It merely marks the DataFrame's cached blocks for removal; the executors drop them asynchronously. Note that setting df = null only releases the driver-side reference for garbage collection — it does nothing to the cached blocks on the executors.

You can use df.unpersist(blocking = true), which blocks until the DataFrame's cached blocks are removed before continuing.
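Applied to the question's setup, it would look like this (df1, df2, and spark are assumed from the question; the path is a placeholder):

```scala
// Non-blocking: marks df1's cached blocks for removal; the executors
// free them asynchronously, so the memory may not be available yet.
df1.unpersist()

// Blocking: waits until all of df1's cached blocks are actually
// deleted before returning, guaranteeing the space is free here.
df1.unpersist(blocking = true)

// Safe to load and cache the next big dataset now.
val df2 = spark.read.parquet("/data/raw2").cache()
```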

answered Sep 21 '22 03:09 by puhlen