Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

We can persist an RDD into memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark does some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that If I call unpersist function myself, I get slower performance.

like image 696
MetallicPriest Avatar asked Sep 17 '15 17:09

MetallicPriest


People also ask

Does Spark reuse RDDs?

Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures. A second abstraction in Spark is shared variables that can be used in parallel operations.

Is RDD stored in memory?

The RDDs can also be stored in-memory while we use persist() method. Also, we can use it across parallel operations. There is only one difference between cache() and persist(). while using cache() the default storage level is MEMORY_ONLY.

What does it mean to cache an RDD?

Caching RDDs in Spark It is one mechanism to speed up applications that access the same RDD multiple times. An RDD that is not cached, nor checkpointed, is re-evaluated again each time an action is invoked on that RDD. There are two function calls for caching an RDD: cache() and persist(level: StorageLevel).

When should I cache RDD?

Caching is recommended in the following situations: For RDD re-use in iterative machine learning applications. For RDD re-use in standalone Spark applications. When RDD computation is expensive, caching can help in reducing the cost of recovery in the case one executor fails.


1 Answers

Yes, Apache Spark will unpersist the RDD when it's garbage collected.

In RDD.persist you can see:

sc.cleaner.foreach(_.registerRDDForCleanup(this))

This puts a WeakReference to the RDD in a ReferenceQueue leading to ContextCleaner.doCleanupRDD when the RDD is garbage collected. And there:

sc.unpersistRDD(rddId, blocking)

For more context see ContextCleaner in general and the commit that added it.

A few things to be aware of when relying on garbage collection for unperisting RDDs:

  • The RDDs use resources on the executors, and the garbage collection happens on the driver. The RDD will not be automatically unpersisted until there is enough memory pressure on the driver, no matter how full the disk/memory of the executors gets.
  • You cannot unpersist part of an RDD (some partitions/records). If you build one persisted RDD from another, both will have to fit entirely on the executors at the same time.
like image 115
Daniel Darabos Avatar answered Sep 20 '22 13:09

Daniel Darabos