
Spark keeps removing RDDs

Tags: apache-spark

I've stopped feeding data to Spark, but I can still see Spark removing RDDs, as follows:

15/07/30 10:03:10 INFO BlockManager: Removing RDD 136661
15/07/30 10:03:10 INFO BlockManager: Removing RDD 136662
15/07/30 10:03:10 INFO BlockManager: Removing RDD 136664
15/07/30 10:03:10 INFO BlockManager: Removing RDD 136663

I'm confused about why Spark keeps removing RDDs even though no new data is arriving and no new RDDs are being generated.

Jack asked Jun 07 '26 16:06


1 Answer

As you are probably aware, Spark evicts persisted RDDs using an LRU policy. Even though you're not adding more data, Spark may well be removing those RDDs because they've gone out of scope in the Spark application (job) or are simply "too old."
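To make the LRU idea concrete, here is a minimal sketch in plain Python. It is a toy model only, not Spark's memory store: the class name ToyLruCache, the capacity counted in number of blocks, and the method names are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyLruCache:
    """Toy LRU store keyed by RDD id (hypothetical; not Spark code)."""

    def __init__(self, capacity):
        self.capacity = capacity      # max number of blocks kept (assumption)
        self.blocks = OrderedDict()   # oldest access first, newest last

    def get(self, rdd_id):
        """Return the cached data and mark it as most recently used."""
        if rdd_id not in self.blocks:
            return None
        self.blocks.move_to_end(rdd_id)
        return self.blocks[rdd_id]

    def put(self, rdd_id, data):
        """Store a block and return the ids evicted to make room."""
        evicted = []
        self.blocks[rdd_id] = data
        self.blocks.move_to_end(rdd_id)
        while len(self.blocks) > self.capacity:
            old_id, _ = self.blocks.popitem(last=False)  # least recently used
            evicted.append(old_id)
        return evicted
```

Note that in this model eviction only happens when something new is stored, so LRU pressure alone would not explain removals while no data is coming in.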

The lifecycle of cached RDDs is managed by TimeStampedWeakValueHashMap. Basically, if the timestamp of an RDD is older than a particular threshold, the RDD will be deleted when clearOldValues() is called.
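The timestamp-threshold cleanup described above can be sketched roughly like this. It is a simplified Python model, not the actual Spark class: ToyTimestampedMap and its method names are hypothetical, and the weak-reference behaviour of the real map is left out entirely.

```python
import time


class ToyTimestampedMap:
    """Toy map that remembers when each entry was inserted (not Spark code)."""

    def __init__(self):
        self._entries = {}  # key -> (value, insert_time_in_seconds)

    def put(self, key, value, now=None):
        # 'now' can be passed explicitly so the behaviour is easy to test.
        timestamp = time.time() if now is None else now
        self._entries[key] = (value, timestamp)

    def get(self, key):
        entry = self._entries.get(key)
        return None if entry is None else entry[0]

    def clear_old_values(self, thresh_time):
        """Drop every entry inserted before thresh_time; return removed keys."""
        removed = [k for k, (_, ts) in self._entries.items() if ts < thresh_time]
        for k in removed:
            del self._entries[k]
        return removed


rdds = ToyTimestampedMap()
rdds.put(136661, "old RDD", now=100.0)
rdds.put(136662, "newer RDD", now=200.0)
removed_ids = rdds.clear_old_values(150.0)  # only entries older than 150.0 go
```

The point is that the cleanup is driven by age, not by new input: a periodic call with a moving threshold keeps removing entries even when nothing new is being added.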

Your question implies you'd like to make sure these RDDs are not deleted, so you might want to look at persisting your Spark data directly into Cassandra, since the two work so well together.

devonlazarus answered Jun 10 '26 09:06