Spark: unpersist RDDs for which I have lost the reference

How can I unpersist RDDs that were generated in an MLlib model and for which I don't have a reference?

I know that in PySpark you can unpersist all DataFrames with sqlContext.clearCache(). Is there something similar for RDDs in the Scala API? Furthermore, is there a way to unpersist only some RDDs without having to unpersist all of them?
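(For reference, the same cache-clearing call exists in the Scala API, although it only covers cached DataFrames and tables, not RDDs persisted directly. A minimal sketch, assuming a SparkSession named spark:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clear-cache-example").getOrCreate()

// Clears every cached table/DataFrame, the Scala counterpart of
// sqlContext.clearCache() in PySpark. RDDs persisted directly are
// not affected, which is exactly the gap the question is about.
spark.sqlContext.clearCache()
// or equivalently: spark.catalog.clearCache()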

germanium asked Feb 06 '17 16:02

People also ask

What does Unpersist do in PySpark?

Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.
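In the Scala API the call looks the same; a minimal sketch (the sample data and column names are invented for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unpersist-example").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.cache()      // mark the DataFrame for caching
df.count()      // materialize the cache
df.unpersist()  // drop its blocks from memory and disk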

What is Unpersist?

(transitive, computing) To remove from permanent storage; to make temporary again.

How do I remove RDD from PySpark?

A call to gc.collect() also usually works. Almost. You should remove the last reference to it (i.e. del thisRDD), and then, if you really need the RDD to be unpersisted immediately, call gc.collect().

How many ways can you create RDD in Spark?

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
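Both approaches look like this in the Scala API (the HDFS path is only a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-creation-example"))

// 1. Parallelize an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in external storage (placeholder path)
val fromFile = sc.textFile("hdfs:///data/input.txt")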


1 Answer

You can call

val rdds = sparkContext.getPersistentRDDs // result is Map[Int, RDD[_]]

and then filter the values to get the ones you want (1):

rdds.filter { case (_, rdd) => filterLogic(rdd) }.foreach { case (_, rdd) => rdd.unpersist() }

(1) - written by hand, without a compiler - sorry if there are any errors, but there shouldn't be ;)
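Put together, a self-contained sketch of this approach (the setName-based filter is just one possible filterLogic, and the RDD names are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("selective-unpersist").setMaster("local[*]"))

// Simulate RDDs cached somewhere out of reach, e.g. inside an MLlib model
sc.parallelize(1 to 100).setName("model_intermediate").cache().count()
sc.parallelize(1 to 100).setName("keep_me").cache().count()

// Look up every persisted RDD by its id, no external reference needed
val persisted = sc.getPersistentRDDs // Map[Int, RDD[_]]

// Unpersist only the ones matching some condition, here a name prefix
persisted
  .filter { case (_, rdd) => Option(rdd.name).exists(_.startsWith("model_")) }
  .foreach { case (_, rdd) => rdd.unpersist() }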

T. Gawęda answered Oct 10 '22 23:10