
Scope of Spark's `persist` or `cache`

I am confused about RDD scoping in Spark.

According to this thread:

Whether an RDD is cached or not is part of the mutable state of the RDD object. If you call rdd.cache it will be marked for caching from then on. It does not matter what scope you access it from.

So, if I define a function that creates a new RDD inside it, for example (Python code):

# there is an rdd called "otherRdd" outside the function

def myFun(args):
    ...
    newRdd = otherRdd.map(some_function)
    newRdd.persist()
    ...

Will newRdd live in the global namespace, or is it only visible inside the scope of myFun?

If it is only visible inside myFun, will Spark automatically unpersist newRdd after myFun finishes execution?

asked Jul 10 '16 by panc

1 Answer

Yes, when an RDD is garbage collected, it is also unpersisted. So outside of myFun, newRdd will be unpersisted (assuming you do not return it or an RDD derived from it). You can also check this answer. If you do want the cached data to outlive the function call, return the RDD and unpersist it explicitly when you are done, as in the sketch below.
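
A minimal sketch of that pattern, assuming a local SparkContext and using stand-ins for otherRdd and some_function from the question (those names and the setup here are illustrative, not from the original post):

from pyspark import SparkContext

sc = SparkContext("local[*]", "persist-scope-demo")
otherRdd = sc.parallelize(range(10))       # stand-in for the RDD defined outside the function
some_function = lambda x: x * 2            # stand-in for the mapping function

def myFun(args):
    newRdd = otherRdd.map(some_function)
    newRdd.persist()    # marked for caching; materialized on the first action
    newRdd.count()      # action that actually populates the cache
    return newRdd       # returning it keeps a reference alive, so it is not
                        # garbage collected (and not unpersisted) after myFun returns

result = myFun(None)
print(result.is_cached)   # True: the cache survives because we still hold a reference
result.unpersist()        # explicit cleanup instead of waiting for garbage collection

If myFun did not return newRdd, the last reference would disappear when the function returned, and Spark's context cleaner would eventually unpersist the cached partitions once the RDD object is garbage collected.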

answered Oct 10 '22 by geoalgo