I am confused about the scoping of RDDs in Spark.
According to this thread:
Whether an RDD is cached or not is part of the mutable state of the RDD object. If you call rdd.cache it will be marked for caching from then on. It does not matter what scope you access it from.
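As I read it, that means something like the following sketch, where cache() is called inside a function on an RDD defined outside it (this is my own illustration; it assumes an existing SparkContext sc, and the names are hypothetical):

rdd = sc.parallelize(range(10))

def mark_cached():
    rdd.cache()        # mutates the state of the rdd object itself

mark_cached()
print(rdd.is_cached)   # True -- the caching flag survives outside the function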
So, if I define a function that creates a new RDD inside it, for example (Python code):
# there is an rdd called "otherRdd" outside the function
def myFun(args):
    ...
    newRdd = otherRdd.map(some_function)
    newRdd.persist()
    ...
Will the newRdd live in the global namespace, or is it only visible inside the environment of myFun? If it is only visible inside myFun, will Spark automatically unpersist the newRdd after myFun finishes execution?
Yes. newRdd is a local variable, so the name is only visible inside myFun. When the RDD object is garbage collected, Spark unpersists it, so after myFun returns, newRdd will eventually be unpersisted (assuming you do not return it or any RDD derived from it). You can also check this answer.
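To make the lifetime explicit, here is a minimal sketch (the names otherRdd, some_function and myFun come from the question; the SparkContext setup and sample data are assumptions): returning the persisted RDD keeps a live reference to it, while calling unpersist() releases the cached blocks without waiting for garbage collection.

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
otherRdd = sc.parallelize(range(100))

def some_function(x):
    return x * 2              # stand-in for the real transformation

def myFun():
    newRdd = otherRdd.map(some_function)
    newRdd.persist(StorageLevel.MEMORY_ONLY)
    newRdd.count()            # an action materializes the cached partitions
    return newRdd             # returning it keeps a reference alive

kept = myFun()
print(kept.is_cached)         # True: the returned reference is still persisted

# With no remaining reference, the driver can garbage-collect the RDD and
# Spark eventually unpersists the cached blocks. To free the storage
# deterministically, call unpersist() yourself:
kept.unpersist()
print(kept.is_cached)         # False

In practice, relying on garbage collection frees the cache at an unpredictable time, so if memory pressure matters it is safer to keep a reference to the RDD and call unpersist() explicitly when you are done with it.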