How far will Spark RDD cache go?

Question

Say I have three RDD transformation functions called on rdd1:

val rdd2 = rdd1.f1
val rdd3 = rdd2.f2
val rdd4 = rdd3.f3

Now I want to cache rdd4, so I call rdd4.cache().

My question:

Will only the result from the action on rdd4 be cached or will every RDD above rdd4 be cached? Say I want to cache both rdd3 and rdd4, do I need to cache them separately?

aaronman · Accepted Answer

The whole idea of cache is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, only that RDD's results are kept in memory; the RDDs above it are not. So yes, you do need to cache them separately. Keep in mind, though, that you only need to cache an RDD if you are going to use it more than once, for example:

rdd4.cache()
val v1 = rdd4.lookup("key1")
val v2 = rdd4.lookup("key2")

If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices they made regarding how RDDs work.
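To make the "cache them separately" point concrete, here is a minimal sketch that can be run locally. The transformations standing in for f1, f2 and f3 (mapValues, filter, mapValues), the sample data, the object name CacheDemo and the local[*] master are all assumptions made for illustration; they are not from the question. It marks rdd3 and rdd4 for caching, leaves rdd2 alone, and checks each RDD's storage level.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq("key1" -> 1, "key2" -> 2))
      val rdd2 = rdd1.mapValues(_ + 1)   // hypothetical stand-in for f1
      val rdd3 = rdd2.filter(_._2 > 0)   // hypothetical stand-in for f2
      val rdd4 = rdd3.mapValues(_ * 10)  // hypothetical stand-in for f3

      // Each RDD that should be kept must be marked on its own.
      // cache() only marks the RDD; nothing is computed yet.
      rdd3.cache()
      rdd4.cache()

      // The first action computes rdd4 (and rdd3 on the way) and stores both.
      val v1 = rdd4.lookup("key1")
      // The second action reads rdd4 from the cache instead of recomputing it.
      val v2 = rdd4.lookup("key2")

      assert(v1 == Seq(20))
      assert(v2 == Seq(30))

      // rdd2 was never marked, so caching rdd4 did not reach back to it.
      assert(rdd2.getStorageLevel == StorageLevel.NONE)
      assert(rdd3.getStorageLevel == StorageLevel.MEMORY_ONLY)
      assert(rdd4.getStorageLevel == StorageLevel.MEMORY_ONLY)
    } finally {
      sc.stop()
    }
  }
}
```

Note that the RDDs are bound with val here. With def, every reference to rdd4 would build a new RDD object, so the one that cache() was called on would not be the one the later lookups use.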
